From patchwork Mon Jul 10 22:32:53 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 701396 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0D61DC001B0 for ; Mon, 10 Jul 2023 22:34:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230391AbjGJWeB (ORCPT ); Mon, 10 Jul 2023 18:34:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58890 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230374AbjGJWd4 (ORCPT ); Mon, 10 Jul 2023 18:33:56 -0400 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EB59EE5F for ; Mon, 10 Jul 2023 15:33:49 -0700 (PDT) Received: by mail-yb1-xb4a.google.com with SMTP id 3f1490d57ef6-c83284edf0eso1866170276.3 for ; Mon, 10 Jul 2023 15:33:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028429; x=1691620429; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=K10CMsALExOmOcR/YTwAYoH6sOUce0sACyLlyAd3oCA=; b=PmjrhD360OT99Lc8iPVcJYoR6DDqdeqIAZ5+tJAhFT5ahoktvUJx3QBnBPzN/WyXln kbA5Ozqmx/gY8WIxPOw00lEwu6Irt1PeBniAsNqQA8De/a7Gmw+d3Ar8j+YKrZssZ8uA BuSorRLnfwVf9kLGmV78yHD3Id1Ztae6MVmUiqkzc97p1exlkKRRIKvklbv9OqJSUqmP Qk97r7IUXC5lwlf/gsOz/X4tKxwtoTHHXF4t3kR8h/ivJz2kuJv7koREvjKNpq8Fdh4o MhP81DlKHR9GRSfe72HCOY8i4aU5RoyUS+8Tp6rfoY7xD/75fsJWZLCBFtePU5WPjwxt ZKhA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028429; x=1691620429; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to 
:date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=K10CMsALExOmOcR/YTwAYoH6sOUce0sACyLlyAd3oCA=; b=Yyq8Sg4YkuGuYs20y83dpp66HBPB/vk0aorN53+nSMX6v9GVUOCvWT0vSkY2aMikka mZh1ahTL1Kxj1hGLCxTWArWnxSvV+tdLcnFv2r4ZWN/i1m28wk4erlR+Z0uYck4mg1mu IX6236H2XGR7IlsW2QKXQ/7TS8ZqC41jw8SvGRml9tIfQk8eEy0lwPbUrEELxqikyT5H ArqTdYHWGyYVFjbDlGrY4MGGhyAAL+t+WKG+udwRQeWl78RfEoKaDA94Yt1hPAbqYcZf LBlbgh3IDj9Bj8J5nkaMWiBNqFo1It9WYmFOUydoYtOZCZ6aiEqiaM1OlRaqsZs2Npll VIKA== X-Gm-Message-State: ABy/qLYxRbH2ImBjQjONqaHgXNhMqRX8MGMntZWBW6lvqLruOFgIbtwH q+bj72j5FSGPSZUpJnMWXcZvdce2eBj0nGmgpA== X-Google-Smtp-Source: APBJJlELfkUsONGQv1JgPO9H1X9zJoqx1LLA0iy04SlqfvxZOZRiNmkJzPsHKMMO+0eNN0K5dDAd6ysBuYoO3e4CPQ== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a25:1ed4:0:b0:c48:b822:36db with SMTP id e203-20020a251ed4000000b00c48b82236dbmr78934ybe.10.1689028429046; Mon, 10 Jul 2023 15:33:49 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:53 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-3-almasrymina@google.com> Subject: [RFC PATCH 02/10] dma-buf: add support for NET_RX pages From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org Cc: Mina Almasry , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Shuah Khan , jgg@ziepe.ca Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org Use the paged attachment mappings support to create NET_RX pages. 
NET_RX pages are pages that can be used in the networking receive path: Bind the pages to the driver's rx queues specified by the create_flags param, and create a gen_pool to hold the free pages available for the driver to allocate. Signed-off-by: Mina Almasry --- drivers/dma-buf/dma-buf.c | 174 +++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 20 ++++ include/linux/netdevice.h | 1 + include/uapi/linux/dma-buf.h | 2 + 4 files changed, 197 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 50b1d813cf5c..acb86bf406f4 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include @@ -1681,6 +1682,8 @@ static void dma_buf_pages_destroy(struct percpu_ref *ref) pci_dev_put(priv->pci_dev); } +const struct dma_buf_pages_type_ops net_rx_ops; + static long dma_buf_create_pages(struct file *file, struct dma_buf_create_pages_info *create_info) { @@ -1793,6 +1796,9 @@ static long dma_buf_create_pages(struct file *file, priv->create_flags = create_info->create_flags; switch (priv->type) { + case DMA_BUF_PAGES_NET_RX: + priv->type_ops = &net_rx_ops; + break; default: err = -EINVAL; goto out_put_new_file; @@ -1966,3 +1972,171 @@ static void __exit dma_buf_deinit(void) dma_buf_uninit_sysfs_statistics(); } __exitcall(dma_buf_deinit); + +/******************************** + * dma_buf_pages_net_rx * + ********************************/ + +void dma_buf_pages_net_rx_release(struct dma_buf_pages *priv, struct file *file) +{ + struct netdev_rx_queue *rxq; + unsigned long xa_idx; + + xa_for_each(&priv->net_rx.bound_rxq_list, xa_idx, rxq) + if (rxq->dmabuf_pages == file) + rxq->dmabuf_pages = NULL; +} + +static int dev_is_class(struct device *dev, void *class) +{ + if (dev->class != NULL && !strcmp(dev->class->name, class)) + return 1; + + return 0; +} + +int dma_buf_pages_net_rx_init(struct dma_buf_pages *priv, struct file *file) +{ + struct netdev_rx_queue *rxq; + 
struct net_device *netdev; + int xa_id, err, rxq_idx; + struct device *device; + + priv->net_rx.page_pool = + gen_pool_create(PAGE_SHIFT, dev_to_node(&priv->pci_dev->dev)); + + if (!priv->net_rx.page_pool) + return -ENOMEM; + + /* + * We start with PAGE_SIZE instead of 0 since gen_pool_alloc_*() returns + * NULL on error + */ + err = gen_pool_add_virt(priv->net_rx.page_pool, PAGE_SIZE, 0, + PAGE_SIZE * priv->num_pages, + dev_to_node(&priv->pci_dev->dev)); + if (err) + goto out_destroy_pool; + + xa_init_flags(&priv->net_rx.bound_rxq_list, XA_FLAGS_ALLOC); + + device = device_find_child(&priv->pci_dev->dev, "net", dev_is_class); + if (!device) { + err = -ENODEV; + goto out_destroy_xarray; + } + + netdev = to_net_dev(device); + if (!netdev) { + err = -ENODEV; + goto out_put_dev; + } + + for (rxq_idx = 0; rxq_idx < (sizeof(priv->create_flags) * 8); + rxq_idx++) { + if (!(priv->create_flags & (1ULL << rxq_idx))) + continue; + + if (rxq_idx >= netdev->num_rx_queues) { + err = -ERANGE; + goto out_release_rx; + } + + rxq = __netif_get_rx_queue(netdev, rxq_idx); + + err = xa_alloc(&priv->net_rx.bound_rxq_list, &xa_id, rxq, + xa_limit_32b, GFP_KERNEL); + if (err) + goto out_release_rx; + + /* We previously have done a dma_buf_attach(), which validates + * that the net_device we're trying to attach to can reach the + * dmabuf, so we don't need to check here as well. 
+ */ + rxq->dmabuf_pages = file; + } + put_device(device); + return 0; + +out_release_rx: + dma_buf_pages_net_rx_release(priv, file); +out_put_dev: + put_device(device); +out_destroy_xarray: + xa_destroy(&priv->net_rx.bound_rxq_list); +out_destroy_pool: + gen_pool_destroy(priv->net_rx.page_pool); + return err; +} + +void dma_buf_pages_net_rx_free(struct dma_buf_pages *priv) +{ + xa_destroy(&priv->net_rx.bound_rxq_list); + gen_pool_destroy(priv->net_rx.page_pool); +} + +static unsigned long dma_buf_page_to_gen_pool_addr(struct page *page) +{ + struct dma_buf_pages *priv; + struct dev_pagemap *pgmap; + unsigned long offset; + + pgmap = page->pgmap; + priv = container_of(pgmap, struct dma_buf_pages, pgmap); + offset = page - priv->pages; + /* Offset + 1 is due to the fact that we want to avoid 0 virt address + * returned from the gen_pool. The gen_pool returns 0 on error, and virt + * address 0 is indistinguishable from an error. + */ + return (offset + 1) << PAGE_SHIFT; +} + +static struct page * +dma_buf_gen_pool_addr_to_page(unsigned long addr, struct dma_buf_pages *priv) +{ + /* - 1 is due to the fact that we want to avoid 0 virt address + * returned from the gen_pool. See comment in dma_buf_create_pages() + * for details. 
+ */ + unsigned long offset = (addr >> PAGE_SHIFT) - 1; + return &priv->pages[offset]; +} + +void dma_buf_page_free_net_rx(struct dma_buf_pages *priv, struct page *page) +{ + unsigned long addr = dma_buf_page_to_gen_pool_addr(page); + + if (gen_pool_has_addr(priv->net_rx.page_pool, addr, PAGE_SIZE)) + gen_pool_free(priv->net_rx.page_pool, addr, PAGE_SIZE); +} + +const struct dma_buf_pages_type_ops net_rx_ops = { + .dma_buf_pages_init = dma_buf_pages_net_rx_init, + .dma_buf_pages_release = dma_buf_pages_net_rx_release, + .dma_buf_pages_destroy = dma_buf_pages_net_rx_free, + .dma_buf_page_free = dma_buf_page_free_net_rx, +}; + +struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv) +{ + unsigned long gen_pool_addr; + struct page *pg; + + if (!(priv->type & DMA_BUF_PAGES_NET_RX)) + return NULL; + + gen_pool_addr = gen_pool_alloc(priv->net_rx.page_pool, PAGE_SIZE); + if (!gen_pool_addr) + return NULL; + + if (!PAGE_ALIGNED(gen_pool_addr)) { + net_err_ratelimited("dmabuf page pool allocation not aligned"); + gen_pool_free(priv->net_rx.page_pool, gen_pool_addr, PAGE_SIZE); + return NULL; + } + + pg = dma_buf_gen_pool_addr_to_page(gen_pool_addr, priv); + + percpu_ref_get(&priv->pgmap.ref); + return pg; +} diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 5789006180ea..e8e66d6407d0 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -22,6 +22,9 @@ #include #include #include +#include +#include +#include struct device; struct dma_buf; @@ -552,6 +555,11 @@ struct dma_buf_pages_type_ops { struct page *page); }; +struct dma_buf_pages_net_rx { + struct gen_pool *page_pool; + struct xarray bound_rxq_list; +}; + struct dma_buf_pages { /* fields for dmabuf */ struct dma_buf *dmabuf; @@ -568,6 +576,10 @@ struct dma_buf_pages { unsigned int type; const struct dma_buf_pages_type_ops *type_ops; __u64 create_flags; + + union { + struct dma_buf_pages_net_rx net_rx; + }; }; /** @@ -671,6 +683,8 @@ static inline bool 
is_dma_buf_pages_file(struct file *file) return file->f_op == &dma_buf_pages_fops; } +struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv); + static inline bool is_dma_buf_page(struct page *page) { return (is_zone_device_page(page) && page->pgmap && @@ -718,6 +732,12 @@ static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg, { return 0; } + +static inline struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv) +{ + return NULL; +} + #endif diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c2f0c6002a84..7a087ffa9baa 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -796,6 +796,7 @@ struct netdev_rx_queue { #ifdef CONFIG_XDP_SOCKETS struct xsk_buff_pool *pool; #endif + struct file __rcu *dmabuf_pages; } ____cacheline_aligned_in_smp; /* diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h index d0f63a2ab7e4..b392cef9d3c6 100644 --- a/include/uapi/linux/dma-buf.h +++ b/include/uapi/linux/dma-buf.h @@ -186,6 +186,8 @@ struct dma_buf_create_pages_info { __u64 create_flags; }; +#define DMA_BUF_PAGES_NET_RX (1 << 0) + #define DMA_BUF_CREATE_PAGES _IOW(DMA_BUF_BASE, 4, struct dma_buf_create_pages_info) #endif From patchwork Mon Jul 10 22:32:55 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 701395 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9A227C001B0 for ; Mon, 10 Jul 2023 22:34:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230491AbjGJWeR (ORCPT ); Mon, 10 Jul 2023 18:34:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59164 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id 
S230345AbjGJWeJ (ORCPT ); Mon, 10 Jul 2023 18:34:09 -0400 Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BF88DE75 for ; Mon, 10 Jul 2023 15:33:56 -0700 (PDT) Received: by mail-yb1-xb4a.google.com with SMTP id 3f1490d57ef6-c118efd0c3cso4456657276.0 for ; Mon, 10 Jul 2023 15:33:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028436; x=1691620436; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=QDPrrkJolhYj8+JA2ecQJjxVsJNIQec1wNqvH72xEAE=; b=bliRqYBYA7SP+SPxK2IjcbatX7Omsm+v4X2DL5ulNehUSHXhYkbBCjCoPA4+lde/+r Mv4vn1NPHsq9zQaJ/BS5q4yOT28xnB+GR7b90+Msgrod+u9besvi+CHAzr3d2lqkB/AA yxX32+4hCEHv/rYkC3zNVkxpRlZhCActP39jEwWjryIEygF1XkF2wIN5uXGWgJII1qn3 4AzQQnHnn+AHSw/qKKc5jLJMbzVKVGyCcgjIEEOOjxsfJU/XS8Za+1qyJC6zoQp8SC9Z eZmHcI7zYkSKm3we5mSp3N6GMumLtl7F3sh7WJbaYkTjS79SpPYbISoFN9wtRlENHs+q GjoQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028436; x=1691620436; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=QDPrrkJolhYj8+JA2ecQJjxVsJNIQec1wNqvH72xEAE=; b=D4Ezio6qmh/VIyYXv+o0oTVACAnQ/iMhtiKhcuP0eSJm+LTzNd9Wk1i4xdcaH4rhYq 7DGMTnk5iY2kHKyjCsVruYDwFlyoOSG9ZE3qioV0b9N27wnnq+p+wGPDQyEfogilcoi5 dtfILkiMMC9C2SE8lEajwdagiLF5vfO3AERKdeTo+583aON7zAgHDPqOkniUsQautj+v uTwbNtgEWvpNaIXRBQT5WckJGoaPm+C4N0qgsiMn9OLTRIBN/mC3akXxAMADA3iIrcAk 8ErcL3Xd93i+NcfqimvKaMaUvG8uthwQe9Vl7sW7WDoutQuee+a/n6n0GxoH1Qb3zXl5 HL5g== X-Gm-Message-State: ABy/qLaXqa782NDt0ptO8GWzgJVd+w1VJQM/3Y9yHcNaEF6WxVTSADnH 8875HI7R3EzPp3+3KhR6B6A4OY2RmlJl24GJmw== X-Google-Smtp-Source: APBJJlG3bwEoPIkQ3o2GIl51Ddf3aoB7B54zxP7HixWCIvf2JZmkwqiiEqZGTtJ7VvQ3rxT2A7pGP8pckAgyDOt8qg== X-Received: from almasrymina.svl.corp.google.com 
([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a25:41ca:0:b0:c6b:c65d:6d02 with SMTP id o193-20020a2541ca000000b00c6bc65d6d02mr65909yba.9.1689028435810; Mon, 10 Jul 2023 15:33:55 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:55 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-5-almasrymina@google.com> Subject: [RFC PATCH 04/10] net: add support for skbs with unreadable frags From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org Cc: Mina Almasry , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Shuah Khan , jgg@ziepe.ca Precedence: bulk List-ID: X-Mailing-List: linux-media@vger.kernel.org

For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and inaccessible to the host. We expect there to be no mixing and matching of device memory frags (inaccessible) with host memory frags (accessible) in the same skb.

Add a skb->devmem flag which indicates whether the frags in this skb are device memory frags or not.

__skb_fill_page_desc() and skb_fill_page_desc_noacc() now check frags added to skbs for dmabuf pages, and set skb->devmem if the page is a device memory page.

Add checks throughout the network stack to avoid accessing the frags of devmem skbs and to avoid coalescing devmem skbs with non-devmem skbs.
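
For illustration only (this is not part of the patch): a minimal userspace sketch of the rule described above, assuming a simplified stand-in for struct sk_buff. Every model_* name is hypothetical and exists only in this sketch; the real checks are the ones added to include/linux/skbuff.h and net/core/skbuff.c in the diff below.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct sk_buff; not the kernel type. */
struct model_skb {
	bool devmem;           /* plays the role of the skb->devmem bit */
	unsigned int nr_frags; /* number of frags attached so far */
};

/*
 * Loosely mirrors the check added to __skb_fill_page_desc(): attaching a
 * dmabuf (device memory) page marks the whole skb as devmem.
 */
static void model_fill_page_desc(struct model_skb *skb, bool page_is_dma_buf)
{
	skb->nr_frags++;
	if (page_is_dma_buf)
		skb->devmem = true;
}

/* Same shape as skb_frags_not_readable() in the patch. */
static bool model_frags_not_readable(const struct model_skb *skb)
{
	return skb->devmem;
}

/*
 * Loosely mirrors the mismatch test added to skb_try_coalesce(): a devmem
 * skb and a host-memory skb are never merged with each other.
 */
static bool model_can_coalesce(const struct model_skb *to,
			       const struct model_skb *from)
{
	return model_frags_not_readable(to) == model_frags_not_readable(from);
}

/*
 * Loosely mirrors the frag path of skb_copy_bits(): the host cannot read
 * frag payload that lives in device memory, so the copy fails.
 */
static int model_copy_frag_bits(const struct model_skb *skb)
{
	return model_frags_not_readable(skb) ? -EFAULT : 0;
}
```

In this model, an skb that received one dmabuf page reports its frags as unreadable, cannot be coalesced with a host-memory skb, and fails any attempt to copy frag payload, while its (unmodelled) linear header area would remain readable by the host.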
Signed-off-by: Mina Almasry --- include/linux/skbuff.h | 15 +++++++++ include/net/tcp.h | 6 ++-- net/core/skbuff.c | 73 ++++++++++++++++++++++++++++++++++-------- net/ipv4/tcp.c | 3 ++ net/ipv4/tcp_input.c | 13 ++++++-- net/ipv4/tcp_output.c | 5 ++- net/packet/af_packet.c | 4 +-- 7 files changed, 97 insertions(+), 22 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0b40417457cd..f5e03aa84160 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -38,6 +38,7 @@ #endif #include #include +#include /** * DOC: skb checksums @@ -805,6 +806,8 @@ typedef unsigned char *sk_buff_data_t; * @csum_level: indicates the number of consecutive checksums found in * the packet minus one that have been verified as * CHECKSUM_UNNECESSARY (max 3) + * @devmem: indicates that all the fragments in this skb is backed by + * device memory. * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required @@ -992,6 +995,7 @@ struct sk_buff { __u8 csum_not_inet:1; #endif + __u8 devmem:1; #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ #endif @@ -1766,6 +1770,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb) __skb_zcopy_downgrade_managed(skb); } +/* Return true if frags in this skb are not readable by the host. 
*/ +static inline bool skb_frags_not_readable(const struct sk_buff *skb) +{ + return skb->devmem; +} + static inline void skb_mark_not_on_list(struct sk_buff *skb) { skb->next = NULL; @@ -2469,6 +2479,8 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i, page = compound_head(page); if (page_is_pfmemalloc(page)) skb->pfmemalloc = true; + if (is_dma_buf_page(page)) + skb->devmem = true; } /** @@ -2511,6 +2523,9 @@ static inline void skb_fill_page_desc_noacc(struct sk_buff *skb, int i, __skb_fill_page_desc_noacc(shinfo, i, page, off, size); shinfo->nr_frags = i + 1; + + if (is_dma_buf_page(page)) + skb->devmem = true; } void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off, diff --git a/include/net/tcp.h b/include/net/tcp.h index 5066e4586cf0..6d86ed3736ad 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -986,7 +986,7 @@ static inline int tcp_skb_mss(const struct sk_buff *skb) static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb) { - return likely(!TCP_SKB_CB(skb)->eor); + return likely(!TCP_SKB_CB(skb)->eor && !skb_frags_not_readable(skb)); } static inline bool tcp_skb_can_collapse(const struct sk_buff *to, @@ -994,7 +994,9 @@ static inline bool tcp_skb_can_collapse(const struct sk_buff *to, { return likely(tcp_skb_can_collapse_to(to) && mptcp_skb_can_collapse(to, from) && - skb_pure_zcopy_same(to, from)); + skb_pure_zcopy_same(to, from) && + skb_frags_not_readable(to) == + skb_frags_not_readable(from)); } /* Events passed to congestion control interface */ diff --git a/net/core/skbuff.c b/net/core/skbuff.c index cea28d30abb5..9b83da794641 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1191,11 +1191,16 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt) skb_frag_size(frag), p, p_off, p_len, copied) { seg_len = min_t(int, p_len, len); - vaddr = kmap_atomic(p); - print_hex_dump(level, "skb frag: ", - DUMP_PREFIX_OFFSET, - 16, 1, vaddr + p_off, seg_len, false); - 
kunmap_atomic(vaddr); + if (!is_dma_buf_page(p)) { + vaddr = kmap_atomic(p); + print_hex_dump(level, "skb frag: ", + DUMP_PREFIX_OFFSET, 16, 1, + vaddr + p_off, seg_len, false); + kunmap_atomic(vaddr); + } else { + printk("%sskb frag: devmem", level); + } + len -= seg_len; if (!len) break; @@ -1764,6 +1769,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) if (skb_shared(skb) || skb_unclone(skb, gfp_mask)) return -EINVAL; + if (skb_frags_not_readable(skb)) + return -EFAULT; + if (!num_frags) goto release; @@ -1934,8 +1942,10 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask) { int headerlen = skb_headroom(skb); unsigned int size = skb_end_offset(skb) + skb->data_len; - struct sk_buff *n = __alloc_skb(size, gfp_mask, - skb_alloc_rx_flag(skb), NUMA_NO_NODE); + struct sk_buff *n = skb_frags_not_readable(skb) ? NULL : + __alloc_skb(size, gfp_mask, + skb_alloc_rx_flag(skb), + NUMA_NO_NODE); if (!n) return NULL; @@ -2266,9 +2276,10 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb, /* * Allocate the copy buffer */ - struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom, - gfp_mask, skb_alloc_rx_flag(skb), - NUMA_NO_NODE); + struct sk_buff *n = skb_frags_not_readable(skb) ? NULL : + __alloc_skb(newheadroom + skb->len + newtailroom, + gfp_mask, skb_alloc_rx_flag(skb), + NUMA_NO_NODE); int oldheadroom = skb_headroom(skb); int head_copy_len, head_copy_off; @@ -2609,6 +2620,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta) */ int i, k, eat = (skb->tail + delta) - skb->end; + if (skb_frags_not_readable(skb)) + return NULL; + if (eat > 0 || skb_cloned(skb)) { if (pskb_expand_head(skb, 0, eat > 0 ? 
eat + 128 : 0, GFP_ATOMIC)) @@ -2762,6 +2776,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len) to += copy; } + if (skb_frags_not_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *f = &skb_shinfo(skb)->frags[i]; @@ -2835,7 +2852,7 @@ static struct page *linear_to_page(struct page *page, unsigned int *len, { struct page_frag *pfrag = sk_page_frag(sk); - if (!sk_page_frag_refill(sk, pfrag)) + if (!sk_page_frag_refill(sk, pfrag) || is_dma_buf_page(pfrag->page)) return NULL; *len = min_t(unsigned int, *len, pfrag->size - pfrag->offset); @@ -3164,6 +3181,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len) from += copy; } + if (skb_frags_not_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; int end; @@ -3243,6 +3263,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len, pos = copy; } + if (skb_frags_not_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; @@ -3343,6 +3366,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, pos = copy; } + if (skb_frags_not_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; @@ -3800,7 +3826,9 @@ static inline void skb_split_inside_header(struct sk_buff *skb, skb_shinfo(skb1)->frags[i] = skb_shinfo(skb)->frags[i]; skb_shinfo(skb1)->nr_frags = skb_shinfo(skb)->nr_frags; + skb1->devmem = skb->devmem; skb_shinfo(skb)->nr_frags = 0; + skb->devmem = 0; skb1->data_len = skb->data_len; skb1->len += skb1->data_len; skb->data_len = 0; @@ -3814,11 +3842,13 @@ static inline void skb_split_no_header(struct sk_buff *skb, { int i, k = 0; const int nfrags = skb_shinfo(skb)->nr_frags; + const int devmem = skb->devmem; skb_shinfo(skb)->nr_frags = 0; skb1->len = skb1->data_len = skb->len - len; skb->len = len; 
skb->data_len = len - pos; + skb->devmem = skb1->devmem = 0; for (i = 0; i < nfrags; i++) { int size = skb_frag_size(&skb_shinfo(skb)->frags[i]); @@ -3847,6 +3877,12 @@ static inline void skb_split_no_header(struct sk_buff *skb, pos += size; } skb_shinfo(skb1)->nr_frags = k; + + if (skb_shinfo(skb)->nr_frags) + skb->devmem = devmem; + + if (skb_shinfo(skb1)->nr_frags) + skb1->devmem = devmem; } /** @@ -4082,6 +4118,9 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data, return block_limit - abs_offset; } + if (skb_frags_not_readable(st->cur_skb)) + return 0; + if (st->frag_idx == 0 && !st->frag_data) st->stepped_offset += skb_headlen(st->cur_skb); @@ -5681,7 +5720,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, (from->pp_recycle && skb_cloned(from))) return false; - if (len <= skb_tailroom(to)) { + if (skb_frags_not_readable(from) != skb_frags_not_readable(to)) + return false; + + if (len <= skb_tailroom(to) && !skb_frags_not_readable(from)) { if (len) BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len)); *delta_truesize = 0; @@ -5997,6 +6039,9 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len) if (!pskb_may_pull(skb, write_len)) return -ENOMEM; + if (skb_frags_not_readable(skb)) + return -EFAULT; + if (!skb_cloned(skb) || skb_clone_writable(skb, write_len)) return 0; @@ -6656,8 +6701,8 @@ EXPORT_SYMBOL(pskb_extract); void skb_condense(struct sk_buff *skb) { if (skb->data_len) { - if (skb->data_len > skb->end - skb->tail || - skb_cloned(skb)) + if (skb->data_len > skb->end - skb->tail || skb_cloned(skb) || + skb_frags_not_readable(skb)) return; /* Nice, we can free page frag(s) right now */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8d20d9221238..51e8d5872670 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4520,6 +4520,9 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp, if (crypto_ahash_update(req)) return 1; + if (skb_frags_not_readable(skb)) + return 1; + for (i = 0; i < shi->nr_frags; 
++i) { const skb_frag_t *f = &shi->frags[i]; unsigned int offset = skb_frag_off(f); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index bf8b22218dd4..8d28d96a3c24 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5188,6 +5188,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) { n = tcp_skb_next(skb, list); + if (skb_frags_not_readable(skb)) + goto skip_this; + /* No new bits? It is possible on ofo queue. */ if (!before(start, TCP_SKB_CB(skb)->end_seq)) { skb = tcp_collapse_one(sk, skb, list, root); @@ -5208,17 +5211,20 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, break; } - if (n && n != tail && mptcp_skb_can_collapse(skb, n) && + if (n && n != tail && !skb_frags_not_readable(n) && + mptcp_skb_can_collapse(skb, n) && TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) { end_of_skbs = false; break; } +skip_this: /* Decided to skip this, advance start seq. 
*/ start = TCP_SKB_CB(skb)->end_seq; } if (end_of_skbs || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + skb_frags_not_readable(skb)) return; __skb_queue_head_init(&tmp); @@ -5262,7 +5268,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, if (!skb || skb == tail || !mptcp_skb_can_collapse(nskb, skb) || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + skb_frags_not_readable(skb)) goto end; #ifdef CONFIG_TLS_DEVICE if (skb->decrypted != nskb->decrypted) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index cfe128b81a01..eddade864c7f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2310,7 +2310,8 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len) if (unlikely(TCP_SKB_CB(skb)->eor) || tcp_has_tx_tstamp(skb) || - !skb_pure_zcopy_same(skb, next)) + !skb_pure_zcopy_same(skb, next) || + skb->devmem != next->devmem) return false; len -= skb->len; @@ -3087,6 +3088,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb) return false; if (skb_cloned(skb)) return false; + if (skb_frags_not_readable(skb)) + return false; /* Some heuristics for collapsing over SACK'd could be invented */ if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) return false; diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index a2dbeb264f26..9b31f688163c 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len; res = run_filter(skb, sk, snaplen); if (!res) @@ -2275,7 +2275,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb_frags_not_readable(skb) ? 
skb_headlen(skb) : skb->len; res = run_filter(skb, sk, snaplen); if (!res) From patchwork Mon Jul 10 22:32:57 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 701394 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CD55FC001B0 for ; Mon, 10 Jul 2023 22:34:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230478AbjGJWel (ORCPT ); Mon, 10 Jul 2023 18:34:41 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59534 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230291AbjGJWee (ORCPT ); Mon, 10 Jul 2023 18:34:34 -0400 Received: from mail-yw1-x114a.google.com (mail-yw1-x114a.google.com [IPv6:2607:f8b0:4864:20::114a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id ED2AFE51 for ; Mon, 10 Jul 2023 15:34:07 -0700 (PDT) Received: by mail-yw1-x114a.google.com with SMTP id 00721157ae682-565a33c35b1so59603177b3.0 for ; Mon, 10 Jul 2023 15:34:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689028447; x=1691620447; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=eMznknX6LhCZzxkpGK1anYqC2AReGQzyhNKGt0dhRMs=; b=BtdvUbd29zwhezN9vWnqcJKpf0bV2LKN/+c8VWC9oiyT99FFcNEJDOvUA5o+m21DRJ e9aNejKksk1rdwKl8O1MZ9yP/O1OkMbpULehfNB94qYTmzji2HykYi+osFgOuhdAF+qD KGUZyMgcGBl8J91gT7Vb/XlYX07mdXJEBkEmKhCq++SE1RIb3ka/J0r4X0WZmUp1LyEo SjYNX6E3UyUwTUHmBfJ7fNPviqSEF3jRznivgjuuqoMmWVaZBvnRoFgKdOZiBTRle0wA 2DGPZWSVDlrn2z8oZhGDAjAlxNPaIc+/JHcTW0CD86xRjmhJfySPnG/Os836KYedkH0U WpvA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689028447; x=1691620447; 
h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=eMznknX6LhCZzxkpGK1anYqC2AReGQzyhNKGt0dhRMs=; b=F6KS2iSj/fR2Lk6qMT1w0wSLMAANqrRLXR/KEpANLzRbK43VgBj0wpyFgQuVTRxHbb r8K01K08HLZ5Vc+bFJaHVInQNyKwZHX18Xw4pci/hErM5ii3oyCXO5x1nQUcYSUylfSR cORrxsMpfM9DYqb/NoehG1uvQy6tHK1w4bzk1oQ2fCBrss7wGLGR/t8ynrl1TnpX8/UA g2pIpledrO+zFiGU9VSf1ujXOos4Bwz+emONMub+q7oQkt6XjeA4KVaTv0x01MejbAzU pGRf5OV09lJrWIDQURwH+7drgpk25wSd+4Vv0UWHK6uTaWfYWfgwWyNVPu/zmBD8dWSg Xk6g== X-Gm-Message-State: ABy/qLaAYrIHxJN/AwKxCVkGBvAdyp4/TU5Q6xNVdxxw/VRKhS5lW/PB Lwnut3MC6Nujso6NZ4NJH0FT7/CzJdEofzX5/A== X-Google-Smtp-Source: APBJJlHZxtTyAip5d5fJgZLpVb+rUoGbj/Pap2HolJ5oEOujITpsR+hl7i9eHYLZIaU+exV6ykL+cfJhHKzieUdutg== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:4c0f:bfb6:9942:8c53]) (user=almasrymina job=sendgmr) by 2002:a81:e90b:0:b0:573:285a:c2a3 with SMTP id d11-20020a81e90b000000b00573285ac2a3mr95478ywm.1.1689028446916; Mon, 10 Jul 2023 15:34:06 -0700 (PDT) Date: Mon, 10 Jul 2023 15:32:57 -0700 In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com> Mime-Version: 1.0 References: <20230710223304.1174642-1-almasrymina@google.com> X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog Message-ID: <20230710223304.1174642-7-almasrymina@google.com> Subject: [RFC PATCH 06/10] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages From: Mina Almasry To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org Cc: Mina Almasry , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , "David S. 
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Shuah Khan , jgg@ziepe.ca
Precedence: bulk
List-ID:
X-Mailing-List: linux-media@vger.kernel.org

Add an interface for the user to notify the kernel that it is done reading
the NET_RX dmabuf pages returned as cmsg. The kernel will drop the
reference on the NET_RX pages to make them available for re-use.

Signed-off-by: Mina Almasry
---
 include/uapi/asm-generic/socket.h |  1 +
 include/uapi/linux/uio.h          |  4 +++
 net/core/sock.c                   | 41 +++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 88f9234f78cb..2a5a7f5da358 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -132,6 +132,7 @@
 #define SO_RCVMARK 75
+#define SO_DEVMEM_DONTNEED 97
 #define SO_DEVMEM_HEADER 98
 #define SCM_DEVMEM_HEADER SO_DEVMEM_HEADER
 #define SO_DEVMEM_OFFSET 99
diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
index 8b0be0f50838..faaa765fd5a4 100644
--- a/include/uapi/linux/uio.h
+++ b/include/uapi/linux/uio.h
@@ -26,6 +26,10 @@ struct cmsg_devmem {
 	__u32 frag_token;
 };
+struct devmemtoken {
+	__u32 token_start;
+	__u32 token_count;
+};
 /*
  * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1)
  */
diff --git a/net/core/sock.c b/net/core/sock.c
index 24f2761bdb1d..f9b9d9ec7322 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1531,7 +1531,48 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
 		/* Paired with READ_ONCE() in tcp_rtx_synack() */
 		WRITE_ONCE(sk->sk_txrehash, (u8)val);
 		break;
+	case SO_DEVMEM_DONTNEED: {
+		struct devmemtoken tokens[128];
+		unsigned int num_tokens, i, j;
+		if (sk->sk_type != SOCK_STREAM ||
+		    sk->sk_protocol != IPPROTO_TCP) {
+			ret = -EBADF;
+			break;
+		}
+
+		if (optlen % sizeof(struct devmemtoken) ||
+		    optlen > sizeof(tokens)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		num_tokens = optlen / sizeof(struct devmemtoken);
+		if (copy_from_sockptr(tokens, optval, optlen)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		ret = 0;
+
+		for (i = 0; i < num_tokens; i++) {
+			for (j = 0; j < tokens[i].token_count; j++) {
+				struct page *pg = xa_erase(&sk->sk_pagepool,
+							   tokens[i].token_start + j);
+
+				if (pg)
+					put_page(pg);
+				else
+					/* -EINTR here notifies the userspace
+					 * that not all tokens passed to it have
+					 * been freed.
+					 */
+					ret = -EINTR;
+			}
+		}
+
+		break;
+	}
 	default:
 		ret = -ENOPROTOOPT;
 		break;

From patchwork Mon Jul 10 22:33:00 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 701393
Date: Mon, 10 Jul 2023 15:33:00 -0700
In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com>
Mime-Version: 1.0
References: <20230710223304.1174642-1-almasrymina@google.com>
X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog
Message-ID: <20230710223304.1174642-10-almasrymina@google.com>
Subject: [RFC PATCH 09/10] memory-provider: updates core provider API for devmem TCP
From: Mina Almasry
To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org
Cc: Mina Almasry , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Shuah Khan , jgg@ziepe.ca
Precedence: bulk
List-ID:
X-Mailing-List: linux-media@vger.kernel.org

Implement a few updates to Jakub's RFC memory provider API to make it
suitable for device memory TCP:

1. Currently for devmem TCP the driver's netdev_rx_queue holds a reference to
   the dma_buf_pages struct and needs to pass that to the page_pool's memory
   provider somehow. For PoC purposes, create a pp->mp_priv field that is set
   by the driver. Likely needs a better API (likely dependent on the general
   memory provider API).

2. The current memory_provider API gives the memory_provider the option to
   override put_page(), but still calls page_pool_clear_pp_info() after the
   memory provider has released the page. IMO if the page freeing is
   delegated to the provider then the page_pool should not modify the page
   after release_page() has been called.

Signed-off-by: Mina Almasry
---
 include/net/page_pool.h | 1 +
 net/core/page_pool.c    | 7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 364fe6924258..7b6668479baf 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -78,6 +78,7 @@ struct page_pool_params {
 	struct device *dev; /* device, for DMA pre-mapping purposes */
 	struct napi_struct *napi; /* Sole consumer of pages, otherwise NULL */
 	u8 memory_provider; /* haaacks! should be user-facing */
+	void *mp_priv; /* argument to pass to the memory provider */
 	enum dma_data_direction dma_dir; /* DMA mapping direction */
 	unsigned int max_len; /* max DMA sync memory size */
 	unsigned int offset; /* DMA addr offset */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index d50f6728e4f6..df3f431fcff3 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -241,6 +241,7 @@ static int page_pool_init(struct page_pool *pool,
 		goto free_ptr_ring;
 	}
+	pool->mp_priv = pool->p.mp_priv;
 	if (pool->mp_ops) {
 		err = pool->mp_ops->init(pool);
 		if (err) {
@@ -564,16 +565,16 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
 	else
 		__page_pool_release_page_dma(pool, page);
-	page_pool_clear_pp_info(page);
-
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
 	 */
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
-	if (put)
+	if (put) {
+		page_pool_clear_pp_info(page);
 		put_page(page);
+	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
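
The ordering change in page_pool_return_page() can be illustrated outside the
kernel. The following is only a userspace sketch under stated assumptions:
mock_page, mock_pool, mock_mp_ops and every mock_* helper are hypothetical
stand-ins invented for illustration, not kernel types or functions. It models
a provider whose release_page() frees the page itself and returns false, and
counts how often the pool writes to the page after that point.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel layout. */
struct mock_page {
	unsigned long pp_magic;
	void *pp;
	int refcount;
	bool freed;
	int touched_after_free;	/* writes that happened after the free */
};

struct mock_pool;

struct mock_mp_ops {
	/* Returns true if the pool should still put the page itself. */
	bool (*release_page)(struct mock_pool *pool, struct mock_page *page);
};

struct mock_pool {
	const struct mock_mp_ops *mp_ops;
	void *mp_priv;	/* opaque provider argument, as in the patch */
};

/* Models page_pool_clear_pp_info(): it writes to the page. */
static void mock_clear_pp_info(struct mock_page *page)
{
	if (page->freed)
		page->touched_after_free++;
	page->pp_magic = 0;
	page->pp = NULL;
}

static void mock_put_page(struct mock_page *page)
{
	if (--page->refcount == 0)
		page->freed = true;
}

/* A provider that takes over freeing and tells the pool to keep off. */
static bool provider_release(struct mock_pool *pool, struct mock_page *page)
{
	(void)pool;
	mock_clear_pp_info(page);
	mock_put_page(page);
	return false;
}

/* Old ordering: pp info is cleared unconditionally after release_page(). */
static void return_page_old(struct mock_pool *pool, struct mock_page *page)
{
	bool put = pool->mp_ops->release_page(pool, page);

	mock_clear_pp_info(page);
	if (put)
		mock_put_page(page);
}

/* New ordering: the pool only touches the page when it also frees it. */
static void return_page_new(struct mock_pool *pool, struct mock_page *page)
{
	bool put = pool->mp_ops->release_page(pool, page);

	if (put) {
		mock_clear_pp_info(page);
		mock_put_page(page);
	}
}
```

With provider_release() installed, return_page_old() records one write to an
already freed page, while return_page_new() records none. The real function
also handles DMA unmapping and release accounting, which this sketch leaves
out.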
From patchwork Mon Jul 10 22:33:01 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 701392
Date: Mon, 10 Jul 2023 15:33:01 -0700
In-Reply-To: <20230710223304.1174642-1-almasrymina@google.com>
Mime-Version: 1.0
References: <20230710223304.1174642-1-almasrymina@google.com>
X-Mailer: git-send-email 2.41.0.390.g38632f3daf-goog
Message-ID: <20230710223304.1174642-11-almasrymina@google.com>
Subject: [RFC PATCH 10/10] memory-provider: add dmabuf devmem provider
From: Mina Almasry
To: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, netdev@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org
Cc: Mina Almasry , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , "David S.
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jesper Dangaard Brouer , Ilias Apalodimas , Arnd Bergmann , David Ahern , Willem de Bruijn , Shuah Khan , jgg@ziepe.ca
Precedence: bulk
List-ID:
X-Mailing-List: linux-media@vger.kernel.org

Use Jakub's memory provider PoC API:

https://github.com/kuba-moo/linux/tree/pp-providers

to implement a dmabuf devmem memory provider. The provider allocates NET_RX
dmabuf pages to the page pool. This abstracts any custom memory allocation
or freeing changes for devmem TCP from drivers using the page pool.

The memory provider allocates NET_RX pages from the dmabuf pages provided by
the driver. These pages are ZONE_DEVICE pages with the sg dma_addrs stored
in the zone_device_data entry in the page. The page pool entries in struct
page are in a union with the ZONE_DEVICE entries, and - without special
handling - the page pool would accidentally overwrite the data in the
ZONE_DEVICE fields.

To solve this, the memory provider converts the page from a ZONE_DEVICE
page to a ZONE_NORMAL page upon giving it to the page pool, and converts it
back to a ZONE_DEVICE page upon getting it back from the page pool. This is
safe to do because the NET_RX pages are dmabuf pages created to hold the
dma_addr in the dma_buf_map_attachment() sg_table entries, and are only used
with code that handles them specifically.

However, since dmabuf pages can now also be page pool pages, we need to
update 2 places to detect this correctly:

1. is_dma_buf_page() needs to be updated to correctly detect dmabuf pages
   after they've been inserted into the pool.

2. dma_buf_page_to_dma_addr() needs to be updated. For page pool pages, the
   dma_addr exists in page->dma_addr. For non page pool pages, the dma_addr
   exists in page->zone_device_data.

Signed-off-by: Mina Almasry
---
 include/linux/dma-buf.h |  29 ++++++++++-
 include/net/page_pool.h |  20 ++++++++
 net/core/page_pool.c    | 104 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 143 insertions(+), 10 deletions(-)

diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 93228a2fec47..896359fa998d 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -692,15 +692,26 @@ static inline bool is_dma_buf_pages_file(struct file *file)
 struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv);
+static inline bool is_dma_buf_page_net_rx(struct page *page)
+{
+	struct dma_buf_pages *priv;
+
+	return (is_page_pool_page(page) && (priv = page->pp->mp_priv) &&
+		priv->pgmap.ops == &dma_buf_pgmap_ops);
+}
+
 static inline bool is_dma_buf_page(struct page *page)
 {
 	return (is_zone_device_page(page) && page->pgmap &&
-		page->pgmap->ops == &dma_buf_pgmap_ops);
+		page->pgmap->ops == &dma_buf_pgmap_ops) ||
+	       is_dma_buf_page_net_rx(page);
 }
 static inline dma_addr_t dma_buf_page_to_dma_addr(struct page *page)
 {
-	return (dma_addr_t)page->zone_device_data;
+	return is_dma_buf_page_net_rx(page) ?
+		       (dma_addr_t)page->dma_addr :
+		       (dma_addr_t)page->zone_device_data;
 }
 static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg,
@@ -718,6 +729,16 @@ static inline int dma_buf_map_sg(struct device *dev, struct scatterlist *sg,
 	return nents;
 }
+
+static inline bool is_dma_buf_pages_priv(void *ptr)
+{
+	struct dma_buf_pages *priv = (struct dma_buf_pages *)ptr;
+
+	if (!priv || priv->pgmap.ops != &dma_buf_pgmap_ops)
+		return false;
+
+	return true;
+}
 #else
 static inline bool is_dma_buf_page(struct page *page)
 {
@@ -745,6 +766,10 @@ static inline struct page *dma_buf_pages_net_rx_alloc(struct dma_buf_pages *priv
 	return NULL;
 }
+static inline bool is_dma_buf_pages_priv(void *ptr)
+{
+	return false;
+}
 #endif
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 7b6668479baf..a57757a13cc8 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -157,6 +157,7 @@ enum pp_memory_provider_type {
 	PP_MP_HUGE_SPLIT, /* 2MB, online page alloc */
 	PP_MP_HUGE, /* 2MB, all memory pre-allocated */
 	PP_MP_HUGE_1G, /* 1G pages, MEP, pre-allocated */
+	PP_MP_DMABUF_DEVMEM, /* dmabuf devmem provider */
 };
 struct pp_memory_provider_ops {
@@ -170,6 +171,7 @@ extern const struct pp_memory_provider_ops basic_ops;
 extern const struct pp_memory_provider_ops hugesp_ops;
 extern const struct pp_memory_provider_ops huge_ops;
 extern const struct pp_memory_provider_ops huge_1g_ops;
+extern const struct pp_memory_provider_ops dmabuf_devmem_ops;
 struct page_pool {
 	struct page_pool_params p;
@@ -420,4 +422,22 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
 		page_pool_update_nid(pool, new_nid);
 }
+static inline bool is_page_pool_page(struct page *page)
+{
+	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
+	 * in order to preserve any existing bits, such as bit 0 for the
+	 * head page of compound page and bit 1 for pfmemalloc page, so
+	 * mask those bits for freeing side when doing below checking,
+	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
+	 * to avoid recycling the pfmemalloc page.
+	 */
+	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+		return false;
+
+	if (!page->pp)
+		return false;
+
+	return true;
+}
+
 #endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index df3f431fcff3..e626d4e309c1 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -236,6 +236,9 @@ static int page_pool_init(struct page_pool *pool,
 	case PP_MP_HUGE_1G:
 		pool->mp_ops = &huge_1g_ops;
 		break;
+	case PP_MP_DMABUF_DEVMEM:
+		pool->mp_ops = &dmabuf_devmem_ops;
+		break;
 	default:
 		err = -EINVAL;
 		goto free_ptr_ring;
@@ -975,14 +978,7 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
 	page = compound_head(page);
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+	if (!is_page_pool_page(page))
 		return false;
 	pp = page->pp;
@@ -1538,3 +1534,95 @@ const struct pp_memory_provider_ops huge_1g_ops = {
 	.alloc_pages = mp_huge_1g_alloc_pages,
 	.release_page = mp_huge_1g_release,
 };
+
+/*** "Dmabuf devmem page" ***/
+
+/* Dmabuf devmem memory provider allocates DMA_BUF_PAGES_NET_RX pages which are
+ * backing the dma_buf_map_attachment() from the NIC to the device memory.
+ *
+ * These pages are wrappers around the dma_addr of the sg entries in the
+ * sg_table returned from dma_buf_map_attachment(). They can be passed to the
+ * networking stack, which will generate devmem skbs from them and process them
+ * correctly.
+ */
+static int mp_dmabuf_devmem_init(struct page_pool *pool)
+{
+	struct dma_buf_pages *priv;
+
+	priv = pool->mp_priv;
+	if (!is_dma_buf_pages_priv(priv))
+		return -EINVAL;
+
+	return 0;
+}
+
+static void mp_dmabuf_devmem_destroy(struct page_pool *pool)
+{
+}
+
+static struct page *mp_dmabuf_devmem_alloc_pages(struct page_pool *pool,
+						 gfp_t gfp)
+{
+	struct dma_buf_pages *priv = pool->mp_priv;
+	dma_addr_t dma_addr;
+	struct page *page;
+
+	page = dma_buf_pages_net_rx_alloc(priv);
+	if (!page)
+		return page;
+
+	/* It shouldn't be possible for the allocation to give us a page not
+	 * belonging to this page_pool's pgmap.
+	 */
+	BUG_ON(page->pgmap != &priv->pgmap);
+
+	/* netdev_rxq_alloc_dma_buf_page() allocates a ZONE_DEVICE page.
+	 * Prepare to convert it into a page_pool page. We need to hold pgmap
+	 * and zone_device_data (which holds the dma_addr).
+	 *
+	 * DMA_BUF_PAGES_NET_RX are dmabuf pages created specifically to wrap
+	 * the dma_addr of the sg_table into a struct page. These pages are
+	 * used by code specifically equipped to handle them, so this
+	 * conversion from ZONE_DEVICE page to page pool page should be safe.
+	 */
+	dma_addr = (dma_addr_t)page->zone_device_data;
+
+	set_page_zone(page, ZONE_NORMAL);
+	page->pp_magic = 0;
+	page_pool_set_pp_info(pool, page);
+
+	page->dma_addr = dma_addr;
+
+	return page;
+}
+
+static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
+					  struct page *page)
+{
+	struct dma_buf_pages *priv = pool->mp_priv;
+	unsigned long dma_addr = page->dma_addr;
+
+	page_pool_clear_pp_info(page);
+
+	/* As the page pool releases the page, restore it back to a ZONE_DEVICE
+	 * page so it gets freed according to the
+	 * page->pgmap->ops->page_free().
+	 */
+	set_page_zone(page, ZONE_DEVICE);
+	page->zone_device_data = (void *)dma_addr;
+	page->pgmap = &priv->pgmap;
+	put_page(page);
+
+	/* Return false here as we don't want the page pool touching the page
+	 * after it's released to us.
+	 */
+	return false;
+}
+
+const struct pp_memory_provider_ops dmabuf_devmem_ops = {
+	.init = mp_dmabuf_devmem_init,
+	.destroy = mp_dmabuf_devmem_destroy,
+	.alloc_pages = mp_dmabuf_devmem_alloc_pages,
+	.release_page = mp_dmabuf_devmem_release_page,
+};
+EXPORT_SYMBOL(dmabuf_devmem_ops);
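
The reason the provider saves the dma_addr before converting the page can be
shown with a small userspace model. This is a sketch only, assuming C11
anonymous structs: mock_page, MOCK_PP_SIGNATURE and the mock_* helpers are
hypothetical names, and the layout is a simplification of the real struct
page union in which the page pool fields and the ZONE_DEVICE fields share
storage.

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t mock_dma_addr_t;

#define MOCK_PP_SIGNATURE 0x40UL	/* arbitrary marker for this mock */

/* Hypothetical stand-in for struct page: both views share storage. */
struct mock_page {
	union {
		struct {	/* page pool view */
			unsigned long pp_magic;
			void *pp;
			mock_dma_addr_t dma_addr;
		};
		struct {	/* ZONE_DEVICE view */
			void *pgmap;
			void *zone_device_data;
		};
	};
};

/* Mirrors the idea of is_page_pool_page(): signature set and pp non-NULL. */
static int mock_is_pool_page(const struct mock_page *page)
{
	return (page->pp_magic & ~0x3UL) == MOCK_PP_SIGNATURE && page->pp;
}

/* ZONE_DEVICE page -> page pool page. */
static void mock_give_to_pool(struct mock_page *page, void *pool)
{
	/* Read the address out before any pool field is written. */
	mock_dma_addr_t dma_addr = (mock_dma_addr_t)page->zone_device_data;

	page->pp_magic = MOCK_PP_SIGNATURE;	/* overwrites pgmap */
	page->pp = pool;			/* overwrites zone_device_data */
	page->dma_addr = dma_addr;
}

/* Page pool page -> ZONE_DEVICE page. */
static void mock_take_from_pool(struct mock_page *page, void *pgmap)
{
	mock_dma_addr_t dma_addr = page->dma_addr;

	page->pgmap = pgmap;			/* overwrites pp_magic */
	page->zone_device_data = (void *)dma_addr;	/* overwrites pp */
}

/* Same selection as the patched dma_buf_page_to_dma_addr(). */
static mock_dma_addr_t mock_page_to_dma_addr(const struct mock_page *page)
{
	return mock_is_pool_page(page) ?
		       page->dma_addr :
		       (mock_dma_addr_t)page->zone_device_data;
}
```

In this model, storing the pool pointer clobbers zone_device_data, so the
address survives the round trip only because it is copied into dma_addr
first and copied back on release. The real conversion also changes the page
zone and reference handling, which the sketch does not attempt to model.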