[net-next,v3,0/5] page_pool: recycle buffers

Message ID	20210409223801.104657-1-mcroce@linux.microsoft.com
Headers	show Return-Path: <netdev-owner@kernel.org> From: Matteo Croce <mcroce@linux.microsoft.com> To: netdev@vger.kernel.org, linux-mm@kvack.org Cc: Ayush Sawal <ayush.sawal@chelsio.com>, Vinay Kumar Yadav <vinay.yadav@chelsio.com>, Rohit Maheshwari <rohitm@chelsio.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Thomas Petazzoni <thomas.petazzoni@bootlin.com>, Marcin Wojtas <mw@semihalf.com>, Russell King <linux@armlinux.org.uk>, Mirko Lindner <mlindner@marvell.com>, Stephen Hemminger <stephen@networkplumber.org>, Tariq Toukan <tariqt@nvidia.com>, Jesper Dangaard Brouer <hawk@kernel.org>, Ilias Apalodimas <ilias.apalodimas@linaro.org>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, John Fastabend <john.fastabend@gmail.com>, Boris Pismenny <borisp@nvidia.com>, Arnd Bergmann <arnd@arndb.de>, Andrew Morton <akpm@linux-foundation.org>, "Peter Zijlstra (Intel)" <peterz@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, Yu Zhao <yuzhao@google.com>, Will Deacon <will@kernel.org>, Michel Lespinasse <walken@google.com>, Fenghua Yu <fenghua.yu@intel.com>, Roman Gushchin <guro@fb.com>, Hugh Dickins <hughd@google.com>, Peter Xu <peterx@redhat.com>, Jason Gunthorpe <jgg@ziepe.ca>, Guoqing Jiang <guoqing.jiang@cloud.ionos.com>, Jonathan Lemon <jonathan.lemon@gmail.com>, Alexander Lobakin <alobakin@pm.me>, Cong Wang <cong.wang@bytedance.com>, wenxu <wenxu@ucloud.cn>, Kevin Hao <haokexin@gmail.com>, Aleksandr Nogikh <nogikh@google.com>, Jakub Sitnicki <jakub@cloudflare.com>, Marco Elver <elver@google.com>, Willem de Bruijn <willemb@google.com>, Miaohe Lin <linmiaohe@huawei.com>, Yunsheng Lin <linyunsheng@huawei.com>, Guillaume Nault <gnault@redhat.com>, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, bpf@vger.kernel.org, Matthew Wilcox <willy@infradead.org>, Eric Dumazet <edumazet@google.com>, David Ahern <dsahern@gmail.com>, Lorenzo Bianconi <lorenzo@kernel.org>, Saeed Mahameed <saeedm@nvidia.com>, Andrew Lunn <andrew@lunn.ch>, Paolo Abeni <pabeni@redhat.com> Subject: [PATCH net-next v3 0/5] page_pool: recycle buffers Date: Sat, 10 Apr 2021 00:37:56 +0200 Message-Id: <20210409223801.104657-1-mcroce@linux.microsoft.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	page_pool: recycle buffers \| expand [net-next,v3,0/5] page_pool: recycle buffers [net-next,v3,2/5] mm: add a signature in struct page [net-next,v3,3/5] page_pool: Allow drivers to hint on SKB recycling [net-next,v3,4/5] mvpp2: recycle buffers

Matteo Croce April 9, 2021, 10:37 p.m. UTC

From: Matteo Croce <mcroce@microsoft.com>

This is a respin of [1]

This  patchset shows the plans for allowing page_pool to handle and
maintain DMA map/unmap of the pages it serves to the driver.  For this
to work a return hook in the network core is introduced.

The overall purpose is to simplify drivers, by providing a page
allocation API that does recycling, such that each driver doesn't have
to reinvent its own recycling scheme.  Using page_pool in a driver
does not require implementing XDP support, but it makes it trivially
easy to do so.  Instead of allocating buffers specifically for SKBs
we now allocate a generic buffer and either wrap it on an SKB
(via build_skb) or create an XDP frame.
The recycling code leverages the XDP recycle APIs.

The Marvell mvpp2 and mvneta drivers are used in this patchset to
demonstrate how to use the API, and tested on a MacchiatoBIN
and EspressoBIN boards respectively.

Please let this going in on a future -rc1 so to allow enough time
to have wider tests.

[1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/

v2 -> v3:
- added missing SOBs
- CCed the MM people

v1 -> v2:
- fix a commit message
- avoid setting pp_recycle multiple times on mvneta
- squash two patches to avoid breaking bisect

Ilias Apalodimas (1):
  page_pool: Allow drivers to hint on SKB recycling

Jesper Dangaard Brouer (1):
  xdp: reduce size of struct xdp_mem_info

Matteo Croce (3):
  mm: add a signature in struct page
  mvpp2: recycle buffers
  mvneta: recycle buffers

 .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
 drivers/net/ethernet/marvell/mvneta.c         |  7 ++-
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++----
 drivers/net/ethernet/marvell/sky2.c           |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  2 +-
 include/linux/mm_types.h                      |  1 +
 include/linux/skbuff.h                        | 35 ++++++++++++--
 include/net/page_pool.h                       | 15 ++++++
 include/net/xdp.h                             |  5 +-
 net/core/page_pool.c                          | 47 +++++++++++++++++++
 net/core/skbuff.c                             | 20 +++++++-
 net/core/xdp.c                                | 14 ++++--
 net/tls/tls_device.c                          |  2 +-
 13 files changed, 142 insertions(+), 27 deletions(-)

Yunsheng Lin April 29, 2021, 8:27 a.m. UTC | #1

On 2021/4/10 6:37, Matteo Croce wrote:
> From: Matteo Croce <mcroce@microsoft.com>

> 

> This is a respin of [1]

> 

> This  patchset shows the plans for allowing page_pool to handle and

> maintain DMA map/unmap of the pages it serves to the driver.  For this

> to work a return hook in the network core is introduced.

> 

> The overall purpose is to simplify drivers, by providing a page

> allocation API that does recycling, such that each driver doesn't have

> to reinvent its own recycling scheme.  Using page_pool in a driver

> does not require implementing XDP support, but it makes it trivially

> easy to do so.  Instead of allocating buffers specifically for SKBs

> we now allocate a generic buffer and either wrap it on an SKB

> (via build_skb) or create an XDP frame.

> The recycling code leverages the XDP recycle APIs.

> 

> The Marvell mvpp2 and mvneta drivers are used in this patchset to

> demonstrate how to use the API, and tested on a MacchiatoBIN

> and EspressoBIN boards respectively.

> 


Hi, Matteo
     I added the skb frag page recycling in hns3 based on this patchset,
and it has above 10%~20% performance improvement for one thread iperf
TCP flow(IOMMU is off, there may be more performance improvement if
considering the DMA map/unmap avoiding for IOMMU), thanks for the job.

    The skb frag page recycling support in hns3 driver is not so simple
as the mvpp2 and mvneta driver, because:

1. the hns3 driver do not have XDP support yet, so "struct xdp_rxq_info"
   is added to assist relation binding between the "struct page" and
   "struct page_pool".

2. the hns3 driver has already a page reusing based on page spliting and
   page reference count, but it may not work if the upper stack can not
   handle skb and release the corresponding page fast enough.

3. the hns3 driver support page reference count updating batching, see:
   aeda9bf87a45 ("net: hns3: batch the page reference count updates")

So it would be better if：

1. skb frag page recycling do not need "struct xdp_rxq_info" or
   "struct xdp_mem_info" to bond the relation between "struct page" and
   "struct page_pool", which seems uncessary at this point if bonding
   a "struct page_pool" pointer directly in "struct page" does not cause
   space increasing.

2. it would be good to do the page reference count updating batching
   in page pool instead of specific driver.


page_pool_atomic_sub_if_positive() is added to decide who can call
page_pool_put_full_page(), because the driver and stack may hold
reference to the same page, only if last one which hold complete
reference to a page can call page_pool_put_full_page() to decide if
recycling is possible, if not, the page is released, so I am wondering
if a similar page_pool_atomic_sub_if_positive() can added to specific
user space address unmapping path to allow skb recycling for RX zerocopy
too?

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index c21dd11..8b01a7d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2566,7 +2566,10 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring,
        unsigned int order = hns3_page_order(ring);
        struct page *p;

-       p = dev_alloc_pages(order);
+       if (ring->page_pool)
+               p = page_pool_dev_alloc_pages(ring->page_pool);
+       else
+               p = dev_alloc_pages(order);
        if (!p)
                return -ENOMEM;

@@ -2582,13 +2585,32 @@ static int hns3_alloc_buffer(struct hns3_enet_ring *ring,
        return 0;
 }

+static void hns3_page_frag_cache_drain(struct hns3_enet_ring *ring,
+                                      struct hns3_desc_cb *cb)
+{
+       if (ring->page_pool) {
+               struct page *p = cb->priv;
+
+               if (page_pool_atomic_sub_if_positive(p, cb->pagecnt_bias))
+                       return;
+
+               if (cb->pagecnt_bias > 1)
+                       page_ref_sub(p, cb->pagecnt_bias - 1);
+
+               page_pool_put_full_page(ring->page_pool, p, false);
+               return;
+       }
+
+       __page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+}
+
 static void hns3_free_buffer(struct hns3_enet_ring *ring,
                             struct hns3_desc_cb *cb, int budget)
 {
        if (cb->type == DESC_TYPE_SKB)
                napi_consume_skb(cb->priv, budget);
        else if (!HNAE3_IS_TX_RING(ring) && cb->pagecnt_bias)
-               __page_frag_cache_drain(cb->priv, cb->pagecnt_bias);
+               hns3_page_frag_cache_drain(ring, cb);
        memset(cb, 0, sizeof(*cb));
 }

@@ -2892,13 +2914,15 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i,
        skb_add_rx_frag(skb, i, desc_cb->priv, desc_cb->page_offset + pull_len,
                        size - pull_len, truesize);

+       skb_mark_for_recycle(skb, desc_cb->priv, &ring->rxq_info.mem);
+
        /* Avoid re-using remote and pfmemalloc pages, or the stack is still
         * using the page when page_offset rollback to zero, flag default
         * unreuse
         */
        if (!dev_page_is_reusable(desc_cb->priv) ||
            (!desc_cb->page_offset && !hns3_can_reuse_page(desc_cb))) {
-               __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias);
+               hns3_page_frag_cache_drain(ring, desc_cb);
                return;
        }

@@ -2911,7 +2935,7 @@ static void hns3_nic_reuse_page(struct sk_buff *skb, int i,
                desc_cb->reuse_flag = 1;
                desc_cb->page_offset = 0;
        } else if (desc_cb->pagecnt_bias) {
-               __page_frag_cache_drain(desc_cb->priv, desc_cb->pagecnt_bias);
+               hns3_page_frag_cache_drain(ring, desc_cb);
                return;
        }

@@ -3156,8 +3180,7 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, unsigned int length,
                if (dev_page_is_reusable(desc_cb->priv))
                        desc_cb->reuse_flag = 1;
                else /* This page cannot be reused so discard it */
-                       __page_frag_cache_drain(desc_cb->priv,
-                                               desc_cb->pagecnt_bias);
+                       hns3_page_frag_cache_drain(ring, desc_cb);

                hns3_rx_ring_move_fw(ring);
                return 0;
@@ -4028,6 +4051,33 @@ static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring)
                goto out_with_desc_cb;

        if (!HNAE3_IS_TX_RING(ring)) {
+               struct page_pool_params pp_params = {
+               /* internal DMA mapping in page_pool */
+               .flags = 0,
+               .order = 0,
+               .pool_size = 1024,
+               .nid = dev_to_node(ring_to_dev(ring)),
+               .dev = ring_to_dev(ring),
+               .dma_dir = DMA_FROM_DEVICE,
+               .offset = 0,
+               .max_len = 0,
+               };
+
+               ring->page_pool = page_pool_create(&pp_params);
+               if (IS_ERR(ring->page_pool)) {
+                       dev_err(ring_to_dev(ring), "page pool creation failed\n");
+                       ring->page_pool = NULL;
+               }
+
+               ret = xdp_rxq_info_reg(&ring->rxq_info, ring_to_netdev(ring), ring->queue_index, 0);
+               if (ret)
+                       dev_err(ring_to_dev(ring), "xdp_rxq_info_reg failed\n");
+
+               ret = xdp_rxq_info_reg_mem_model(&ring->rxq_info, MEM_TYPE_PAGE_POOL,
+                                                ring->page_pool);
+               if (ret)
+                       dev_err(ring_to_dev(ring), "xdp_rxq_info_reg_mem_model failed\n");
+
                ret = hns3_alloc_ring_buffers(ring);
                if (ret)
                        goto out_with_desc;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index daa04ae..fd53fcc 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -6,6 +6,9 @@

 #include <linux/if_vlan.h>

+#include <net/page_pool.h>
+#include <net/xdp.h>
+
 #include "hnae3.h"

 enum hns3_nic_state {
@@ -408,6 +411,8 @@ struct hns3_enet_ring {
        struct hnae3_queue *tqp;
        int queue_index;
        struct device *dev; /* will be used for DMA mapping of descriptors */
+       struct page_pool *page_pool;
        struct hnae3_queue *tqp;
        int queue_index;
        struct device *dev; /* will be used for DMA mapping of descriptors */
+       struct page_pool *page_pool;
+       struct xdp_rxq_info rxq_info;

        /* statistic */
        struct ring_stats stats;
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 75fffc1..70c310e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -195,6 +195,8 @@ static inline void page_pool_put_full_page(struct page_pool *pool,
 #endif
 }

+bool page_pool_atomic_sub_if_positive(struct page *page, int i);
+
 /* Same as above but the caller must guarantee safe context. e.g NAPI */
 static inline void page_pool_recycle_direct(struct page_pool *pool,
                                            struct page *page)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 43bfd2e..8bc8b7e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -596,6 +596,26 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 }
 EXPORT_SYMBOL(page_pool_update_nid);

+bool page_pool_atomic_sub_if_positive(struct page *page, int i)
+{
+       atomic_t *v = &page->_refcount;
+       int dec, c;
+
+       do {
+               c = atomic_read(v);
+
+               dec = c - i;
+               if (unlikely(dec == 0))
+                       return false;
+               else if (unlikely(dec < 0)) {
+                       pr_err("c: %d, dec: %d, i: %d\n", c, dec, i);
+                       return false;
+               }
+       } while (!atomic_try_cmpxchg(v, &c, dec));
+
+       return true;
+}
+
 bool page_pool_return_skb_page(void *data)
 {
        struct xdp_mem_info mem_info;
@@ -606,6 +626,9 @@ bool page_pool_return_skb_page(void *data)
        if (unlikely(page->signature != PP_SIGNATURE))
                return false;

+       if (page_pool_atomic_sub_if_positive(page, 1))
+               return true;
+
        info.raw = page_private(page);
        mem_info = info.mem_info;

Ilias Apalodimas April 29, 2021, 6:51 p.m. UTC | #2

Hi Yunsheng,

On Thu, Apr 29, 2021 at 04:27:21PM +0800, Yunsheng Lin wrote:
> On 2021/4/10 6:37, Matteo Croce wrote:

> > From: Matteo Croce <mcroce@microsoft.com>

> > 

> > This is a respin of [1]

> > 

> > This  patchset shows the plans for allowing page_pool to handle and

> > maintain DMA map/unmap of the pages it serves to the driver.  For this

> > to work a return hook in the network core is introduced.

> > 

> > The overall purpose is to simplify drivers, by providing a page

> > allocation API that does recycling, such that each driver doesn't have

> > to reinvent its own recycling scheme.  Using page_pool in a driver

> > does not require implementing XDP support, but it makes it trivially

> > easy to do so.  Instead of allocating buffers specifically for SKBs

> > we now allocate a generic buffer and either wrap it on an SKB

> > (via build_skb) or create an XDP frame.

> > The recycling code leverages the XDP recycle APIs.

> > 

> > The Marvell mvpp2 and mvneta drivers are used in this patchset to

> > demonstrate how to use the API, and tested on a MacchiatoBIN

> > and EspressoBIN boards respectively.

> > 

> 

> Hi, Matteo

>      I added the skb frag page recycling in hns3 based on this patchset,

> and it has above 10%~20% performance improvement for one thread iperf

> TCP flow(IOMMU is off, there may be more performance improvement if

> considering the DMA map/unmap avoiding for IOMMU), thanks for the job.

> 

>     The skb frag page recycling support in hns3 driver is not so simple

> as the mvpp2 and mvneta driver, because:

> 

> 1. the hns3 driver do not have XDP support yet, so "struct xdp_rxq_info"

>    is added to assist relation binding between the "struct page" and

>    "struct page_pool".

> 

> 2. the hns3 driver has already a page reusing based on page spliting and

>    page reference count, but it may not work if the upper stack can not

>    handle skb and release the corresponding page fast enough.

> 

> 3. the hns3 driver support page reference count updating batching, see:

>    aeda9bf87a45 ("net: hns3: batch the page reference count updates")

> 

> So it would be better if：

> 

> 1. skb frag page recycling do not need "struct xdp_rxq_info" or

>    "struct xdp_mem_info" to bond the relation between "struct page" and

>    "struct page_pool", which seems uncessary at this point if bonding

>    a "struct page_pool" pointer directly in "struct page" does not cause

>    space increasing.


We can't do that. The reason we need those structs is that we rely on the
existing XDP code, which already recycles it's buffers, to enable
recycling.  Since we allocate a page per packet when using page_pool for a
driver , the same ideas apply to an SKB and XDP frame. We just recycle the
payload and we don't really care what's in that.  We could rename the functions
to something more generic in the future though ?

> 

> 2. it would be good to do the page reference count updating batching

>    in page pool instead of specific driver.

> 

> 

> page_pool_atomic_sub_if_positive() is added to decide who can call

> page_pool_put_full_page(), because the driver and stack may hold

> reference to the same page, only if last one which hold complete

> reference to a page can call page_pool_put_full_page() to decide if

> recycling is possible, if not, the page is released, so I am wondering

> if a similar page_pool_atomic_sub_if_positive() can added to specific

> user space address unmapping path to allow skb recycling for RX zerocopy

> too?

> 


I would prefer a different page pool type if we wanted to support the split
page model.  The changes as is are quite intrusive, since they change the 
entire skb return path.  So I would prefer introducing the changes one at a 
time. 

The fundamental difference between having the recycling in the driver vs
having it in a generic API is pretty straightforward.  When a driver holds
the extra page references he is free to decide what to reuse, when he is about
to refill his Rx descriptors.  So TCP zerocopy might work even if the
userspace applications hold the buffers for an X amount of time.
On this proposal though we *need* to decide what to do with the buffer when we
are about to free the skb.

[...]


Cheers
/Ilias

Yunsheng Lin April 30, 2021, 3:01 a.m. UTC | #3

On 2021/4/30 2:51, Ilias Apalodimas wrote:
> Hi Yunsheng,

> 

> On Thu, Apr 29, 2021 at 04:27:21PM +0800, Yunsheng Lin wrote:

>> On 2021/4/10 6:37, Matteo Croce wrote:

>>> From: Matteo Croce <mcroce@microsoft.com>

[...]

>>

>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or

>>    "struct xdp_mem_info" to bond the relation between "struct page" and

>>    "struct page_pool", which seems uncessary at this point if bonding

>>    a "struct page_pool" pointer directly in "struct page" does not cause

>>    space increasing.

> 

> We can't do that. The reason we need those structs is that we rely on the

> existing XDP code, which already recycles it's buffers, to enable

> recycling.  Since we allocate a page per packet when using page_pool for a

> driver , the same ideas apply to an SKB and XDP frame. We just recycle the

I am not really familar with XDP here, but a packet from hw is either a
"struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,
a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at
the same time, right?

What does not really make sense to me is that the page has to be from page
pool when a skb's frag page can be recycled, right? If it is ture, the switch
case in __xdp_return() does not really make sense for skb recycling, why go
all the trouble of checking the mem->type and mem->id to find the page_pool
pointer when recyclable page for skb can only be from page pool?

> payload and we don't really care what's in that.  We could rename the functions

> to something more generic in the future though ?

> 

>>

>> 2. it would be good to do the page reference count updating batching

>>    in page pool instead of specific driver.

>>

>>

>> page_pool_atomic_sub_if_positive() is added to decide who can call

>> page_pool_put_full_page(), because the driver and stack may hold

>> reference to the same page, only if last one which hold complete

>> reference to a page can call page_pool_put_full_page() to decide if

>> recycling is possible, if not, the page is released, so I am wondering

>> if a similar page_pool_atomic_sub_if_positive() can added to specific

>> user space address unmapping path to allow skb recycling for RX zerocopy

>> too?

>>

> 

> I would prefer a different page pool type if we wanted to support the split

> page model.  The changes as is are quite intrusive, since they change the 

> entire skb return path.  So I would prefer introducing the changes one at a 

> time. 

I understand there may be fundamental semantic change when split page model
is supported by page pool, but the split page support change mainly affect the
skb recycling path and the driver that uses page pool(XDP too) if we are careful
enough, not the entire skb return path as my understanding.

Anyway, one changes at a time is always prefered if supporting split page is
proved to be non-trivel and intrusive.

> 

> The fundamental difference between having the recycling in the driver vs

> having it in a generic API is pretty straightforward.  When a driver holds

> the extra page references he is free to decide what to reuse, when he is about

> to refill his Rx descriptors.  So TCP zerocopy might work even if the

> userspace applications hold the buffers for an X amount of time.

> On this proposal though we *need* to decide what to do with the buffer when we

> are about to free the skb.

I am not sure I understand what you meant by "free the skb", does it mean
that kfree_skb() is called to free the skb.

As my understanding, if the skb completely own the page(which means page_count()
== 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise
page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()
try to handle it atomically.

> 

> [...]

> 

> 

> Cheers

> /Ilias

> 

> .

>

Ilias Apalodimas April 30, 2021, 4:24 p.m. UTC | #4

[...]
> >>

> >> 1. skb frag page recycling do not need "struct xdp_rxq_info" or

> >>    "struct xdp_mem_info" to bond the relation between "struct page" and

> >>    "struct page_pool", which seems uncessary at this point if bonding

> >>    a "struct page_pool" pointer directly in "struct page" does not cause

> >>    space increasing.

> > 

> > We can't do that. The reason we need those structs is that we rely on the

> > existing XDP code, which already recycles it's buffers, to enable

> > recycling.  Since we allocate a page per packet when using page_pool for a

> > driver , the same ideas apply to an SKB and XDP frame. We just recycle the

> 

> I am not really familar with XDP here, but a packet from hw is either a

> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,

> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at

> the same time, right?

> 


Yes, but the payload is irrelevant in both cases and that's what we use
page_pool for.  You can't use this patchset unless your driver usues
build_skb().  So in both cases you just allocate memory for the payload and
decide what the wrap the buffer with (XDP or SKB) later.

> What does not really make sense to me is that the page has to be from page

> pool when a skb's frag page can be recycled, right? If it is ture, the switch

> case in __xdp_return() does not really make sense for skb recycling, why go

> all the trouble of checking the mem->type and mem->id to find the page_pool

> pointer when recyclable page for skb can only be from page pool?


In any case you need to find in which pool the buffer you try to recycle
belongs.  In order to make the whole idea generic and be able to recycle skb 
fragments instead of just the skb head you need to store some information on 
struct page.  That's the fundamental difference of this patchset compared to 
the RFC we sent a few years back [1] which was just storing information on the 
skb.  The way this is done on the current patchset is that we store the 
struct xdp_mem_info in page->private and then look it up on xdp_return().

Now that being said Matthew recently reworked struct page, so we could see if
we can store the page pool pointer directly instead of the struct
xdp_mem_info. That would allow us to call into page pool functions directly.
But we'll have to agree if that makes sense to go into struct page to begin
with and make sure the pointer is still valid when we take the recycling path.

> > payload and we don't really care what's in that.  We could rename the functions

> > to something more generic in the future though ?

> > 

> >>

> >> 2. it would be good to do the page reference count updating batching

> >>    in page pool instead of specific driver.

> >>

> >>

> >> page_pool_atomic_sub_if_positive() is added to decide who can call

> >> page_pool_put_full_page(), because the driver and stack may hold

> >> reference to the same page, only if last one which hold complete

> >> reference to a page can call page_pool_put_full_page() to decide if

> >> recycling is possible, if not, the page is released, so I am wondering

> >> if a similar page_pool_atomic_sub_if_positive() can added to specific

> >> user space address unmapping path to allow skb recycling for RX zerocopy

> >> too?

> >>

> > 

> > I would prefer a different page pool type if we wanted to support the split

> > page model.  The changes as is are quite intrusive, since they change the 

> > entire skb return path.  So I would prefer introducing the changes one at a 

> > time. 

> 

> I understand there may be fundamental semantic change when split page model

> is supported by page pool, but the split page support change mainly affect the

> skb recycling path and the driver that uses page pool(XDP too) if we are careful

> enough, not the entire skb return path as my understanding.


It affects those drivers only, but in order to do so is intercepts the
packet in skb_free_head(), which pretty much affects the entire network path.

> 

> Anyway, one changes at a time is always prefered if supporting split page is

> proved to be non-trivel and intrusive.

> 

> > 

> > The fundamental difference between having the recycling in the driver vs

> > having it in a generic API is pretty straightforward.  When a driver holds

> > the extra page references he is free to decide what to reuse, when he is about

> > to refill his Rx descriptors.  So TCP zerocopy might work even if the

> > userspace applications hold the buffers for an X amount of time.

> > On this proposal though we *need* to decide what to do with the buffer when we

> > are about to free the skb.

> 

> I am not sure I understand what you meant by "free the skb", does it mean

> that kfree_skb() is called to free the skb.


Yes

> 

> As my understanding, if the skb completely own the page(which means page_count()

> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise

> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()

> try to handle it atomically.

> 


Not really, the opposite is happening here. If the pp_recycle bit is set we
will always call page_pool_return_skb_page().  If the page signature matches
the 'magic' set by page pool we will always call xdp_return_skb_frame() will
end up calling __page_pool_put_page(). If the refcnt is 1 we'll try
to recycle the page.  If it's not we'll release it from page_pool (releasing
some internal references we keep) unmap the buffer and decrement the refcnt.

[1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/

Cheers
/Ilias

Ilias Apalodimas April 30, 2021, 5:32 p.m. UTC | #5

(-cc invalid emails)
Replying to my self here but....

[...]
> > >

> > > We can't do that. The reason we need those structs is that we rely on the

> > > existing XDP code, which already recycles it's buffers, to enable

> > > recycling.  Since we allocate a page per packet when using page_pool for a

> > > driver , the same ideas apply to an SKB and XDP frame. We just recycle the

> >

> > I am not really familar with XDP here, but a packet from hw is either a

> > "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,

> > a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at

> > the same time, right?

> >

>

> Yes, but the payload is irrelevant in both cases and that's what we use

> page_pool for.  You can't use this patchset unless your driver usues

> build_skb().  So in both cases you just allocate memory for the payload and

> decide what the wrap the buffer with (XDP or SKB) later.

>

> > What does not really make sense to me is that the page has to be from page

> > pool when a skb's frag page can be recycled, right? If it is ture, the switch

> > case in __xdp_return() does not really make sense for skb recycling, why go

> > all the trouble of checking the mem->type and mem->id to find the page_pool

> > pointer when recyclable page for skb can only be from page pool?

>

> In any case you need to find in which pool the buffer you try to recycle

> belongs.  In order to make the whole idea generic and be able to recycle skb

> fragments instead of just the skb head you need to store some information on

> struct page.  That's the fundamental difference of this patchset compared to

> the RFC we sent a few years back [1] which was just storing information on the

> skb.  The way this is done on the current patchset is that we store the

> struct xdp_mem_info in page->private and then look it up on xdp_return().

>

> Now that being said Matthew recently reworked struct page, so we could see if

> we can store the page pool pointer directly instead of the struct

> xdp_mem_info. That would allow us to call into page pool functions directly.

> But we'll have to agree if that makes sense to go into struct page to begin

> with and make sure the pointer is still valid when we take the recycling path.

>


Thinking more about it the reason that prevented us from storing a
page pool pointer directly is not there anymore. Jesper fixed that
already a while back. So we might as well store the page_pool ptr in
page->private and call into the functions directly.  I'll have a look
before v4.

[...]

Thanks
/Ilias

Jesper Dangaard Brouer May 3, 2021, 7:29 a.m. UTC | #6

On Fri, 30 Apr 2021 20:32:07 +0300
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> (-cc invalid emails)

> Replying to my self here but....

> 

> [...]

> > > >

> > > > We can't do that. The reason we need those structs is that we rely on the

> > > > existing XDP code, which already recycles it's buffers, to enable

> > > > recycling.  Since we allocate a page per packet when using page_pool for a

> > > > driver , the same ideas apply to an SKB and XDP frame. We just recycle the  

> > >

> > > I am not really familar with XDP here, but a packet from hw is either a

> > > "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,

> > > a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at

> > > the same time, right?

> > >  

> >

> > Yes, but the payload is irrelevant in both cases and that's what we use

> > page_pool for.  You can't use this patchset unless your driver usues

> > build_skb().  So in both cases you just allocate memory for the payload and

> > decide what the wrap the buffer with (XDP or SKB) later.

> >  

> > > What does not really make sense to me is that the page has to be from page

> > > pool when a skb's frag page can be recycled, right? If it is ture, the switch

> > > case in __xdp_return() does not really make sense for skb recycling, why go

> > > all the trouble of checking the mem->type and mem->id to find the page_pool

> > > pointer when recyclable page for skb can only be from page pool?  

> >

> > In any case you need to find in which pool the buffer you try to recycle

> > belongs.  In order to make the whole idea generic and be able to recycle skb

> > fragments instead of just the skb head you need to store some information on

> > struct page.  That's the fundamental difference of this patchset compared to

> > the RFC we sent a few years back [1] which was just storing information on the

> > skb.  The way this is done on the current patchset is that we store the

> > struct xdp_mem_info in page->private and then look it up on xdp_return().

> >

> > Now that being said Matthew recently reworked struct page, so we could see if

> > we can store the page pool pointer directly instead of the struct

> > xdp_mem_info. That would allow us to call into page pool functions directly.

> > But we'll have to agree if that makes sense to go into struct page to begin

> > with and make sure the pointer is still valid when we take the recycling path.

> >  

> 

> Thinking more about it the reason that prevented us from storing a

> page pool pointer directly is not there anymore. Jesper fixed that

> already a while back. So we might as well store the page_pool ptr in

> page->private and call into the functions directly.  I'll have a look

> before v4.


I want to give credit to Jonathan Lemon whom came up with the idea of
storing the page_pool object that "owns" the page directly in struct
page.  I see this as an optimization that we can add later, so it
doesn't block this patchset.  As Ilias mention, it required some
work/changes[1]+[2] to guarantee that the page_pool object life-time
were longer than all the outstanding in-flight page-objects, but that
have been stable for some/many kernel releases now.  This is already
need/used for making sure the DMA-mappings can be safely released[1],
but I on-purpose enabled the same in-flight tracking for page_pool
users that doesn't use the DMA-mapping feature (making sure the code is
exercised).


[1] 99c07c43c4ea ("xdp: tracking page_pool resources and safe removal")
[2] c3f812cea0d7 ("page_pool: do not release pool until inflight == 0.")
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Yunsheng Lin May 6, 2021, 12:34 p.m. UTC | #7

On 2021/5/1 0:24, Ilias Apalodimas wrote:
> [...]

>>>>

>>>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or

>>>>    "struct xdp_mem_info" to bond the relation between "struct page" and

>>>>    "struct page_pool", which seems uncessary at this point if bonding

>>>>    a "struct page_pool" pointer directly in "struct page" does not cause

>>>>    space increasing.

>>>

>>> We can't do that. The reason we need those structs is that we rely on the

>>> existing XDP code, which already recycles it's buffers, to enable

>>> recycling.  Since we allocate a page per packet when using page_pool for a

>>> driver , the same ideas apply to an SKB and XDP frame. We just recycle the

>>

>> I am not really familar with XDP here, but a packet from hw is either a

>> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,

>> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at

>> the same time, right?

>>

> 

> Yes, but the payload is irrelevant in both cases and that's what we use

> page_pool for.  You can't use this patchset unless your driver usues

> build_skb().  So in both cases you just allocate memory for the payload and

I am not sure I understood why build_skb() matters here. If the head data of
a skb is a page frag and is from page pool, then it's page->signature should be
PP_SIGNATURE, otherwise it's page->signature is zero, so a recyclable skb does
not require it's head data being from a page pool, right?

> decide what the wrap the buffer with (XDP or SKB) later.

[...]

>>

>> I am not sure I understand what you meant by "free the skb", does it mean

>> that kfree_skb() is called to free the skb.

> 

> Yes

> 

>>

>> As my understanding, if the skb completely own the page(which means page_count()

>> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise

>> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()

>> try to handle it atomically.

>>

> 

> Not really, the opposite is happening here. If the pp_recycle bit is set we

> will always call page_pool_return_skb_page().  If the page signature matches

> the 'magic' set by page pool we will always call xdp_return_skb_frame() will

> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try

> to recycle the page.  If it's not we'll release it from page_pool (releasing

> some internal references we keep) unmap the buffer and decrement the refcnt.

Yes, I understood the above is what the page pool do now.

But the question is who is still holding an extral reference to the page when
kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral
reference to the same page? So why not just do a page_ref_dec() if the orginal skb
is freed first, and call __page_pool_put_page() when the cloned skb is freed later?
So that we can always reuse the recyclable page from a recyclable skb. This may
make the page_pool_destroy() process delays longer than before, I am supposed the
page_pool_destroy() delaying for cloned skb case does not really matters here.

If the above works, I think the samiliar handling can be added to RX zerocopy if
the RX zerocopy also hold extral references to the recyclable page from a recyclable
skb too?

> 

> [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/

> 

> Cheers

> /Ilias

> 

> .

>

Ilias Apalodimas May 6, 2021, 12:58 p.m. UTC | #8

On Thu, May 06, 2021 at 08:34:48PM +0800, Yunsheng Lin wrote:
> On 2021/5/1 0:24, Ilias Apalodimas wrote:

> > [...]

> >>>>

> >>>> 1. skb frag page recycling do not need "struct xdp_rxq_info" or

> >>>>    "struct xdp_mem_info" to bond the relation between "struct page" and

> >>>>    "struct page_pool", which seems uncessary at this point if bonding

> >>>>    a "struct page_pool" pointer directly in "struct page" does not cause

> >>>>    space increasing.

> >>>

> >>> We can't do that. The reason we need those structs is that we rely on the

> >>> existing XDP code, which already recycles it's buffers, to enable

> >>> recycling.  Since we allocate a page per packet when using page_pool for a

> >>> driver , the same ideas apply to an SKB and XDP frame. We just recycle the

> >>

> >> I am not really familar with XDP here, but a packet from hw is either a

> >> "struct xdp_frame/xdp_buff" for XDP or a "struct sk_buff" for TCP/IP stack,

> >> a packet can not be both "struct xdp_frame/xdp_buff" and "struct sk_buff" at

> >> the same time, right?

> >>

> > 

> > Yes, but the payload is irrelevant in both cases and that's what we use

> > page_pool for.  You can't use this patchset unless your driver usues

> > build_skb().  So in both cases you just allocate memory for the payload and

> 

> I am not sure I understood why build_skb() matters here. If the head data of

> a skb is a page frag and is from page pool, then it's page->signature should be

> PP_SIGNATURE, otherwise it's page->signature is zero, so a recyclable skb does

> not require it's head data being from a page pool, right?

> 


Correct, and that's the big improvement compared to the original RFC.
The wording was a bit off in my initial response.  I was trying to point out
you can recycle *any* buffer coming from page_pool and one of the ways you can
do that in your driver, is use build_skb() while the payload is allocated by
page_pool.

> > decide what the wrap the buffer with (XDP or SKB) later.

> 

> [...]

> 

> >>

> >> I am not sure I understand what you meant by "free the skb", does it mean

> >> that kfree_skb() is called to free the skb.

> > 

> > Yes

> > 

> >>

> >> As my understanding, if the skb completely own the page(which means page_count()

> >> == 1) when kfree_skb() is called, __page_pool_put_page() is called, otherwise

> >> page_ref_dec() is called, which is exactly what page_pool_atomic_sub_if_positive()

> >> try to handle it atomically.

> >>

> > 

> > Not really, the opposite is happening here. If the pp_recycle bit is set we

> > will always call page_pool_return_skb_page().  If the page signature matches

> > the 'magic' set by page pool we will always call xdp_return_skb_frame() will

> > end up calling __page_pool_put_page(). If the refcnt is 1 we'll try

> > to recycle the page.  If it's not we'll release it from page_pool (releasing

> > some internal references we keep) unmap the buffer and decrement the refcnt.

> 

> Yes, I understood the above is what the page pool do now.

> 

> But the question is who is still holding an extral reference to the page when

> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral

> reference to the same page? So why not just do a page_ref_dec() if the orginal skb

> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?

> So that we can always reuse the recyclable page from a recyclable skb. This may

> make the page_pool_destroy() process delays longer than before, I am supposed the

> page_pool_destroy() delaying for cloned skb case does not really matters here.

> 

> If the above works, I think the samiliar handling can be added to RX zerocopy if

> the RX zerocopy also hold extral references to the recyclable page from a recyclable

> skb too?

> 


Right, this sounds doable, but I'll have to go back code it and see if it
really makes sense.  However I'd still prefer the support to go in as-is
(including the struct xdp_mem_info in struct page, instead of a page_pool
pointer).

There's a couple of reasons for that.  If we keep the struct xdp_mem_info we
can in the future recycle different kind of buffers using __xdp_return().
And this is a non intrusive change if we choose to store the page pool address
directly in the future.  It just affects the internal contract between the
page_pool code and struct page.  So it won't affect any drivers that already
use the feature.
Regarding the page_ref_dec(), which as I said sounds doable, I'd prefer
playing it safe for now and getting rid of the buffers that somehow ended up
holding an extra reference.  Once this gets approved we can go back and try to
save the extra space.  I hope I am not wrong but the changes required to
support a few extra refcounts should not change the current patches much.

Thanks for taking the time on this!
/Ilias

> > 

> > [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/

> > 

> > Cheers

> > /Ilias

> > 

> > .

> > 

>

Yunsheng Lin May 7, 2021, 3:23 a.m. UTC | #9

On 2021/5/6 20:58, Ilias Apalodimas wrote:
>>>>

>>>

>>> Not really, the opposite is happening here. If the pp_recycle bit is set we

>>> will always call page_pool_return_skb_page().  If the page signature matches

>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will

>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try

>>> to recycle the page.  If it's not we'll release it from page_pool (releasing

>>> some internal references we keep) unmap the buffer and decrement the refcnt.

>>

>> Yes, I understood the above is what the page pool do now.

>>

>> But the question is who is still holding an extral reference to the page when

>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral

>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb

>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?

>> So that we can always reuse the recyclable page from a recyclable skb. This may

>> make the page_pool_destroy() process delays longer than before, I am supposed the

>> page_pool_destroy() delaying for cloned skb case does not really matters here.

>>

>> If the above works, I think the samiliar handling can be added to RX zerocopy if

>> the RX zerocopy also hold extral references to the recyclable page from a recyclable

>> skb too?

>>

> 

> Right, this sounds doable, but I'll have to go back code it and see if it

> really makes sense.  However I'd still prefer the support to go in as-is

> (including the struct xdp_mem_info in struct page, instead of a page_pool

> pointer).

> 

> There's a couple of reasons for that.  If we keep the struct xdp_mem_info we

> can in the future recycle different kind of buffers using __xdp_return().

> And this is a non intrusive change if we choose to store the page pool address

> directly in the future.  It just affects the internal contract between the

> page_pool code and struct page.  So it won't affect any drivers that already

> use the feature.


This patchset has embeded a signature field in "struct page", and xdp_mem_info
is stored in page_private(), which seems not considering the case for associating
the page pool with "struct page" directly yet? Is the page pool also stored in
page_private() and a different signature is used to indicate that?

I am not saying we have to do it in this patchset, but we have to consider it
while we are adding new signature field to "struct page", right?

> Regarding the page_ref_dec(), which as I said sounds doable, I'd prefer

> playing it safe for now and getting rid of the buffers that somehow ended up

> holding an extra reference.  Once this gets approved we can go back and try to

> save the extra space.  I hope I am not wrong but the changes required to

> support a few extra refcounts should not change the current patches much.

> 

> Thanks for taking the time on this!


Thanks all invovled in the effort improving page pool too:)

> /Ilias

> 

>>>

>>> [1] https://lore.kernel.org/netdev/154413868810.21735.572808840657728172.stgit@firesoul/

>>>

>>> Cheers

>>> /Ilias

>>>

>>> .

>>>

>>

> 

> .

>

Ilias Apalodimas May 7, 2021, 7:06 a.m. UTC | #10

On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:
> On 2021/5/6 20:58, Ilias Apalodimas wrote:

> >>>>

> >>>

> >>> Not really, the opposite is happening here. If the pp_recycle bit is set we

> >>> will always call page_pool_return_skb_page().  If the page signature matches

> >>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will

> >>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try

> >>> to recycle the page.  If it's not we'll release it from page_pool (releasing

> >>> some internal references we keep) unmap the buffer and decrement the refcnt.

> >>

> >> Yes, I understood the above is what the page pool do now.

> >>

> >> But the question is who is still holding an extral reference to the page when

> >> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral

> >> reference to the same page? So why not just do a page_ref_dec() if the orginal skb

> >> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?

> >> So that we can always reuse the recyclable page from a recyclable skb. This may

> >> make the page_pool_destroy() process delays longer than before, I am supposed the

> >> page_pool_destroy() delaying for cloned skb case does not really matters here.

> >>

> >> If the above works, I think the samiliar handling can be added to RX zerocopy if

> >> the RX zerocopy also hold extral references to the recyclable page from a recyclable

> >> skb too?

> >>

> > 

> > Right, this sounds doable, but I'll have to go back code it and see if it

> > really makes sense.  However I'd still prefer the support to go in as-is

> > (including the struct xdp_mem_info in struct page, instead of a page_pool

> > pointer).

> > 

> > There's a couple of reasons for that.  If we keep the struct xdp_mem_info we

> > can in the future recycle different kind of buffers using __xdp_return().

> > And this is a non intrusive change if we choose to store the page pool address

> > directly in the future.  It just affects the internal contract between the

> > page_pool code and struct page.  So it won't affect any drivers that already

> > use the feature.

> 

> This patchset has embeded a signature field in "struct page", and xdp_mem_info

> is stored in page_private(), which seems not considering the case for associating

> the page pool with "struct page" directly yet? 


Correct

> Is the page pool also stored in

> page_private() and a different signature is used to indicate that?


No only struct xdp_mem_info as you mentioned before

> 

> I am not saying we have to do it in this patchset, but we have to consider it

> while we are adding new signature field to "struct page", right?


We won't need a new signature.  The signature in both cases is there to 
guarantee the page you are trying to recycle was indeed allocated by page_pool.

Basically we got two design choices here: 
- We store the page_pool ptr address directly in page->private and then,
  we call into page_pool APIs directly to do the recycling.
  That would eliminate the lookup through xdp_mem_info and the
  XDP helpers to locate page pool pointer (through __xdp_return).
- You store the xdp_mem_info on page_private.  In that case you need to go
  through __xdp_return()  to locate the page_pool pointer. Although we might
  loose some performance that would allow us to recycle additional memory types
  and not only MEM_TYPE_PAGE_POOL (in case we ever need it).


I think both choices are sane.  What I am trying to explain here, is
regardless of what we choose now, we can change it in the future without
affecting the API consumers at all.  What will change internally is the way we
lookup the page pool pointer we are trying to recycle.

[...]


Cheers
/Ilias

Yunsheng Lin May 7, 2021, 8:28 a.m. UTC | #11

On 2021/5/7 15:06, Ilias Apalodimas wrote:
> On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:

>> On 2021/5/6 20:58, Ilias Apalodimas wrote:

>>>>>>

>>>>>

>>>>> Not really, the opposite is happening here. If the pp_recycle bit is set we

>>>>> will always call page_pool_return_skb_page().  If the page signature matches

>>>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will

>>>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try

>>>>> to recycle the page.  If it's not we'll release it from page_pool (releasing

>>>>> some internal references we keep) unmap the buffer and decrement the refcnt.

>>>>

>>>> Yes, I understood the above is what the page pool do now.

>>>>

>>>> But the question is who is still holding an extral reference to the page when

>>>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral

>>>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb

>>>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?

>>>> So that we can always reuse the recyclable page from a recyclable skb. This may

>>>> make the page_pool_destroy() process delays longer than before, I am supposed the

>>>> page_pool_destroy() delaying for cloned skb case does not really matters here.

>>>>

>>>> If the above works, I think the samiliar handling can be added to RX zerocopy if

>>>> the RX zerocopy also hold extral references to the recyclable page from a recyclable

>>>> skb too?

>>>>

>>>

>>> Right, this sounds doable, but I'll have to go back code it and see if it

>>> really makes sense.  However I'd still prefer the support to go in as-is

>>> (including the struct xdp_mem_info in struct page, instead of a page_pool

>>> pointer).

>>>

>>> There's a couple of reasons for that.  If we keep the struct xdp_mem_info we

>>> can in the future recycle different kind of buffers using __xdp_return().

>>> And this is a non intrusive change if we choose to store the page pool address

>>> directly in the future.  It just affects the internal contract between the

>>> page_pool code and struct page.  So it won't affect any drivers that already

>>> use the feature.

>>

>> This patchset has embeded a signature field in "struct page", and xdp_mem_info

>> is stored in page_private(), which seems not considering the case for associating

>> the page pool with "struct page" directly yet? 

> 

> Correct

> 

>> Is the page pool also stored in

>> page_private() and a different signature is used to indicate that?

> 

> No only struct xdp_mem_info as you mentioned before

> 

>>

>> I am not saying we have to do it in this patchset, but we have to consider it

>> while we are adding new signature field to "struct page", right?

> 

> We won't need a new signature.  The signature in both cases is there to 

> guarantee the page you are trying to recycle was indeed allocated by page_pool.

> 

> Basically we got two design choices here: 

> - We store the page_pool ptr address directly in page->private and then,

>   we call into page_pool APIs directly to do the recycling.

>   That would eliminate the lookup through xdp_mem_info and the

>   XDP helpers to locate page pool pointer (through __xdp_return).

> - You store the xdp_mem_info on page_private.  In that case you need to go

>   through __xdp_return()  to locate the page_pool pointer. Although we might

>   loose some performance that would allow us to recycle additional memory types

>   and not only MEM_TYPE_PAGE_POOL (in case we ever need it).


So the signature field  in "struct page" is used to only indicate a page is
from a page pool, then how do we tell the content of page_private() if both of
the above choices are needed, we might still need an extra indicator to tell
page_private() is page_pool ptr or xdp_mem_info.

It seems storing the page pool ptr in page_private() is clear for recyclable
page from a recyclable skb use case, and the use case for storing xdp_mem_info
in page_private() is unclear yet? As XDP seems to have the xdp_mem_info in the
"struct xdp_frame", so it does not need the xdp_mem_info from page_private().

If the above is true, what does not really makes sense to me here is that:
why do we first implement a unclear use case for storing xdp_mem_info in
page_private(), why not implement the clear use case for storing page pool ptr
in page_private() first?

> 

> 

> I think both choices are sane.  What I am trying to explain here, is

> regardless of what we choose now, we can change it in the future without

> affecting the API consumers at all.  What will change internally is the way we

> lookup the page pool pointer we are trying to recycle.


It seems the below API need changing?
+static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,
+					struct xdp_mem_info *mem)


> 

> [...]

> 

> 

> Cheers

> /Ilias

> 

> .

>

Jesper Dangaard Brouer May 7, 2021, 10:19 a.m. UTC | #12

On Fri, 7 May 2021 16:28:30 +0800
Yunsheng Lin <linyunsheng@huawei.com> wrote:

> On 2021/5/7 15:06, Ilias Apalodimas wrote:

> > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:  

> >> On 2021/5/6 20:58, Ilias Apalodimas wrote:  

> >>>>>>  

> >>>>>

> >>>>> Not really, the opposite is happening here. If the pp_recycle bit is set we

> >>>>> will always call page_pool_return_skb_page().  If the page signature matches

> >>>>> the 'magic' set by page pool we will always call xdp_return_skb_frame() will

> >>>>> end up calling __page_pool_put_page(). If the refcnt is 1 we'll try

> >>>>> to recycle the page.  If it's not we'll release it from page_pool (releasing

> >>>>> some internal references we keep) unmap the buffer and decrement the refcnt.  

> >>>>

> >>>> Yes, I understood the above is what the page pool do now.

> >>>>

> >>>> But the question is who is still holding an extral reference to the page when

> >>>> kfree_skb()? Perhaps a cloned and pskb_expand_head()'ed skb is holding an extral

> >>>> reference to the same page? So why not just do a page_ref_dec() if the orginal skb

> >>>> is freed first, and call __page_pool_put_page() when the cloned skb is freed later?

> >>>> So that we can always reuse the recyclable page from a recyclable skb. This may

> >>>> make the page_pool_destroy() process delays longer than before, I am supposed the

> >>>> page_pool_destroy() delaying for cloned skb case does not really matters here.

> >>>>

> >>>> If the above works, I think the samiliar handling can be added to RX zerocopy if

> >>>> the RX zerocopy also hold extral references to the recyclable page from a recyclable

> >>>> skb too?

> >>>>  

> >>>

> >>> Right, this sounds doable, but I'll have to go back code it and see if it

> >>> really makes sense.  However I'd still prefer the support to go in as-is

> >>> (including the struct xdp_mem_info in struct page, instead of a page_pool

> >>> pointer).

> >>>

> >>> There's a couple of reasons for that.  If we keep the struct xdp_mem_info we

> >>> can in the future recycle different kind of buffers using __xdp_return().

> >>> And this is a non intrusive change if we choose to store the page pool address

> >>> directly in the future.  It just affects the internal contract between the

> >>> page_pool code and struct page.  So it won't affect any drivers that already

> >>> use the feature.  

> >>

> >> This patchset has embeded a signature field in "struct page", and xdp_mem_info

> >> is stored in page_private(), which seems not considering the case for associating

> >> the page pool with "struct page" directly yet?   

> > 

> > Correct

> >   

> >> Is the page pool also stored in

> >> page_private() and a different signature is used to indicate that?  

> > 

> > No only struct xdp_mem_info as you mentioned before

> >   

> >>

> >> I am not saying we have to do it in this patchset, but we have to consider it

> >> while we are adding new signature field to "struct page", right?  

> > 

> > We won't need a new signature.  The signature in both cases is there to 

> > guarantee the page you are trying to recycle was indeed allocated by page_pool.

> > 

> > Basically we got two design choices here: 

> > - We store the page_pool ptr address directly in page->private and then,

> >   we call into page_pool APIs directly to do the recycling.

> >   That would eliminate the lookup through xdp_mem_info and the

> >   XDP helpers to locate page pool pointer (through __xdp_return).

> > - You store the xdp_mem_info on page_private.  In that case you need to go

> >   through __xdp_return()  to locate the page_pool pointer. Although we might

> >   loose some performance that would allow us to recycle additional memory types

> >   and not only MEM_TYPE_PAGE_POOL (in case we ever need it).  

>

> So the signature field  in "struct page" is used to only indicate a page is

> from a page pool, then how do we tell the content of page_private() if both of

> the above choices are needed, we might still need an extra indicator to tell

> page_private() is page_pool ptr or xdp_mem_info.


The signature field in "struct page" and "xdp_mem_info" is a double
construct that was introduced in this patchset.  AFAIK Matteo took the
idea from Jonathan's patchset.  I'm not convinced we need both, maybe
later we do (when someone introduce a new mem model ala NetGPU).

I think Jonathan's use-case was NetGPU[1] (which got shutdown due to
Nvidia[2] being involved which I think was unfair).  The general idea
behind NetGPU makes sense to me, to allow packet headers to live in
first page, and second page belongs to hardware.  This implies that an
SKB can can point to two different pages with different memory types,
which need to be handled correctly when freeing the SKB and the pages it
points to.  Thus, placing (xdp_)mem_info in SKB is wrong as it implies
all pages belong the same mem_info.type.

The point is, when designing this I want us to think about how our
design can handle other memory models than just page_pool.

In this patchset design, we use a single bit in SKB to indicate that
the pages pointed comes from another memory model, in this case
page_pool is the only user of this bit.  The remaining info about the
memory model (page_pool) is stored in struct-page, which we look at
when freeing the pages that the SKB points to (that is at the layer
above the MM-calls that would free the page for real).


[1] https://linuxplumbersconf.org/event/7/contributions/670/
[2] https://lwn.net/Articles/827596/

> It seems storing the page pool ptr in page_private() is clear for recyclable

> page from a recyclable skb use case, and the use case for storing xdp_mem_info

> in page_private() is unclear yet? As XDP seems to have the xdp_mem_info in the

> "struct xdp_frame", so it does not need the xdp_mem_info from page_private().

> 

> If the above is true, what does not really makes sense to me here is that:

> why do we first implement a unclear use case for storing xdp_mem_info in

> page_private(), why not implement the clear use case for storing page pool ptr

> in page_private() first?


I'm actually not against storing page_pool object ptr directly in
struct-page.  It is a nice optimization.  Perhaps we should implement
this optimization outside this patchset first, and let __xdp_return()
for XDP-redirected packets also take advantage to this optimization?

Then it would feel more natural to also used this optimization in this
patchset, right?

> > 

> > 

> > I think both choices are sane.  What I am trying to explain here, is

> > regardless of what we choose now, we can change it in the future without

> > affecting the API consumers at all.  What will change internally is the way we

> > lookup the page pool pointer we are trying to recycle.  

> 

> It seems the below API need changing?

> +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct page *page,

> +					struct xdp_mem_info *mem)


I don't think we need to change this API, to support future memory
models.  Notice that xdp_mem_info have a 'type' member.

Naming in Computer Science is a hard problem ;-). Something that seems
to confuse a lot of people is the naming of the struct "xdp_mem_info".  
Maybe we should have named it "mem_info" instead or "net_mem_info", as
it doesn't indicate that the device is running XDP.

I see XDP as the RX-layer before the network stack, that helps drivers
to support different memory models, also for handling normal packets
that doesn't get process by XDP, and the drivers doesn't even need to
support XDP to use the "xdp_mem_info" type.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

struct xdp_mem_info {
	u32 type; /* enum xdp_mem_type, but known size type */
	u32 id;
};

enum xdp_mem_type {
	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
	MEM_TYPE_PAGE_POOL,
	MEM_TYPE_XSK_BUFF_POOL,
	MEM_TYPE_MAX,
};

Christoph Hellwig May 7, 2021, 11:31 a.m. UTC | #13

On Fri, May 07, 2021 at 12:19:53PM +0200, Jesper Dangaard Brouer wrote:
> Nvidia[2] being involved which I think was unfair).  The general idea

> behind NetGPU makes sense to me,

Sorry, but that is utter bullshit.  It was rejected because it did
nothing but injecting hooks for an out of tree module while ignoring the
existing kernel infrastructure for much of what it tries to archive.

Shay Agroskin May 9, 2021, 5:11 a.m. UTC | #14

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Fri, 7 May 2021 16:28:30 +0800

> Yunsheng Lin <linyunsheng@huawei.com> wrote:

>

>> On 2021/5/7 15:06, Ilias Apalodimas wrote:

>> > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:  

>> >> On 2021/5/6 20:58, Ilias Apalodimas wrote:  

>> >>>>>>  

>> >>>>>

> ...

>> > 

>> > 

>> > I think both choices are sane.  What I am trying to explain 

>> > here, is

>> > regardless of what we choose now, we can change it in the 

>> > future without

>> > affecting the API consumers at all.  What will change 

>> > internally is the way we

>> > lookup the page pool pointer we are trying to recycle.  

>> 

>> It seems the below API need changing?

>> +static inline void skb_mark_for_recycle(struct sk_buff *skb, 

>> struct page *page,

>> +					struct xdp_mem_info *mem)

>

> I don't think we need to change this API, to support future 

> memory

> models.  Notice that xdp_mem_info have a 'type' member.


Hi,
Providing that we will (possibly as a future optimization) store 
the pointer to the page pool in struct page instead of strcut 
xdp_mem_info, passing
xdp_mem_info * instead of struct page_pool * would mean that for 
every packet we'll need to call
             xa = rhashtable_lookup(mem_id_ht, &mem->id, 
             mem_id_rht_params);
             xa->page_pool;

which might pressure the Dcache to fetch a pointer that might be 
present already in cache as part of driver's data-structures.

I tend to agree with Yunsheng that it makes more sense to adjust 
the API for the clear use-case now rather than using xdp_mem_info 
indirection. It seems to me like
the page signature provides the same information anyway and allows 
to support different memory types.

Shay

>

> Naming in Computer Science is a hard problem ;-). Something that 

> seems

> to confuse a lot of people is the naming of the struct 

> "xdp_mem_info".  

> Maybe we should have named it "mem_info" instead or 

> "net_mem_info", as

> it doesn't indicate that the device is running XDP.

>

> I see XDP as the RX-layer before the network stack, that helps 

> drivers

> to support different memory models, also for handling normal 

> packets

> that doesn't get process by XDP, and the drivers doesn't even 

> need to

> support XDP to use the "xdp_mem_info" type.

Yunsheng Lin May 10, 2021, 2:20 a.m. UTC | #15

On 2021/5/7 18:19, Jesper Dangaard Brouer wrote:
> On Fri, 7 May 2021 16:28:30 +0800

> Yunsheng Lin <linyunsheng@huawei.com> wrote:

> 

[...]

> 

> I'm actually not against storing page_pool object ptr directly in

> struct-page.  It is a nice optimization.  Perhaps we should implement

> this optimization outside this patchset first, and let __xdp_return()

> for XDP-redirected packets also take advantage to this optimization?

> 

> Then it would feel more natural to also used this optimization in this

> patchset, right?

Yes, it would be good if XDP can take advantage of this optimization too.

Then it seems we can remove the "mem_id_ht"? if we want to support different
type of page pool in the future, the type field is in the page pool too
when page_pool object ptr directly in struct-page.

>

Ilias Apalodimas May 11, 2021, 8:41 a.m. UTC | #16

Hi Shay,

On Sun, May 09, 2021 at 08:11:35AM +0300, Shay Agroskin wrote:
> 

> Jesper Dangaard Brouer <brouer@redhat.com> writes:

> 

> > On Fri, 7 May 2021 16:28:30 +0800

> > Yunsheng Lin <linyunsheng@huawei.com> wrote:

> > 

> > > On 2021/5/7 15:06, Ilias Apalodimas wrote:

> > > > On Fri, May 07, 2021 at 11:23:28AM +0800, Yunsheng Lin wrote:  >>

> > > On 2021/5/6 20:58, Ilias Apalodimas wrote:  >>>>>>  >>>>>

> > ...

> > > > > > I think both choices are sane.  What I am trying to explain >

> > > here, is

> > > > regardless of what we choose now, we can change it in the > future

> > > without

> > > > affecting the API consumers at all.  What will change > internally

> > > is the way we

> > > > lookup the page pool pointer we are trying to recycle.

> > > 

> > > It seems the below API need changing?

> > > +static inline void skb_mark_for_recycle(struct sk_buff *skb, struct

> > > page *page,

> > > +					struct xdp_mem_info *mem)

> > 

> > I don't think we need to change this API, to support future memory

> > models.  Notice that xdp_mem_info have a 'type' member.

> 

> Hi,

> Providing that we will (possibly as a future optimization) store the pointer

> to the page pool in struct page instead of strcut xdp_mem_info, passing

> xdp_mem_info * instead of struct page_pool * would mean that for every

> packet we'll need to call

>             xa = rhashtable_lookup(mem_id_ht, &mem->id,

> mem_id_rht_params);

>             xa->page_pool;

> 

> which might pressure the Dcache to fetch a pointer that might be present

> already in cache as part of driver's data-structures.

> 

> I tend to agree with Yunsheng that it makes more sense to adjust the API for

> the clear use-case now rather than using xdp_mem_info indirection. It seems

> to me like

> the page signature provides the same information anyway and allows to

> support different memory types.


We've switched the patches already.  We didn't notice any performance boost
by doing so (tested on a machiattobin), but I agree as well.  As I
explained the only thing that will change if we ever the need the struct
xdp_mem_info in struct page is the internal contract between struct page
and the recycling function, so let's start clean and see if we ever need
that.


Cheers
/Ilias
> 

> Shay

> 

> > 

> > Naming in Computer Science is a hard problem ;-). Something that seems

> > to confuse a lot of people is the naming of the struct "xdp_mem_info".

> > Maybe we should have named it "mem_info" instead or "net_mem_info", as

> > it doesn't indicate that the device is running XDP.

> > 

> > I see XDP as the RX-layer before the network stack, that helps drivers

> > to support different memory models, also for handling normal packets

> > that doesn't get process by XDP, and the drivers doesn't even need to

> > support XDP to use the "xdp_mem_info" type.

>

[net-next,v3,0/5] page_pool: recycle buffers

Message

Comments