Message ID | 20210409223801.104657-3-mcroce@linux.microsoft.com
---|---
State | New
Series | page_pool: recycle buffers
On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > This is needed by the page_pool to avoid recycling a page not allocated > via page_pool. Is the PageType mechanism more appropriate to your needs? It wouldn't be if you use page->_mapcount (ie mapping it to userspace). > Signed-off-by: Matteo Croce <mcroce@microsoft.com> > --- > include/linux/mm_types.h | 1 + > include/net/page_pool.h | 2 ++ > net/core/page_pool.c | 4 ++++ > 3 files changed, 7 insertions(+) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 6613b26a8894..ef2d0d5f62e4 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -101,6 +101,7 @@ struct page { > * 32-bit architectures. > */ > dma_addr_t dma_addr; > + unsigned long signature; > }; > struct { /* slab, slob and slub */ > union { > diff --git a/include/net/page_pool.h b/include/net/page_pool.h > index b5b195305346..b30405e84b5e 100644 > --- a/include/net/page_pool.h > +++ b/include/net/page_pool.h > @@ -63,6 +63,8 @@ > */ > #define PP_ALLOC_CACHE_SIZE 128 > #define PP_ALLOC_CACHE_REFILL 64 > +#define PP_SIGNATURE 0x20210303 > + > struct pp_alloc_cache { > u32 count; > void *cache[PP_ALLOC_CACHE_SIZE]; > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index ad8b0707af04..2ae9b554ef98 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, > page_pool_dma_sync_for_device(pool, page, pool->p.max_len); > > skip_dma_map: > + page->signature = PP_SIGNATURE; > + > /* Track how many pages are held 'in-flight' */ > pool->pages_state_hold_cnt++; > > @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page) > DMA_ATTR_SKIP_CPU_SYNC); > page->dma_addr = 0; > skip_dma_unmap: > + page->signature = 0; > + > /* This may be the last page returned, releasing the pool, so > * it is not safe to reference pool afterwards. > */ > -- > 2.30.2 >
Hi Matthew, On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote: > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > > This is needed by the page_pool to avoid recycling a page not allocated > > via page_pool. > > Is the PageType mechanism more appropriate to your needs? It wouldn't > be if you use page->_mapcount (ie mapping it to userspace). Interesting! Please keep in mind this was written ~2018 and was stale on my branches for quite some time. So back then I did try to use PageType, but had no free bits. Looking at it again now, though, it has been cleaned up, so yes, I think this can be much, much cleaner. Should we go and define a new PG_pagepool? Thanks! /Ilias
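A minimal sketch of what the PageType route could look like, assuming a free page_type bit actually exists in include/linux/page-flags.h; the PG_page_pool bit value and the pp_*() wrapper names below are hypothetical, not existing kernel definitions:

/* Hypothetical addition to include/linux/page-flags.h, assuming the
 * 0x00001000 page_type bit is still free.
 */
#define PG_page_pool	0x00001000

PAGE_TYPE_OPS(Pool, page_pool)	/* generates PagePool()/__SetPagePool()/__ClearPagePool() */

/* What the two hunks of this patch would turn into if PageType were used
 * instead of a new page->signature field.
 */
static inline void pp_mark_page(struct page *page)	/* on allocation */
{
	__SetPagePool(page);
}

static inline void pp_unmark_page(struct page *page)	/* on release */
{
	__ClearPagePool(page);
}

static inline bool pp_page_is_ours(struct page *page)	/* recycling check */
{
	return PagePool(page);
}

The caveat Matthew raises still applies: page_type shares its word with page->_mapcount, so this scheme only works for pages that are never mapped to userspace, which is exactly what the TCP RX zerocopy discussion below runs into.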
On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > > Hi Matthew > > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote: > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > > > This is needed by the page_pool to avoid recycling a page not allocated > > > via page_pool. > > > > Is the PageType mechanism more appropriate to your needs? It wouldn't > > be if you use page->_mapcount (ie mapping it to userspace). > > Interesting! > Please keep in mind this was written ~2018 and was stale on my branches for > quite some time. So back then I did try to use PageType, but had not free > bits. Looking at it again though, it's cleaned up. So yes I think this can > be much much cleaner. Should we go and define a new PG_pagepool? > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType can not be used. There is a recent discussion [1] on memcg accounting of TCP RX zerocopy and I am wondering if this work can somehow help in that regard. I will take a look at the series. [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/
Hi Shakeel, On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote: > On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas > <ilias.apalodimas@linaro.org> wrote: > > > > Hi Matthew > > > > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote: > > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > > > > This is needed by the page_pool to avoid recycling a page not allocated > > > > via page_pool. > > > > > > Is the PageType mechanism more appropriate to your needs? It wouldn't > > > be if you use page->_mapcount (ie mapping it to userspace). > > > > Interesting! > > Please keep in mind this was written ~2018 and was stale on my branches for > > quite some time. So back then I did try to use PageType, but had no free > > bits. Looking at it again though, it's cleaned up. So yes I think this can > > be much much cleaner. Should we go and define a new PG_pagepool? > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > can not be used. Yes it can, since it's going to be used as your default allocator for payloads, which might end up on an SKB. So we have to keep the extra field added to struct page for our mark. Matthew had an interesting idea: he suggested keeping it, but changing the magic number so it can't be a kernel address. I'll let him follow up on the details. > > There is a recent discussion [1] on memcg accounting of TCP RX > zerocopy and I am wondering if this work can somehow help in that > regard. I will take a look at the series. > I'll try to have a look at this as well. The idea behind the patchset is to let lower-speed NICs that already use the API gain recycling 'easily'. Using page_pool in a driver comes with a penalty to begin with: allocating pages instead of SKBs makes a measurable difference. By enabling recycling they'll get better performance, since you skip the reallocation/remapping and only have to sync the buffers correctly. > [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/
On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote: > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > can not be used. > > Yes it can, since it's going to be used as your default allocator for > payloads, which might end up on an SKB. > So we have to keep the extra added field on struct page for our mark. > Matthew had an intersting idea. He suggested keeping it, but changing the > magic number, so it can't be a kernel address, but I'll let him follow > up on the details. Sure! So, given the misalignment problem I discovered yesterday [1], we probably want a page_pool page to look like: unsigned long flags; unsigned long pp_magic; unsigned long xmi; unsigned long _pp_mapping_pad; dma_addr_t dma_addr; /* might be one or two words */ The only real restriction here is that pp_magic should not be a valid pointer, and it must have the bottom bit clear. I'd recommend something like: #define PP_MAGIC (0x20 + POISON_POINTER_DELTA) This leaves page->mapping as NULL, so you don't have to worry about clearing it before free. [1] https://lore.kernel.org/linux-mm/20210410024313.GX2531743@casper.infradead.org/
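A rough sketch of the suggested magic word and the ownership test it enables, assuming the struct page layout sketched above; the pp_magic field does not exist in mainline at this point and page_is_page_pool() is a made-up helper name:

#include <linux/poison.h>
#include <linux/mm_types.h>

/* Never a valid kernel pointer (POISON_POINTER_DELTA pushes it outside the
 * kernel address range), and the bottom bit stays clear, presumably so the
 * word can never look like the bit-0-set compound_head encoding of a tail
 * page.
 */
#define PP_MAGIC	(0x20 + POISON_POINTER_DELTA)

static inline bool page_is_page_pool(const struct page *page)
{
	return page->pp_magic == PP_MAGIC;	/* pp_magic: proposed field, see layout above */
}

With _pp_mapping_pad keeping page->mapping NULL, the free path only has to clear pp_magic itself before handing the page back to the page allocator.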
On Sat, 10 Apr 2021 20:39:55 +0100 Matthew Wilcox <willy@infradead.org> wrote: > On Sat, Apr 10, 2021 at 09:27:31PM +0300, Ilias Apalodimas wrote: > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > can not be used. > > > > Yes it can, since it's going to be used as your default allocator for > > payloads, which might end up on an SKB. > > So we have to keep the extra added field on struct page for our mark. > > Matthew had an intersting idea. He suggested keeping it, but changing the > > magic number, so it can't be a kernel address, but I'll let him follow > > up on the details. > > Sure! So, given the misalignment problem I discovered yesterday [1], > we probably want a page_pool page to look like: > > unsigned long flags; > unsigned long pp_magic; > unsigned long xmi; > unsigned long _pp_mapping_pad; > dma_addr_t dma_addr; /* might be one or two words */ > > The only real restriction here is that pp_magic should not be a valid > pointer, and it must have the bottom bit clear. I'd recommend something > like: > > #define PP_MAGIC (0x20 + POISON_POINTER_DELTA) > > This leaves page->mapping as NULL, so you don't have to worry about > clearing it before free. > > [1] https://lore.kernel.org/linux-mm/20210410024313.GX2531743@casper.infradead.org/ I didn't see this, before asking[2] for explaining your intent. I still worry about page->index, see [2]. [2] https://lore.kernel.org/netdev/20210411114307.5087f958@carbon/ -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
On Sat, 10 Apr 2021 21:27:31 +0300 Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote: > > > On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas > > <ilias.apalodimas@linaro.org> wrote: > > > > > > Hi Matthew > > > > > > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote: > > > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote: > > > > > This is needed by the page_pool to avoid recycling a page not allocated > > > > > via page_pool. > > > > > > > > Is the PageType mechanism more appropriate to your needs? It wouldn't > > > > be if you use page->_mapcount (ie mapping it to userspace). > > > > > > Interesting! > > > Please keep in mind this was written ~2018 and was stale on my branches for > > > quite some time. So back then I did try to use PageType, but had no free > > > bits. Looking at it again though, it's cleaned up. So yes I think this can > > > be much much cleaner. Should we go and define a new PG_pagepool? > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > can not be used. > > Yes it can, since it's going to be used as your default allocator for > payloads, which might end up on an SKB. I'm not sure we want or should "allow" page_pool to be used for TCP RX zerocopy, for several reasons. (1) This implies mapping these pages to userspace, which AFAIK means using the page->mapping and page->index members (right?). (2) It feels wrong (security-wise) to keep the DMA-mapping (for the device) and also map this page into userspace. (3) The page_pool is optimized for the refcnt==1 case, and AFAIK TCP-RX zerocopy will bump the refcnt, which means the page_pool will not recycle the page when it sees the elevated refcnt (it will instead release its DMA-mapping). (4) I remember vaguely that this code path (for TCP RX zerocopy) uses page->private for tricks. And our patch [3/5] uses page->private for storing xdp_mem_info. IMHO when the SKB travels into this TCP RX zerocopy code path, we should call page_pool_release_page() to release its DMA-mapping. > > [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/ -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer <brouer@redhat.com> wrote: > [...] > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > can not be used. > > > > Yes it can, since it's going to be used as your default allocator for > > payloads, which might end up on an SKB. > > I'm not sure we want or should "allow" page_pool be used for TCP RX > zerocopy. > For several reasons. > > (1) This implies mapping these pages page to userspace, which AFAIK > means using page->mapping and page->index members (right?). > No, only page->_mapcount is used. > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > device) and also map this page into userspace. > I think this is already the case i.e pages still DMA-mapped and also mapped into userspace. > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > zerocopy will bump the refcnt, which means the page_pool will not > recycle the page when it see the elevated refcnt (it will instead > release its DMA-mapping). Yes this is right but the userspace might have already consumed and unmapped the page before the driver considers to recycle the page. > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > page->private for tricks. And our patch [3/5] use page->private for > storing xdp_mem_info. > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > call page_pool_release_page() to release its DMA-mapping. > I will let TCP RX zerocopy experts respond to this but from my high level code inspection, I didn't see page->private usage.
On Wed, Apr 14, 2021 at 10:09 PM Shakeel Butt <shakeelb@google.com> wrote: > > I will let TCP RX zerocopy experts respond to this but from my high > level code inspection, I didn't see page->private usage. Indeed, we do not use page->private, since we do not own the page(s).
On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote: > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > > [...] > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > can not be used. > > > > > > Yes it can, since it's going to be used as your default allocator for > > > payloads, which might end up on an SKB. > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > zerocopy. > > For several reasons. > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > means using page->mapping and page->index members (right?). > > > > No, only page->_mapcount is used. > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to adopt the recycling mechanism we should try preserving the current functionality of the network stack. The question is how does it work with the current drivers that already have an internal page recycling mechanism. > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > device) and also map this page into userspace. > > > > I think this is already the case i.e pages still DMA-mapped and also > mapped into userspace. > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > zerocopy will bump the refcnt, which means the page_pool will not > > recycle the page when it see the elevated refcnt (it will instead > > release its DMA-mapping). > > Yes this is right but the userspace might have already consumed and > unmapped the page before the driver considers to recycle the page. Same question here. I'll have a closer look in a few days and make sure we are not breaking anything wrt zerocopy. > > > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > > page->private for tricks. And our patch [3/5] use page->private for > > storing xdp_mem_info. > > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > > call page_pool_release_page() to release its DMA-mapping. > > > > I will let TCP RX zerocopy experts respond to this but from my high > level code inspection, I didn't see page->private usage. Shakeel are you aware of any 'easy' way I can have rx zerocopy running? Thanks! /Ilias
On Wed, 14 Apr 2021 13:09:47 -0700 Shakeel Butt <shakeelb@google.com> wrote: > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > > [...] > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > can not be used. > > > > > > Yes it can, since it's going to be used as your default allocator for > > > payloads, which might end up on an SKB. > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > zerocopy. > > For several reasons. > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > means using page->mapping and page->index members (right?). > > > > No, only page->_mapcount is used. Good to know. I will admit that I don't fully understand the usage of page->mapping and page->index members. > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > device) and also map this page into userspace. > > > > I think this is already the case i.e pages still DMA-mapped and also > mapped into userspace. True, other drivers are doing the same. > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > zerocopy will bump the refcnt, which means the page_pool will not > > recycle the page when it see the elevated refcnt (it will instead > > release its DMA-mapping). > > Yes this is right but the userspace might have already consumed and > unmapped the page before the driver considers to recycle the page. That is a good point. So, there is a race window where it is possible to gain recycling. It seems my page_pool co-maintainer Ilias is interested in taking up the challenge to get this working with TCP RX zerocopy. So, lets see how this is doable. > > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > > page->private for tricks. And our patch [3/5] use page->private for > > storing xdp_mem_info. > > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > > call page_pool_release_page() to release its DMA-mapping. > > > > I will let TCP RX zerocopy experts respond to this but from my high > level code inspection, I didn't see page->private usage. I trust when Eric says page->private isn't used in this code path. So, it might actually be possible :-) I will challenge Ilias and Matteo to pull this off. (But I know that both of them are busy for personal reasons, so be patient with them). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
On Mon, Apr 19, 2021 at 01:22:04PM +0200, Jesper Dangaard Brouer wrote: > On Wed, 14 Apr 2021 13:09:47 -0700 > Shakeel Butt <shakeelb@google.com> wrote: > > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > > <brouer@redhat.com> wrote: > > > > > [...] > > > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > > can not be used. > > > > > > > > Yes it can, since it's going to be used as your default allocator for > > > > payloads, which might end up on an SKB. > > > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > > zerocopy. > > > For several reasons. > > > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > > means using page->mapping and page->index members (right?). > > > > > > > No, only page->_mapcount is used. > > Good to know. > I will admit that I don't fully understand the usage of page->mapping > and page->index members. That's fair. It's not well-documented, and it's complicated. For a page mapped into userspace, page->mapping is one of: - NULL - A pointer to a file's address_space - A pointer to an anonymous page's anon_vma If a page isn't mapped into userspace, you can use the space in page->mapping for anything you like (eg slab uses it) page->index is only used for indicating pfmemalloc today (and I want to move that indicator). I think it can also be used to merge VMAs (if some other conditions are also true), but failing to merge VMAs isn't a big deal for this kind of situation. > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > > device) and also map this page into userspace. > > > > I think this is already the case i.e pages still DMA-mapped and also > > mapped into userspace. > > True, other drivers are doing the same. And the contents of this page already came from that device ... if it wanted to write bad data, it could already have done so. > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > > zerocopy will bump the refcnt, which means the page_pool will not > > > recycle the page when it see the elevated refcnt (it will instead > > > release its DMA-mapping). > > > > Yes this is right but the userspace might have already consumed and > > unmapped the page before the driver considers to recycle the page. > > That is a good point. So, there is a race window where it is possible > to gain recycling. > > It seems my page_pool co-maintainer Ilias is interested in taking up the > challenge to get this working with TCP RX zerocopy. So, lets see how > this is doable. You could also check page_ref_count() - page_mapcount() instead of just checking page_ref_count(). Assuming mapping/unmapping can't race with recycling?
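A sketch of the check Matthew floats at the end; the helper name is made up, and whether the subtraction is safe against a racing unmap is exactly the open question:

#include <linux/mm.h>

/* Treat the page as recyclable when the only reference not explained by
 * userspace mappings is the one the pool/driver itself holds.
 */
static inline bool pp_only_mapped_refs_left(struct page *page)
{
	return page_ref_count(page) - page_mapcount(page) == 1;
}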
On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > > On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote: > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > > <brouer@redhat.com> wrote: > > > > > [...] > > > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > > can not be used. > > > > > > > > Yes it can, since it's going to be used as your default allocator for > > > > payloads, which might end up on an SKB. > > > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > > zerocopy. > > > For several reasons. > > > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > > means using page->mapping and page->index members (right?). > > > > > > > No, only page->_mapcount is used. > > > > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to > adopt the recycling mechanism we should try preserving the current > functionality of the network stack. > > The question is how does it work with the current drivers that already have an > internal page recycling mechanism. > I think the current drivers check page_ref_count(page) to decide to reuse (or not) the already allocated pages. Some examples from the drivers: drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page() drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page() drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get() > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > > device) and also map this page into userspace. > > > > > > > I think this is already the case i.e pages still DMA-mapped and also > > mapped into userspace. > > > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > > zerocopy will bump the refcnt, which means the page_pool will not > > > recycle the page when it see the elevated refcnt (it will instead > > > release its DMA-mapping). > > > > Yes this is right but the userspace might have already consumed and > > unmapped the page before the driver considers to recycle the page. > > Same question here. I'll have a closer look in a few days and make sure we are > not breaking anything wrt zerocopy. > Pages mapped into the userspace have their refcnt elevated, so the page_ref_count() check by the drivers indicates to not reuse such pages. > > > > > > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > > > page->private for tricks. And our patch [3/5] use page->private for > > > storing xdp_mem_info. > > > > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > > > call page_pool_release_page() to release its DMA-mapping. > > > > > > > I will let TCP RX zerocopy experts respond to this but from my high > > level code inspection, I didn't see page->private usage. > > Shakeel are you aware of any 'easy' way I can have rx zerocopy running? > I would recommend tools/testing/selftests/net/tcp_mmap.c.
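For context, the reuse test in those drivers boils down to something like the sketch below; it is heavily simplified (the real ixgbe/igb/mlx5 helpers also track a local pagecnt_bias, so the exact comparison differs) and the function name is made up:

#include <linux/mm.h>

static bool rx_page_can_be_reused(struct page *page)
{
	/* Pages taken from the emergency reserves must go back to the page
	 * allocator rather than being reused for RX.
	 */
	if (page_is_pfmemalloc(page))
		return false;

	/* Only the driver's own reference is left: no in-flight SKB and no
	 * userspace mapping (e.g. TCP RX zerocopy) still holds the page.
	 */
	return page_ref_count(page) == 1;
}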
Hi Shakeel, On Mon, Apr 19, 2021 at 07:57:03AM -0700, Shakeel Butt wrote: > On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas > <ilias.apalodimas@linaro.org> wrote: > > > > On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote: > > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer > > > <brouer@redhat.com> wrote: > > > > > > > [...] > > > > > > > > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType > > > > > > can not be used. > > > > > > > > > > Yes it can, since it's going to be used as your default allocator for > > > > > payloads, which might end up on an SKB. > > > > > > > > I'm not sure we want or should "allow" page_pool be used for TCP RX > > > > zerocopy. > > > > For several reasons. > > > > > > > > (1) This implies mapping these pages page to userspace, which AFAIK > > > > means using page->mapping and page->index members (right?). > > > > > > > > > > No, only page->_mapcount is used. > > > > > > > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to > > adopt the recycling mechanism we should try preserving the current > > functionality of the network stack. > > > > The question is how does it work with the current drivers that already have an > > internal page recycling mechanism. > > > > I think the current drivers check page_ref_count(page) to decide to > reuse (or not) the already allocated pages. > > Some examples from the drivers: > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page() > drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page() > drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get() > Yes, that's how internal recycling is done in drivers. As Jesper mentioned the refcnt of the page is 1 for the page_pool owned pages and that's how we decide what to do with the page. > > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the > > > > device) and also map this page into userspace. > > > > > > > > > > I think this is already the case i.e pages still DMA-mapped and also > > > mapped into userspace. > > > > > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > > > zerocopy will bump the refcnt, which means the page_pool will not > > > > recycle the page when it see the elevated refcnt (it will instead > > > > release its DMA-mapping). > > > > > > Yes this is right but the userspace might have already consumed and > > > unmapped the page before the driver considers to recycle the page. > > > > Same question here. I'll have a closer look in a few days and make sure we are > > not breaking anything wrt zerocopy. > > > > Pages mapped into the userspace have their refcnt elevated, so the > page_ref_count() check by the drivers indicates to not reuse such > pages. > When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch() which will end up doing a get_page(). What you are saying is that once the zerocopy is done though, skb_release_data() won't be called, but instead put_page() will be? If that's the case then we are indeed leaking DMA mappings and memory. That sounds weird though, since the refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who eventually frees the page? If kfree_skb() (or any wrapper that calls skb_release_data()) is called eventually, we'll end up properly recycling the page into our pool. > > > > > > > > > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses > > > > page->private for tricks. 
And our patch [3/5] use page->private for > > > > storing xdp_mem_info. > > > > > > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should > > > > call page_pool_release_page() to release its DMA-mapping. > > > > > > > > > > I will let TCP RX zerocopy experts respond to this but from my high > > > level code inspection, I didn't see page->private usage. > > > > Shakeel are you aware of any 'easy' way I can have rx zerocopy running? > > > > I would recommend tools/testing/selftests/net/tcp_mmap.c. Ok, thanks I'll have a look. Cheers /Ilias
On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > [...] > > Pages mapped into the userspace have their refcnt elevated, so the > > page_ref_count() check by the drivers indicates to not reuse such > > pages. > > > > When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch() > which will end up doing a get_page(). > What you are saying is that once the zerocopy is done though, skb_release_data() > won't be called, but instead put_page() will be? If that's the case then we are > indeed leaking DMA mappings and memory. That sounds weird though, since the > refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who > eventually frees the page? > If kfree_skb() (or any wrapper that calls skb_release_data()) is called > eventually, we'll end up properly recycling the page into our pool. > From what I understand (Eric, please correct me if I'm wrong) for simple cases there are 3 page references taken. One by the driver, second by skb and third by page table. In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one page ref through insert_page_into_pte_locked(). However before returning from tcp_zerocopy_receive(), the skb references are dropped through tcp_recv_skb(). So, whenever the user unmaps the page and drops the page ref only then that page can be reused by the driver. In my understanding, for zerocopy rx the skb_release_data() is called on the pages while they are still mapped into the userspace. So, skb_release_data() might not be the right place to recycle the page for zerocopy. The email chain at [1] has some discussion on how to bundle the recycling of pages with their lifetime. [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/
On Mon, Apr 19, 2021 at 09:21:55AM -0700, Shakeel Butt wrote: > On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas > <ilias.apalodimas@linaro.org> wrote: > > > [...] > > > Pages mapped into the userspace have their refcnt elevated, so the > > > page_ref_count() check by the drivers indicates to not reuse such > > > pages. > > > > > > > When tcp_zerocopy_receive() is invoked it will call tcp_zerocopy_vm_insert_batch() > > which will end up doing a get_page(). > > What you are saying is that once the zerocopy is done though, skb_release_data() > > won't be called, but instead put_page() will be? If that's the case then we are > > indeed leaking DMA mappings and memory. That sounds weird though, since the > > refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who > > eventually frees the page? > > If kfree_skb() (or any wrapper that calls skb_release_data()) is called > > eventually, we'll end up properly recycling the page into our pool. > > > > From what I understand (Eric, please correct me if I'm wrong) for > simple cases there are 3 page references taken. One by the driver, > second by skb and third by page table. > > In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one > page ref through insert_page_into_pte_locked(). However before > returning from tcp_zerocopy_receive(), the skb references are dropped > through tcp_recv_skb(). So, whenever the user unmaps the page and > drops the page ref only then that page can be reused by the driver. > > In my understanding, for zerocopy rx the skb_release_data() is called > on the pages while they are still mapped into the userspace. So, > skb_release_data() might not be the right place to recycle the page > for zerocopy. The email chain at [1] has some discussion on how to > bundle the recycling of pages with their lifetime. > > [1] https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.kdev@gmail.com/ Ah right, you mentioned the same email before and I completely forgot about it! In the past we had thoughts of 'stealing' the page on put_page instead of skb_release_data(). We were afraid that this would cause a measurable performance hit, so we tried to limit it within the skb lifecycle. However I don't think this will be a problem. Assuming we are right here and skb_release_data() is called while the userspace holds an extra reference from the mapping here's what will happen: skb_release_data() -> skb_free_head() -> page_pool_return_skb_page() -> set_page_private() -> xdp_return_skb_frame() -> __xdp_return() -> page_pool_put_full_page() -> page_pool_put_page() -> __page_pool_put_page() When we call __page_pool_put_page(), the refcnt will be != 1 (because a user mapping is still active), so we won't try to recycle it. Instead we'll remove the DMA mappings and decrease the refcnt. So although the recycling won't 'work', nothing bad will happen (famous last words). In any case, I'll double check with the test you pointed out before v4. Thanks! /Ilias
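A condensed sketch of the gate at the end of that call chain (not the verbatim net/core/page_pool.c code), showing why an outstanding zerocopy mapping only disables recycling rather than breaking anything:

#include <linux/mm.h>
#include <net/page_pool.h>

static struct page *pp_put_page_sketch(struct page_pool *pool,
				       struct page *page)
{
	if (page_ref_count(page) == 1) {
		/* Sole owner: keep the DMA mapping and hand the page back
		 * to the pool so the next allocation can reuse it.
		 */
		return page;
	}

	/* Elevated refcount, e.g. a TCP RX zerocopy mapping still live:
	 * give up on recycling, drop the DMA mapping and our reference.
	 * Whoever holds the last reference frees the page later.
	 */
	page_pool_release_page(pool, page);
	put_page(page);
	return NULL;
}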
Hi Matthew, [...] > > And the contents of this page already came from that device ... if it > wanted to write bad data, it could already have done so. > > > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX > > > > zerocopy will bump the refcnt, which means the page_pool will not > > > > recycle the page when it see the elevated refcnt (it will instead > > > > release its DMA-mapping). > > > > > > Yes this is right but the userspace might have already consumed and > > > unmapped the page before the driver considers to recycle the page. > > > > That is a good point. So, there is a race window where it is possible > > to gain recycling. > > > > It seems my page_pool co-maintainer Ilias is interested in taking up the > > challenge to get this working with TCP RX zerocopy. So, lets see how > > this is doable. > > You could also check page_ref_count() - page_mapcount() instead of > just checking page_ref_count(). Assuming mapping/unmapping can't > race with recycling? > That's not a bad idea. As I explained on my last reply to Shakeel, I don't think the current patch will blow up anywhere. If the page is unmapped prior to kfree_skb() it will be recycled. If it's done in a reverse order, we'll just free the page entirely and will have to re-allocate it. The only thing I need to test is potential races (assuming those can even happen?). Trying to recycle the page outside of kfree_skb() means we'd have to 'steal' the page, during put_page() (or some function that's outside the networking scope). I think this is going to have a measurable performance penalty though not in networking, but in general. In any case, that should be orthogonal to the current patchset. So unless someone feels strongly about it, I'd prefer keeping the current code and trying to enable recycling in the skb zc case, when we have enough users of the API. Thanks /Ilias
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..ef2d0d5f62e4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -101,6 +101,7 @@ struct page {
 				 * 32-bit architectures.
 				 */
 				dma_addr_t dma_addr;
+				unsigned long signature;
 			};
 			struct {	/* slab, slob and slub */
 				union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b5b195305346..b30405e84b5e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -63,6 +63,8 @@
  */
 #define PP_ALLOC_CACHE_SIZE	128
 #define PP_ALLOC_CACHE_REFILL	64
+#define PP_SIGNATURE		0x20210303
+
 struct pp_alloc_cache {
 	u32 count;
 	void *cache[PP_ALLOC_CACHE_SIZE];
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..2ae9b554ef98 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
 
 skip_dma_map:
+	page->signature = PP_SIGNATURE;
+
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
 
@@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
 			     DMA_ATTR_SKIP_CPU_SYNC);
 	page->dma_addr = 0;
 skip_dma_unmap:
+	page->signature = 0;
+
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
 	 */