[v5,net-next,00/11] skbuff: introduce skbuff_heads bulking and reusing

Message ID 20210211185220.9753-1-alobakin@pm.me
Series skbuff: introduce skbuff_heads bulking and reusing

Message

Alexander Lobakin Feb. 11, 2021, 6:52 p.m. UTC
Currently, all skb allocation paths allocate skbuff_heads one by one
via kmem_cache_alloc().
On the other hand, we have a percpu napi_alloc_cache that stores
skbuff_heads queued up for freeing and flushes them in bulk.

We can use this cache not only for bulk-freeing, but also to obtain
heads for new skbs and avoid unconditional allocations, as well as
for bulk-allocating (like XDP's cpumap code and the veth driver
already do); a sketch of the cache-get helper follows below.
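
The idea is a small per-CPU helper that hands out cached skbuff_heads
and refills the cache with one bulk call when it runs dry. Here is a
simplified sketch for illustration only (struct layout, constant values
and KASAN handling may differ from the actual patches):

#include <linux/kasan.h>
#include <linux/percpu.h>
#include <linux/skbuff.h>
#include <linux/slab.h>

/* illustrative constants; see the patches for the real values */
#define NAPI_SKB_CACHE_SIZE	64
#define NAPI_SKB_CACHE_BULK	16

struct napi_alloc_cache {
	/* page-fragment members omitted for brevity */
	unsigned int skb_count;
	void *skb_cache[NAPI_SKB_CACHE_SIZE];
};

static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);

static struct sk_buff *napi_skb_cache_get(void)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
	struct sk_buff *skb;

	/* refill the per-CPU cache with one bulk call when it is empty */
	if (unlikely(!nc->skb_count))
		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
						      GFP_ATOMIC,
						      NAPI_SKB_CACHE_BULK,
						      nc->skb_cache);
	if (unlikely(!nc->skb_count))
		return NULL;

	/* hand out the most recently cached head (LIFO, cache-warm) */
	skb = nc->skb_cache[--nc->skb_count];
	kasan_unpoison_object_data(skbuff_head_cache, skb);

	return skb;
}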

As this might affect latencies, cache pressure and lots of hardware-
and driver-dependent behaviour, this new feature is mostly optional
and can be enabled via the entry points below (a driver-side usage
sketch follows the list):
 - a new napi_build_skb() function (as a replacement for build_skb());
 - existing {,__}napi_alloc_skb() and napi_get_frags() functions;
 - __alloc_skb() with passing SKB_ALLOC_NAPI in flags.
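
For a driver, opting in is typically a one-line change in the Rx path.
Here is a minimal usage sketch assuming a conventional build_skb()-style
receive handler; the function and variable names are made up for
illustration:

#include <linux/skbuff.h>

/* hypothetical Rx completion helper running in NAPI (softirq) context */
static struct sk_buff *example_rx_build_skb(void *buf, unsigned int len,
					    unsigned int truesize)
{
	struct sk_buff *skb;

	/* Drop-in replacement for build_skb(): the skbuff_head comes from
	 * the per-CPU NAPI cache instead of an unconditional
	 * kmem_cache_alloc().
	 */
	skb = napi_build_skb(buf, truesize);
	if (unlikely(!skb))
		return NULL;

	skb_reserve(skb, NET_SKB_PAD);
	skb_put(skb, len);

	return skb;
}

As the list above notes, __alloc_skb() callers opt in by passing
SKB_ALLOC_NAPI in flags, while {,__}napi_alloc_skb() and
napi_get_frags() users get the cache through the existing calls.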

iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing
VLAN NAT on a 1.2 GHz MIPS board. The boost is likely to be bigger
on more powerful hosts and NICs handling tens of Mpps.

Note on skbuff_heads from remote or pfmemalloc'ed slabs:
 - kmalloc()/kmem_cache_alloc() itself allows allocating memory from
   remote nodes by default in order to defragment their slabs. This
   is controlled by a sysctl, but given that, an skbuff_head from a
   remote node is an acceptable case;
 - The easiest way to check if the slab of skbuff_head is remote or
   pfmemalloc'ed is:

	if (!dev_page_is_reusable(virt_to_head_page(skb)))
		/* drop it */;

   ...*but*, since most slabs are built of compound pages,
   virt_to_head_page() will hit its unlikely branch on every single
   call. This check cost at least 20 Mbps in test scenarios, so it
   seems better _not_ to do it.

Since v4 [3]:
 - rebase on top of net-next and address kernel build robot issue;
 - reorder checks a bit in __alloc_skb() to make new condition even
   more harmless.

Since v3 [2]:
 - make the feature mostly optional, so driver developers could
   decide whether to use it or not (Paolo Abeni).
   This reuses the old flag for __alloc_skb() and introduces
   a new napi_build_skb();
 - reduce bulk-allocation size from 32 to 16 elements (also Paolo).
   This equals the batch size used by XDP's devmap and veth batch
   processing (which have been tested a lot) and should be sane enough;
 - don't waste cycles on explicit in_serving_softirq() check.

Since v2 [1]:
 - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy
   after the changes that pass tiny skb requests to the kmalloc layer);
 - cover the cache with KASAN instrumentation (suggested by Eric
   Dumazet, with help from Dmitry Vyukov);
 - completely drop redundant __kfree_skb_flush() (also Eric);
 - lots of code cleanups;
 - expand the commit message with NUMA and pfmemalloc points (Jakub).

Since v1 [0]:
 - use one unified cache instead of two separate ones to greatly
   simplify the logic and reduce hotpath overhead (Edward Cree);
 - new: also recycle GRO_MERGED_FREE skbs instead of freeing them
   immediately;
 - correct performance numbers after the optimizations and lots of
   testing for different use cases.

[0] https://lore.kernel.org/netdev/20210111182655.12159-1-alobakin@pm.me
[1] https://lore.kernel.org/netdev/20210113133523.39205-1-alobakin@pm.me
[2] https://lore.kernel.org/netdev/20210209204533.327360-1-alobakin@pm.me
[3] https://lore.kernel.org/netdev/20210210162732.80467-1-alobakin@pm.me

Alexander Lobakin (11):
  skbuff: move __alloc_skb() next to the other skb allocation functions
  skbuff: simplify kmalloc_reserve()
  skbuff: make __build_skb_around() return void
  skbuff: simplify __alloc_skb() a bit
  skbuff: use __build_skb_around() in __alloc_skb()
  skbuff: remove __kfree_skb_flush()
  skbuff: move NAPI cache declarations upper in the file
  skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads
  skbuff: allow to optionally use NAPI cache from __alloc_skb()
  skbuff: allow to use NAPI cache from __napi_alloc_skb()
  skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing

 include/linux/skbuff.h |   4 +-
 net/core/dev.c         |  16 +-
 net/core/skbuff.c      | 429 +++++++++++++++++++++++------------------
 3 files changed, 243 insertions(+), 206 deletions(-)

Comments

Alexander Duyck Feb. 12, 2021, 3:18 a.m. UTC | #1
On Thu, Feb 11, 2021 at 11:00 AM Alexander Lobakin <alobakin@pm.me> wrote:
>
> Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get
> an skbuff_head from the NAPI cache instead of inplace allocation
> inside __alloc_skb().
> This implies that the function is called from softirq or BH-off
> context, not for allocating a clone or from a distant node.
>
> Signed-off-by: Alexander Lobakin <alobakin@pm.me>
>
> ---
>
>  net/core/skbuff.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9e1a8ded4acc..a0b457ae87c2 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>         struct sk_buff *skb;
>         u8 *data;
>         bool pfmemalloc;
> +       bool clone;
>
> -       cache = (flags & SKB_ALLOC_FCLONE)
> -               ? skbuff_fclone_cache : skbuff_head_cache;
> +       clone = !!(flags & SKB_ALLOC_FCLONE);

The boolean conversion here is probably unnecessary. I would make
clone an int like flags and work with that. I suspect the compiler is
doing it already, but it is better to be explicit.

> +       cache = clone ? skbuff_fclone_cache : skbuff_head_cache;
>
>         if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
>                 gfp_mask |= __GFP_MEMALLOC;
>
>         /* Get the HEAD */
> -       skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> +       if ((flags & SKB_ALLOC_NAPI) && !clone &&

Rather than having to do two checks you could just check for
SKB_ALLOC_NAPI and SKB_ALLOC_FCLONE in a single check. You could just
do something like:
    if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI)) == SKB_ALLOC_NAPI)

That way you can avoid the extra conditional jumps and can start
computing the flags value sooner.

> +           likely(node == NUMA_NO_NODE || node == numa_mem_id()))
> +               skb = napi_skb_cache_get();
> +       else
> +               skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);
>         if (unlikely(!skb))
>                 return NULL;
>         prefetchw(skb);
> @@ -436,7 +441,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>         __build_skb_around(skb, data, 0);
>         skb->pfmemalloc = pfmemalloc;
>
> -       if (flags & SKB_ALLOC_FCLONE) {
> +       if (clone) {
>                 struct sk_buff_fclones *fclones;
>
>                 fclones = container_of(skb, struct sk_buff_fclones, skb1);
> --
> 2.30.1
>
>

Alexander Lobakin Feb. 13, 2021, 11:53 a.m. UTC | #2
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Thu, 11 Feb 2021 19:18:45 -0800

> On Thu, Feb 11, 2021 at 11:00 AM Alexander Lobakin <alobakin@pm.me> wrote:
> >
> > Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get
> > an skbuff_head from the NAPI cache instead of inplace allocation
> > inside __alloc_skb().
> > This implies that the function is called from softirq or BH-off
> > context, not for allocating a clone or from a distant node.
> >
> > Signed-off-by: Alexander Lobakin <alobakin@pm.me>
> >
> > ---
> >
> >  net/core/skbuff.c | 13 +++++++++----
> >  1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 9e1a8ded4acc..a0b457ae87c2 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >         struct sk_buff *skb;
> >         u8 *data;
> >         bool pfmemalloc;
> > +       bool clone;
> >
> > -       cache = (flags & SKB_ALLOC_FCLONE)
> > -               ? skbuff_fclone_cache : skbuff_head_cache;
> > +       clone = !!(flags & SKB_ALLOC_FCLONE);
>
> The boolean conversion here is probably unnecessary. I would make
> clone an int like flags and work with that. I suspect the compiler is
> doing it already, but it is better to be explicit.
>
> > +       cache = clone ? skbuff_fclone_cache : skbuff_head_cache;
> >
> >         if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
> >                 gfp_mask |= __GFP_MEMALLOC;
> >
> >         /* Get the HEAD */
> > -       skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> > +       if ((flags & SKB_ALLOC_NAPI) && !clone &&
>
> Rather than having to do two checks you could just check for
> SKB_ALLOC_NAPI and SKB_ALLOC_FCLONE in a single check. You could just
> do something like:
>     if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI)) == SKB_ALLOC_NAPI)
>
> That way you can avoid the extra conditional jumps and can start
> computing the flags value sooner.

I thought about a combined check for the two flags yesterday, so yeah,
that's probably better than the current version.

> > +           likely(node == NUMA_NO_NODE || node == numa_mem_id()))
> > +               skb = napi_skb_cache_get();
> > +       else
> > +               skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);
> >         if (unlikely(!skb))
> >                 return NULL;
> >         prefetchw(skb);
> > @@ -436,7 +441,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >         __build_skb_around(skb, data, 0);
> >         skb->pfmemalloc = pfmemalloc;
> >
> > -       if (flags & SKB_ALLOC_FCLONE) {
> > +       if (clone) {
> >                 struct sk_buff_fclones *fclones;
> >
> >                 fclones = container_of(skb, struct sk_buff_fclones, skb1);
> > --
> > 2.30.1

Thanks,
Al