diff mbox series

[v3,net-next,08/10] skbuff: reuse NAPI skb cache on allocation path (__build_skb())

Message ID 20210209204533.327360-9-alobakin@pm.me
State New

Commit Message

Alexander Lobakin Feb. 9, 2021, 8:48 p.m. UTC
Instead of just bulk-flushing skbuff_heads queued up through
napi_consume_skb() or __kfree_skb_defer(), try to reuse them
on the allocation path.
If the cache is empty on allocation, bulk-allocate the first
half, which is more efficient than per-skb allocation.
If the cache is full on freeing, bulk-wipe the second half.
This also includes custom KASAN poisoning/unpoisoning to be
double sure there are no use-after-free cases.

Functions that got cache fastpath:
 - {,__}build_skb();
 - {,__}netdev_alloc_skb();
 - {,__}napi_alloc_skb().

Note on the "napi_safe" argument:
The NAPI cache should be accessed only from BH-disabled or (better)
NAPI context. To make sure the access is safe, an in_serving_softirq()
check is used.
However, there are plenty of cases when we know for sure that
we're in such a context. This includes: build_skb() (called only
from NIC drivers in NAPI Rx context) and {,__}napi_alloc_skb()
(called from the same place or from kernel network softirq
functions).
We can use that knowledge to avoid unnecessary checks.
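
For illustration only (not part of the diff; variable names are made up):
a NAPI Rx driver can pass napi_safe = true, since it always runs with BH
disabled, while a process-context caller such as netlink must pass false
and falls back to a plain kmem_cache_alloc():

	/* NAPI Rx path of a driver -- known to run in NAPI/BH context,
	 * so skbuff_heads may be taken from the per-CPU cache
	 */
	skb = __build_skb(data, truesize, true);

	/* process context (e.g. netlink), no NAPI guarantees -- the
	 * in_serving_softirq() check makes this bypass the cache
	 */
	skb = __build_skb(data, size, false);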

Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Unified cache part
Suggested-by: Eric Dumazet <edumazet@google.com>   # KASAN poisoning
Suggested-by: Dmitry Vyukov <dvyukov@google.com>   # Help with KASAN
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
---
 include/linux/skbuff.h   |  2 +-
 net/core/skbuff.c        | 61 ++++++++++++++++++++++++++++------------
 net/netlink/af_netlink.c |  2 +-
 3 files changed, 45 insertions(+), 20 deletions(-)

Comments

Paolo Abeni Feb. 10, 2021, 10:21 a.m. UTC | #1
Hello,

I'm sorry for the late feedback, I could not step in before.

Also adding Jesper for awareness, as he introduced the bulk free
infrastructure.

On Tue, 2021-02-09 at 20:48 +0000, Alexander Lobakin wrote:
> @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>   */
>  struct sk_buff *build_skb(void *data, unsigned int frag_size)
>  {
> -	struct sk_buff *skb = __build_skb(data, frag_size);
> +	struct sk_buff *skb = __build_skb(data, frag_size, true);

I must admit I'm a bit scared of this. There are several high speed
device drivers that will move to bulk allocation, and we don't have any
performance figure for them.

In my experience with (low end) MIPS boards, cache miss costs tend to
be much less visible there compared to reasonably recent server H/W,
because the CPU/memory access time difference is much lower.

When moving to higher end H/W the performance gain you measured could
be completely countered by less optimal cache usage.

I also fear latency spikes - I'm unsure whether a 32-skb allocation vs a
single skb would be visible e.g. in a round-robin test. Generally
speaking, bulk allocating 32 skbs looks like a bit too much. IIRC, when
Edward added listification to GRO, he did several measurements with
different list sizes and found 8 to be the optimal value (for the tested
workload). Above that number the list became too big and the pressure
on the cache outweighed the bulking benefits.

Perhaps giving the device drivers the ability to opt-in on this infra
via a new helper - as done back then with napi_consume_skb() - would
make this change safer?
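
Something like the following is what I have in mind - just a sketch, and
the napi_build_skb() name is only a suggestion (head_frag/pfmemalloc
handling as in build_skb() omitted for brevity):

	/* opt-in variant, analogous to napi_consume_skb(); callers must
	 * guarantee they run in NAPI/BH-disabled context
	 */
	struct sk_buff *napi_build_skb(void *data, unsigned int frag_size)
	{
		return __build_skb(data, frag_size, true);
	}

while build_skb() itself would keep passing false, so unconverted
callers see no behavioural change.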

> @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
>  	kfree_skbmem(skb);
>  }
> 
> -static inline void _kfree_skb_defer(struct sk_buff *skb)
> +static void napi_skb_cache_put(struct sk_buff *skb)
>  {
>  	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> +	u32 i;
> 
>  	/* drop skb->head and call any destructors for packet */
>  	skb_release_all(skb);
> 
> -	/* record skb to CPU local list */
> +	kasan_poison_object_data(skbuff_head_cache, skb);
>  	nc->skb_cache[nc->skb_count++] = skb;
> 
> -#ifdef CONFIG_SLUB
> -	/* SLUB writes into objects when freeing */
> -	prefetchw(skb);
> -#endif

It looks like this chunk has been lost. Is that intentional?

Thanks!

Paolo
Alexander Lobakin Feb. 10, 2021, 12:25 p.m. UTC | #2
From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 10 Feb 2021 11:21:06 +0100

> Hello,


Hi!
 
> I'm sorry for the late feedback, I could not step in before.
> 
> Also adding Jesper for awareness, as he introduced the bulk free
> infrastructure.
> 
> On Tue, 2021-02-09 at 20:48 +0000, Alexander Lobakin wrote:
> > @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> >   */
> >  struct sk_buff *build_skb(void *data, unsigned int frag_size)
> >  {
> > -	struct sk_buff *skb = __build_skb(data, frag_size);
> > +	struct sk_buff *skb = __build_skb(data, frag_size, true);
> 
> I must admit I'm a bit scared of this. There are several high speed
> device drivers that will move to bulk allocation, and we don't have any
> performance figure for them.
> 
> In my experience with (low end) MIPS boards, cache miss costs tend to
> be much less visible there compared to reasonably recent server H/W,
> because the CPU/memory access time difference is much lower.
> 
> When moving to higher end H/W the performance gain you measured could
> be completely countered by less optimal cache usage.
> 
> I also fear latency spikes - I'm unsure whether a 32-skb allocation vs a
> single skb would be visible e.g. in a round-robin test. Generally
> speaking, bulk allocating 32 skbs looks like a bit too much. IIRC, when
> Edward added listification to GRO, he did several measurements with
> different list sizes and found 8 to be the optimal value (for the tested
> workload). Above that number the list became too big and the pressure
> on the cache outweighed the bulking benefits.

I can change the logic so it would allocate only the first 8.
I think I've already seen this batch value somewhere in the XDP code,
so it might be a balanced one.

Regarding bulk-freeing: does the batch size matter when freeing,
or is it okay to wipe 32 (currently 64 in the baseline) in a row?
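
For the allocation side, a rough sketch of that change
(NAPI_SKB_CACHE_BULK is only a placeholder name):

	#define NAPI_SKB_CACHE_BULK	8

	/* refill the cache with a small batch instead of a whole half */
	if (unlikely(!nc->skb_count))
		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
						      GFP_ATOMIC,
						      NAPI_SKB_CACHE_BULK,
						      nc->skb_cache);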

> Perhaps giving the device drivers the ability to opt-in on this infra
> via a new helper - as done back then with napi_consume_skb() - would
> make this change safer?

That's actually a very nice idea. Only a little needs to change in
the code to introduce the ability to take heads from the cache
optionally. This way developers could switch to it when needed.

Thanks for the suggestions! I'll definitely absorb them into the code
and give it a test.

> > @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
> >  	kfree_skbmem(skb);
> >  }
> >
> > -static inline void _kfree_skb_defer(struct sk_buff *skb)
> > +static void napi_skb_cache_put(struct sk_buff *skb)
> >  {
> >  	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > +	u32 i;
> >
> >  	/* drop skb->head and call any destructors for packet */
> >  	skb_release_all(skb);
> >
> > -	/* record skb to CPU local list */
> > +	kasan_poison_object_data(skbuff_head_cache, skb);
> >  	nc->skb_cache[nc->skb_count++] = skb;
> >
> > -#ifdef CONFIG_SLUB
> > -	/* SLUB writes into objects when freeing */
> > -	prefetchw(skb);
> > -#endif
> 
> It looks like this chunk has been lost. Is that intentional?

Yep. This prefetchw() assumed that skbuff_heads would be wiped
immediately or at the end of the network softirq. Reusing this cache
means that heads can be reused later or kept in the cache for
some time, so prefetching makes no sense anymore.

> Thanks!
> 
> Paolo

Al
Paolo Abeni Feb. 10, 2021, 12:51 p.m. UTC | #3
On Wed, 2021-02-10 at 12:25 +0000, Alexander Lobakin wrote:
> Paolo Abeni <pabeni@redhat.com> on Wed, 10 Feb 2021 11:21:06 +0100 wrote:
> > Perhaps giving the device drivers the ability to opt-in on this infra
> > via a new helper - as done back then with napi_consume_skb() - would
> > make this change safer?
> 
> That's actually a very nice idea. Only a little needs to change in
> the code to introduce the ability to take heads from the cache
> optionally. This way developers could switch to it when needed.
> 
> Thanks for the suggestions! I'll definitely absorb them into the code
> and give it a test.

Quick reply before it's too late. I suggest waiting a bit for others'
opinions before coding - if others dislike this, I would regret wasting
your time.

Cheers,

Paolo
Florian Westphal Feb. 10, 2021, 1:15 p.m. UTC | #4
Alexander Lobakin <alobakin@pm.me> wrote:
> we're in such a context. This includes: build_skb() (called only
> from NIC drivers in NAPI Rx context) and {,__}napi_alloc_skb()
> (called from the same place or from kernel network softirq
> functions).

build_skb() is called from sleepable context in drivers/net/tun.c.
Perhaps there are other cases.
Jesper Dangaard Brouer Feb. 10, 2021, 5:25 p.m. UTC | #5
On Wed, 10 Feb 2021 12:25:04 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Paolo Abeni <pabeni@redhat.com>
> Date: Wed, 10 Feb 2021 11:21:06 +0100
> 
> > I'm sorry for the late feedback, I could not step in before.
> > 
> > Also adding Jesper for awareness, as he introduced the bulk free
> > infrastructure.

Thanks (and Alexander Duyck also did part of the work while at Red Hat).

In the initial versions of my patchsets I actually also had reuse of the
SKBs that were deferred-freed during NAPI context. But that part was
getting nitpicked and the merge window was getting close, so I ended up
dropping it.



> > On Tue, 2021-02-09 at 20:48 +0000, Alexander Lobakin wrote:
> > > @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> > >   */
> > >  struct sk_buff *build_skb(void *data, unsigned int frag_size)
> > >  {
> > > -	struct sk_buff *skb = __build_skb(data, frag_size);
> > > +	struct sk_buff *skb = __build_skb(data, frag_size, true);
> > 
> > I must admit I'm a bit scared of this. There are several high speed
> > device drivers that will move to bulk allocation, and we don't have any
> > performance figure for them.
> > 
> > In my experience with (low end) MIPS boards, cache miss costs tend to
> > be much less visible there compared to reasonably recent server H/W,
> > because the CPU/memory access time difference is much lower.
> > 
> > When moving to higher end H/W the performance gain you measured could
> > be completely countered by less optimal cache usage.
> > 
> > I also fear latency spikes - I'm unsure whether a 32-skb allocation vs a
> > single skb would be visible e.g. in a round-robin test. Generally
> > speaking, bulk allocating 32 skbs looks like a bit too much. IIRC, when
> > Edward added listification to GRO, he did several measurements with
> > different list sizes and found 8 to be the optimal value (for the tested
> > workload). Above that number the list became too big and the pressure
> > on the cache outweighed the bulking benefits.
> 
> I can change the logic so it would allocate only the first 8.
> I think I've already seen this batch value somewhere in the XDP code,
> so it might be a balanced one.

(Speaking about SLUB code): the bulk ALLOC side disables interrupts, and
can call the slow path (___slab_alloc), which is bad for latency-sensitive
workloads. Thus I don't recommend large bulk ALLOCATIONS.

> Regarding bulk-freeing: does the batch size matter when freeing,
> or is it okay to wipe 32 (currently 64 in the baseline) in a row?

(Speaking about SLUB code): you can bulk FREE a large amount of objects
without hurting latency-sensitive workloads, because it doesn't disable
interrupts (I'm quite proud that this was possible).
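
So on the free side, keeping a large one-shot flush should be fine - a
sketch based on the existing baseline code, ignoring the KASAN poisoning
added in this patch:

	/* flush the whole cache at once -- kmem_cache_free_bulk()
	 * does not disable interrupts
	 */
	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE,
				     nc->skb_cache);
		nc->skb_count = 0;
	}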


> > Perhaps giving the device drivers the ability to opt-in on this infra
> > via a new helper - as done back then with napi_consume_skb() - would
> > make this change safer?
> 
> That's actually a very nice idea. Only a little needs to change in
> the code to introduce the ability to take heads from the cache
> optionally. This way developers could switch to it when needed.

Well, I actually disagree that this should be hidden behind a switch
for drivers to enable, as this will take forever to get properly enabled.



> Thanks for the suggestions! I'll definitely absorb them into the code
> and give it a test.
> 
> > > @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
> > >  	kfree_skbmem(skb);
> > >  }
> > >
> > > -static inline void _kfree_skb_defer(struct sk_buff *skb)
> > > +static void napi_skb_cache_put(struct sk_buff *skb)
> > >  {
> > >  	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > > +	u32 i;
> > >
> > >  	/* drop skb->head and call any destructors for packet */
> > >  	skb_release_all(skb);
> > >
> > > -	/* record skb to CPU local list */
> > > +	kasan_poison_object_data(skbuff_head_cache, skb);
> > >  	nc->skb_cache[nc->skb_count++] = skb;
> > >
> > > -#ifdef CONFIG_SLUB
> > > -	/* SLUB writes into objects when freeing */
> > > -	prefetchw(skb);
> > > -#endif
> > 
> > It looks like this chunk has been lost. Is that intentional?
> 
> Yep. This prefetchw() assumed that skbuff_heads would be wiped
> immediately or at the end of the network softirq. Reusing this cache
> means that heads can be reused later or kept in the cache for
> some time, so prefetching makes no sense anymore.

I agree with this statement, the prefetchw() is no longer needed.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0e0707296098..5bb443d37bf4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1082,7 +1082,7 @@  bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
 			    int node);
-struct sk_buff *__build_skb(void *data, unsigned int frag_size);
+struct sk_buff *__build_skb(void *data, unsigned int frag_size, bool napi_safe);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb_around(struct sk_buff *skb,
 				 void *data, unsigned int frag_size);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 860a9d4f752f..8747566a8136 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -120,6 +120,7 @@  static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
 }
 
 #define NAPI_SKB_CACHE_SIZE	64
+#define NAPI_SKB_CACHE_HALF	(NAPI_SKB_CACHE_SIZE / 2)
 
 struct napi_alloc_cache {
 	struct page_frag_cache page;
@@ -164,6 +165,30 @@  void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_frag_align);
 
+static struct sk_buff *napi_skb_cache_get(bool napi_safe)
+{
+	struct napi_alloc_cache *nc;
+	struct sk_buff *skb;
+
+	if (!napi_safe && unlikely(!in_serving_softirq()))
+		return kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+
+	nc = this_cpu_ptr(&napi_alloc_cache);
+
+	if (unlikely(!nc->skb_count))
+		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
+						      GFP_ATOMIC,
+						      NAPI_SKB_CACHE_HALF,
+						      nc->skb_cache);
+	if (unlikely(!nc->skb_count))
+		return NULL;
+
+	skb = nc->skb_cache[--nc->skb_count];
+	kasan_unpoison_object_data(skbuff_head_cache, skb);
+
+	return skb;
+}
+
 /* Caller must provide SKB that is memset cleared */
 static void __build_skb_around(struct sk_buff *skb, void *data,
 			       unsigned int frag_size)
@@ -210,11 +235,11 @@  static void __build_skb_around(struct sk_buff *skb, void *data,
  *  before giving packet to stack.
  *  RX rings only contains data buffers, not full skbs.
  */
-struct sk_buff *__build_skb(void *data, unsigned int frag_size)
+struct sk_buff *__build_skb(void *data, unsigned int frag_size, bool napi_safe)
 {
 	struct sk_buff *skb;
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	skb = napi_skb_cache_get(napi_safe);
 	if (unlikely(!skb))
 		return NULL;
 
@@ -231,7 +256,7 @@  struct sk_buff *__build_skb(void *data, unsigned int frag_size)
  */
 struct sk_buff *build_skb(void *data, unsigned int frag_size)
 {
-	struct sk_buff *skb = __build_skb(data, frag_size);
+	struct sk_buff *skb = __build_skb(data, frag_size, true);
 
 	if (skb && frag_size) {
 		skb->head_frag = 1;
@@ -443,7 +468,7 @@  struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-	skb = __build_skb(data, len);
+	skb = __build_skb(data, len, false);
 	if (unlikely(!skb)) {
 		skb_free_frag(data);
 		return NULL;
@@ -507,7 +532,7 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-	skb = __build_skb(data, len);
+	skb = __build_skb(data, len, true);
 	if (unlikely(!skb)) {
 		skb_free_frag(data);
 		return NULL;
@@ -838,31 +863,31 @@  void __consume_stateless_skb(struct sk_buff *skb)
 	kfree_skbmem(skb);
 }
 
-static inline void _kfree_skb_defer(struct sk_buff *skb)
+static void napi_skb_cache_put(struct sk_buff *skb)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	u32 i;
 
 	/* drop skb->head and call any destructors for packet */
 	skb_release_all(skb);
 
-	/* record skb to CPU local list */
+	kasan_poison_object_data(skbuff_head_cache, skb);
 	nc->skb_cache[nc->skb_count++] = skb;
 
-#ifdef CONFIG_SLUB
-	/* SLUB writes into objects when freeing */
-	prefetchw(skb);
-#endif
-
-	/* flush skb_cache if it is filled */
 	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
-		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE,
-				     nc->skb_cache);
-		nc->skb_count = 0;
+		for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
+			kasan_unpoison_object_data(skbuff_head_cache,
+						   nc->skb_cache[i]);
+
+		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
+				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
+		nc->skb_count = NAPI_SKB_CACHE_HALF;
 	}
 }
+
 void __kfree_skb_defer(struct sk_buff *skb)
 {
-	_kfree_skb_defer(skb);
+	napi_skb_cache_put(skb);
 }
 
 void napi_consume_skb(struct sk_buff *skb, int budget)
@@ -887,7 +912,7 @@  void napi_consume_skb(struct sk_buff *skb, int budget)
 		return;
 	}
 
-	_kfree_skb_defer(skb);
+	napi_skb_cache_put(skb);
 }
 EXPORT_SYMBOL(napi_consume_skb);
 
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index dd488938447f..afba4e11a526 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1190,7 +1190,7 @@  static struct sk_buff *netlink_alloc_large_skb(unsigned int size,
 	if (data == NULL)
 		return NULL;
 
-	skb = __build_skb(data, size);
+	skb = __build_skb(data, size, false);
 	if (skb == NULL)
 		vfree(data);
 	else