
[net-next] tcp: add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb()

Message ID 1630492744-60396-1-git-send-email-linyunsheng@huawei.com
State New
Series [net-next] tcp: add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb()

Commit Message

Yunsheng Lin Sept. 1, 2021, 10:39 a.m. UTC
Since tcp_tx_skb_cache is disabled by default in:
commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")

Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
avoid possible branch-misses.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
Also, the sk->sk_tx_skb_cache may be both changed by allocation
and freeing side, I assume there may be some implicit protection
here too, such as the NAPI protection for rx?
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
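For reference, the mechanism the patch relies on is the kernel's static-key (jump label) facility: tcp_tx_skb_cache_key is defined with DEFINE_STATIC_KEY_FALSE (in net/ipv4/tcp.c), so a static_branch_unlikely() test compiles to straight-line code that is only patched into a jump when the sysctl enables the key. Below is a minimal sketch of the gating pattern being proposed; the helper name is illustrative and this is not the exact upstream sk_stream_alloc_skb():

/* Simplified sketch (hypothetical helper, not the upstream function) of
 * gating the per-socket tx skb cache behind the static key.
 */
DEFINE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);	/* upstream: net/ipv4/tcp.c */

static struct sk_buff *tx_cache_alloc_sketch(struct sock *sk, int size, gfp_t gfp)
{
	/* With the key disabled (the default), this test is patched out:
	 * no load of sk->sk_tx_skb_cache and no conditional branch that
	 * could be mispredicted.
	 */
	if (static_branch_unlikely(&tcp_tx_skb_cache_key) && likely(!size)) {
		struct sk_buff *skb = sk->sk_tx_skb_cache;

		if (skb) {
			sk->sk_tx_skb_cache = NULL;
			return skb;		/* reuse the cached skb */
		}
	}

	/* otherwise fall through to the normal fclone allocation path */
	return alloc_skb_fclone(size + MAX_TCP_HEADER, gfp);
}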

Comments

Eric Dumazet Sept. 1, 2021, 3:16 p.m. UTC | #1
On Wed, Sep 1, 2021 at 8:06 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Sep 1, 2021 at 3:52 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Wed, 2021-09-01 at 18:39 +0800, Yunsheng Lin wrote:
> > > Since tcp_tx_skb_cache is disabled by default in:
> > > commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
> > >
> > > Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
> > > avoid possible branch-misses.
> > >
> > > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >
> > Note that MPTCP is currently exploiting sk->sk_tx_skb_cache. If this
> > patch goes in as-is, it will break MPTCP.
> >
> > One possible solution would be to let MPTCP usage enable
> > sk->sk_tx_skb_cache, but that has relevant side effects on plain TCP.
> >
> > Another option would be to rework the MPTCP xmit path once again to
> > avoid using sk->sk_tx_skb_cache.
> >
>
> Hmmm, I actually wrote a revert of this feature but forgot to submit
> it last year.
>
> commit c36cfbd791f62c0f7c6b32132af59dfdbe6be21b (HEAD -> listener_scale4)
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Wed May 20 06:38:38 2020 -0700
>
>     tcp: remove sk_{tr}x_skb_cache
>
>     This reverts the following patches :
>
>     2e05fcae83c41eb2df10558338dc600dc783af47 ("tcp: fix compile error
> if !CONFIG_SYSCTL")
>     4f661542a40217713f2cee0bb6678fbb30d9d367 ("tcp: fix zerocopy and
> notsent_lowat issues")
>     472c2e07eef045145bc1493cc94a01c87140780a ("tcp: add one skb cache for tx")
>     8b27dae5a2e89a61c46c6dbc76c040c0e6d0ed4c ("tcp: add one skb cache for rx")
>
>     Having a cache of one skb (in each direction) per TCP socket is fragile,
>     since it can cause a significant increase of memory needs,
>     and not good enough for high speed flows anyway where more than one skb
>     is needed.
>
>     We want instead to add a generic infrastructure, with more flexible per-cpu
>     caches, for alien NUMA nodes.
>
>     Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> I will update this commit to also remove the part in MPTCP.
>
> Let's remove this feature and replace it with something less costly.

Paolo, can you work on the MPTCP side, so that my revert can then be applied?

Thanks !
From c36cfbd791f62c0f7c6b32132af59dfdbe6be21b Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 20 May 2020 06:38:38 -0700
Subject: [PATCH net-next] tcp: remove sk_{tr}x_skb_cache

This reverts the following patches :

2e05fcae83c41eb2df10558338dc600dc783af47 ("tcp: fix compile error if !CONFIG_SYSCTL")
4f661542a40217713f2cee0bb6678fbb30d9d367 ("tcp: fix zerocopy and notsent_lowat issues")
472c2e07eef045145bc1493cc94a01c87140780a ("tcp: add one skb cache for tx")
8b27dae5a2e89a61c46c6dbc76c040c0e6d0ed4c ("tcp: add one skb cache for rx")

Having a cache of one skb (in each direction) per TCP socket is fragile,
since it can cause a significant increase of memory needs,
and not good enough for high speed flows anyway where more than one skb
is needed.

We want instead to add a generic infrastructure, with more flexible per-cpu
caches, for alien NUMA nodes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/networking/ip-sysctl.rst |  8 --------
 include/net/sock.h                     | 19 -------------------
 net/ipv4/af_inet.c                     |  4 ----
 net/ipv4/sysctl_net_ipv4.c             | 12 ------------
 net/ipv4/tcp.c                         | 26 --------------------------
 net/ipv4/tcp_ipv4.c                    |  6 ------
 net/ipv6/tcp_ipv6.c                    |  6 ------
 7 files changed, 81 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index d91ab28718d493bf5196f2b08cb59ec2c138e8e4..16b8bf72feaf42753121c368530a4f2e864a658f 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -989,14 +989,6 @@ tcp_challenge_ack_limit - INTEGER
 	in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
 	Default: 1000
 
-tcp_rx_skb_cache - BOOLEAN
-	Controls a per TCP socket cache of one skb, that might help
-	performance of some workloads. This might be dangerous
-	on systems with a lot of TCP sockets, since it increases
-	memory usage.
-
-	Default: 0 (disabled)
-
 UDP variables
 =============
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 66a9a90f9558e4c739e01192b3f8cee9fa271ae9..708b9de3cdbb116eb2d1e63ecbc8672d556f0951 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -262,7 +262,6 @@ struct bpf_local_storage;
   *	@sk_dst_cache: destination cache
   *	@sk_dst_pending_confirm: need to confirm neighbour
   *	@sk_policy: flow policy
-  *	@sk_rx_skb_cache: cache copy of recently accessed RX skb
   *	@sk_receive_queue: incoming packets
   *	@sk_wmem_alloc: transmit queue bytes committed
   *	@sk_tsq_flags: TCP Small Queues flags
@@ -328,7 +327,6 @@ struct bpf_local_storage;
   *	@sk_peek_off: current peek_offset value
   *	@sk_send_head: front of stuff to transmit
   *	@tcp_rtx_queue: TCP re-transmit queue [union with @sk_send_head]
-  *	@sk_tx_skb_cache: cache copy of recently accessed TX skb
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
   *	@sk_cgrp_data: cgroup data for this cgroup
@@ -393,7 +391,6 @@ struct sock {
 	atomic_t		sk_drops;
 	int			sk_rcvlowat;
 	struct sk_buff_head	sk_error_queue;
-	struct sk_buff		*sk_rx_skb_cache;
 	struct sk_buff_head	sk_receive_queue;
 	/*
 	 * The backlog queue is special, it is always used with
@@ -442,7 +439,6 @@ struct sock {
 		struct sk_buff	*sk_send_head;
 		struct rb_root	tcp_rtx_queue;
 	};
-	struct sk_buff		*sk_tx_skb_cache;
 	struct sk_buff_head	sk_write_queue;
 	__s32			sk_peek_off;
 	int			sk_write_pending;
@@ -1555,18 +1551,10 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
 		__sk_mem_reclaim(sk, 1 << 20);
 }
 
-DECLARE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 {
 	sk_wmem_queued_add(sk, -skb->truesize);
 	sk_mem_uncharge(sk, skb->truesize);
-	if (static_branch_unlikely(&tcp_tx_skb_cache_key) &&
-	    !sk->sk_tx_skb_cache && !skb_cloned(skb)) {
-		skb_ext_reset(skb);
-		skb_zcopy_clear(skb, true);
-		sk->sk_tx_skb_cache = skb;
-		return;
-	}
 	__kfree_skb(skb);
 }
 
@@ -2575,7 +2563,6 @@ static inline void skb_setup_tx_timestamp(struct sk_buff *skb, __u16 tsflags)
 			   &skb_shinfo(skb)->tskey);
 }
 
-DECLARE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);
 /**
  * sk_eat_skb - Release a skb if it is no longer needed
  * @sk: socket to eat this skb from
@@ -2587,12 +2574,6 @@ DECLARE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);
 static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
-	if (static_branch_unlikely(&tcp_rx_skb_cache_key) &&
-	    !sk->sk_rx_skb_cache) {
-		sk->sk_rx_skb_cache = skb;
-		skb_orphan(skb);
-		return;
-	}
 	__kfree_skb(skb);
 }
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1d816a5fd3eb914e0e3a071a7ae159d7311073f8..40558033f857c0ca7d98b778f70487e194f3d066 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -133,10 +133,6 @@ void inet_sock_destruct(struct sock *sk)
 	struct inet_sock *inet = inet_sk(sk);
 
 	__skb_queue_purge(&sk->sk_receive_queue);
-	if (sk->sk_rx_skb_cache) {
-		__kfree_skb(sk->sk_rx_skb_cache);
-		sk->sk_rx_skb_cache = NULL;
-	}
 	__skb_queue_purge(&sk->sk_error_queue);
 
 	sk_mem_reclaim(sk);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 6f1e64d492328777bf8f167b56c01bc96b182e98..6eb43dc91218cea14134b66dfd2232ff2659babb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -594,18 +594,6 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &sysctl_fib_sync_mem_min,
 		.extra2		= &sysctl_fib_sync_mem_max,
 	},
-	{
-		.procname	= "tcp_rx_skb_cache",
-		.data		= &tcp_rx_skb_cache_key.key,
-		.mode		= 0644,
-		.proc_handler	= proc_do_static_key,
-	},
-	{
-		.procname	= "tcp_tx_skb_cache",
-		.data		= &tcp_tx_skb_cache_key.key,
-		.mode		= 0644,
-		.proc_handler	= proc_do_static_key,
-	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e8b48df73c852a48e51754ea98b1e08bf024bb9e..4a37500f4bd4422bda2609de520b83056c2cb01a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -325,11 +325,6 @@ struct tcp_splice_state {
 unsigned long tcp_memory_pressure __read_mostly;
 EXPORT_SYMBOL_GPL(tcp_memory_pressure);
 
-DEFINE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);
-EXPORT_SYMBOL(tcp_rx_skb_cache_key);
-
-DEFINE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
-
 void tcp_enter_memory_pressure(struct sock *sk)
 {
 	unsigned long val;
@@ -866,18 +861,6 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 {
 	struct sk_buff *skb;
 
-	if (likely(!size)) {
-		skb = sk->sk_tx_skb_cache;
-		if (skb) {
-			skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
-			sk->sk_tx_skb_cache = NULL;
-			pskb_trim(skb, 0);
-			INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
-			skb_shinfo(skb)->tx_flags = 0;
-			memset(TCP_SKB_CB(skb), 0, sizeof(struct tcp_skb_cb));
-			return skb;
-		}
-	}
 	/* The TCP header must be at least 32-bit aligned.  */
 	size = ALIGN(size, 4);
 
@@ -2920,11 +2903,6 @@ void tcp_write_queue_purge(struct sock *sk)
 		sk_wmem_free_skb(sk, skb);
 	}
 	tcp_rtx_queue_purge(sk);
-	skb = sk->sk_tx_skb_cache;
-	if (skb) {
-		__kfree_skb(skb);
-		sk->sk_tx_skb_cache = NULL;
-	}
 	INIT_LIST_HEAD(&tcp_sk(sk)->tsorted_sent_queue);
 	sk_mem_reclaim(sk);
 	tcp_clear_all_retrans_hints(tcp_sk(sk));
@@ -2961,10 +2939,6 @@ int tcp_disconnect(struct sock *sk, int flags)
 
 	tcp_clear_xmit_timers(sk);
 	__skb_queue_purge(&sk->sk_receive_queue);
-	if (sk->sk_rx_skb_cache) {
-		__kfree_skb(sk->sk_rx_skb_cache);
-		sk->sk_rx_skb_cache = NULL;
-	}
 	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 	tp->urg_data = 0;
 	tcp_write_queue_purge(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2e62e0d6373a6ee5d98756fb1967bff9743ffa02..29a57bd159f0aa99e892bac56b75961c107f803a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1941,7 +1941,6 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 int tcp_v4_rcv(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
-	struct sk_buff *skb_to_free;
 	int sdif = inet_sdif(skb);
 	int dif = inet_iif(skb);
 	const struct iphdr *iph;
@@ -2082,17 +2081,12 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
-		skb_to_free = sk->sk_rx_skb_cache;
-		sk->sk_rx_skb_cache = NULL;
 		ret = tcp_v4_do_rcv(sk, skb);
 	} else {
 		if (tcp_add_backlog(sk, skb))
 			goto discard_and_relse;
-		skb_to_free = NULL;
 	}
 	bh_unlock_sock(sk);
-	if (skb_to_free)
-		__kfree_skb(skb_to_free);
 
 put_and_return:
 	if (refcounted)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0ce52d46e4f81b221a6acd4a0dd7b0d462dbac7a..8cf5ff2e95043ec1a2b27661aae884eb13dcf9eb 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1618,7 +1618,6 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 
 INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 {
-	struct sk_buff *skb_to_free;
 	int sdif = inet6_sdif(skb);
 	int dif = inet6_iif(skb);
 	const struct tcphdr *th;
@@ -1754,17 +1753,12 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
-		skb_to_free = sk->sk_rx_skb_cache;
-		sk->sk_rx_skb_cache = NULL;
 		ret = tcp_v6_do_rcv(sk, skb);
 	} else {
 		if (tcp_add_backlog(sk, skb))
 			goto discard_and_relse;
-		skb_to_free = NULL;
 	}
 	bh_unlock_sock(sk);
-	if (skb_to_free)
-		__kfree_skb(skb_to_free);
 put_and_return:
 	if (refcounted)
 		sock_put(sk);
Paolo Abeni Sept. 1, 2021, 3:25 p.m. UTC | #2
On Wed, 2021-09-01 at 08:16 -0700, Eric Dumazet wrote:
> On Wed, Sep 1, 2021 at 8:06 AM Eric Dumazet <edumazet@google.com> wrote:
> > On Wed, Sep 1, 2021 at 3:52 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > On Wed, 2021-09-01 at 18:39 +0800, Yunsheng Lin wrote:
> > > > Since tcp_tx_skb_cache is disabled by default in:
> > > > commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
> > > > 
> > > > Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
> > > > avoid possible branch-misses.
> > > > 
> > > > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> > > 
> > > Note that MPTCP is currently exploiting sk->sk_tx_skb_cache. If this
> > > patch goes in as-is, it will break MPTCP.
> > >
> > > One possible solution would be to let MPTCP usage enable
> > > sk->sk_tx_skb_cache, but that has relevant side effects on plain TCP.
> > >
> > > Another option would be to rework the MPTCP xmit path once again to
> > > avoid using sk->sk_tx_skb_cache.
> > > 
> > 
> > Hmmm, I actually wrote a revert of this feature but forgot to submit
> > it last year.
> > 
> > commit c36cfbd791f62c0f7c6b32132af59dfdbe6be21b (HEAD -> listener_scale4)
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Wed May 20 06:38:38 2020 -0700
> > 
> >     tcp: remove sk_{tr}x_skb_cache
> > 
> >     This reverts the following patches :
> > 
> >     2e05fcae83c41eb2df10558338dc600dc783af47 ("tcp: fix compile error
> > if !CONFIG_SYSCTL")
> >     4f661542a40217713f2cee0bb6678fbb30d9d367 ("tcp: fix zerocopy and
> > notsent_lowat issues")
> >     472c2e07eef045145bc1493cc94a01c87140780a ("tcp: add one skb cache for tx")
> >     8b27dae5a2e89a61c46c6dbc76c040c0e6d0ed4c ("tcp: add one skb cache for rx")
> > 
> >     Having a cache of one skb (in each direction) per TCP socket is fragile,
> >     since it can cause a significant increase of memory needs,
> >     and not good enough for high speed flows anyway where more than one skb
> >     is needed.
> > 
> >     We want instead to add a generic infrastructure, with more flexible per-cpu
> >     caches, for alien NUMA nodes.
> > 
> >     Signed-off-by: Eric Dumazet <edumazet@google.com>
> > 
> > I will update this commit to also remove the part in MPTCP.
> > 
> > Let's remove this feature and replace it with something less costly.
> 
> Paolo, can you work on the MPTCP side, so that my revert can then be applied?

You are way too fast, I was still replying to your previous email,
asking if I could help :)

I'll have a look ASAP. Please allow for some latency: I'm way slower!

Cheers,

Paolo
Eric Dumazet Sept. 1, 2021, 4:02 p.m. UTC | #3
On Wed, Sep 1, 2021 at 9:01 AM Paolo Abeni <pabeni@redhat.com> wrote:
>

>
> I think the easiest way and the one with less code duplication will
> require accessing the tcp_mark_push() and skb_entail() helpers from the
> MPTCP code, making them not static and exposing them e.g. in net/tcp.h.
> Would that be acceptable or should I look for other options?
>

I think this is fine, really.
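(What Paolo describes would, in practice, mean dropping the static qualifier from the two helpers in net/ipv4/tcp.c and exposing their prototypes, e.g. in include/net/tcp.h. A rough illustration, assuming the helpers keep their then-current signatures:)

/* Hypothetical prototypes exposed in include/net/tcp.h so the MPTCP xmit
 * path can call them directly (signatures assumed from net/ipv4/tcp.c).
 */
void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb);
void skb_entail(struct sock *sk, struct sk_buff *skb);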
Yunsheng Lin Sept. 2, 2021, 12:47 a.m. UTC | #4
On 2021/9/1 18:39, Yunsheng Lin wrote:
> Since tcp_tx_skb_cache is disabled by default in:
> commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
>
> Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
> avoid possible branch-misses.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> Also, the sk->sk_tx_skb_cache may be both changed by allocation
> and freeing side, I assume there may be some implicit protection
> here too, such as the NAPI protection for rx?

Hi, Eric
   Is there any implicit protection for sk->sk_tx_skb_cache?
As I understand it, sk_stream_alloc_skb() seems to be protected
by lock_sock(), while sk_wmem_free_skb() seems to mostly happen
in NAPI polling for TCP (when an ACK packet is received) without
lock_sock(), so it seems there is no protection here?

>
Eric Dumazet Sept. 2, 2021, 1:13 a.m. UTC | #5
On Wed, Sep 1, 2021 at 5:47 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/9/1 18:39, Yunsheng Lin wrote:
> > Since tcp_tx_skb_cache is disabled by default in:
> > commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
> >
> > Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
> > avoid possible branch-misses.
> >
> > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> > ---
> > Also, the sk->sk_tx_skb_cache may be both changed by allocation
> > and freeing side, I assume there may be some implicit protection
> > here too, such as the NAPI protection for rx?
>
> Hi, Eric
>    Is there any implicit protection for sk->sk_tx_skb_cache?
> As I understand it, sk_stream_alloc_skb() seems to be protected
> by lock_sock(), while sk_wmem_free_skb() seems to mostly happen
> in NAPI polling for TCP (when an ACK packet is received) without
> lock_sock(), so it seems there is no protection here?
>

Please look again.
This is protected by socket lock of course.
Otherwise sk_mem_uncharge() would be very broken, sk->sk_forward_alloc
is not an atomic field.

TCP stack has no direct relation with NAPI.
It can run over loopback interface, no NAPI there.
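(For readers following the thread: the "socket lock" here is the usual two-part bh_lock_sock()/sock_owned_by_user() scheme in the receive path. A condensed sketch of the relevant logic, simplified from the tcp_v4_rcv() hunk quoted earlier in this thread, with declarations and error handling omitted:)

	bh_lock_sock_nested(sk);
	ret = 0;
	if (!sock_owned_by_user(sk)) {
		/* No process holds the socket lock, so the softirq owns the
		 * socket here; ACK processing may free tx skbs (eventually
		 * reaching sk_wmem_free_skb()) without racing the sender,
		 * which runs under lock_sock().
		 */
		ret = tcp_v4_do_rcv(sk, skb);
	} else {
		/* A process owns the lock: queue to the backlog instead; the
		 * backlog is drained under lock_sock() in release_sock().
		 */
		if (tcp_add_backlog(sk, skb))
			goto discard_and_relse;
	}
	bh_unlock_sock(sk);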
Yunsheng Lin Sept. 2, 2021, 2:05 a.m. UTC | #6
On 2021/9/2 9:13, Eric Dumazet wrote:
> On Wed, Sep 1, 2021 at 5:47 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2021/9/1 18:39, Yunsheng Lin wrote:
>>> Since tcp_tx_skb_cache is disabled by default in:
>>> commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
>>>
>>> Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
>>> avoid possible branch-misses.
>>>
>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>> ---
>>> Also, the sk->sk_tx_skb_cache may be both changed by allocation
>>> and freeing side, I assume there may be some implicit protection
>>> here too, such as the NAPI protection for rx?
>>
>> Hi, Eric
>>    Is there any implicit protection for sk->sk_tx_skb_cache?
>> As I understand it, sk_stream_alloc_skb() seems to be protected
>> by lock_sock(), while sk_wmem_free_skb() seems to mostly happen
>> in NAPI polling for TCP (when an ACK packet is received) without
>> lock_sock(), so it seems there is no protection here?
>>
>
> Please look again.
> This is protected by socket lock of course.
> Otherwise sk_mem_uncharge() would be very broken, sk->sk_forward_alloc
> is not an atomic field.

Thanks for clarifying.
I have been looking for a point to implement the socket's pp_alloc_cache for
page pool, and sk_wmem_free_skb() seems like the place to avoid the
scalability problem of ptr_ring in page pool.

The protection for sk_wmem_free_skb() is in tcp_v4_rcv(), right?
https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_ipv4.c#L2081

>
> TCP stack has no direct relation with NAPI.
> It can run over loopback interface, no NAPI there.
> .
>

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e8b48df..226ddef 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -866,7 +866,7 @@  struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 {
 	struct sk_buff *skb;
 
-	if (likely(!size)) {
+	if (static_branch_unlikely(&tcp_tx_skb_cache_key) && likely(!size)) {
 		skb = sk->sk_tx_skb_cache;
 		if (skb) {
 			skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));