Message ID: 1630492744-60396-1-git-send-email-linyunsheng@huawei.com
State: New
Series: [net-next] tcp: add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb()
On Wed, Sep 1, 2021 at 8:06 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Sep 1, 2021 at 3:52 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Wed, 2021-09-01 at 18:39 +0800, Yunsheng Lin wrote:
> > > Since tcp_tx_skb_cache is disabled by default in:
> > > commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
> > >
> > > Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
> > > avoid possible branch-misses.
> > >
> > > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >
> > Note that MPTCP is currently exploiting sk->sk_tx_skb_cache. If this
> > patch goes in as-is, it will break mptcp.
> >
> > One possible solution would be to let mptcp usage enable
> > sk->sk_tx_skb_cache, but that has relevant side effects on plain TCP.
> >
> > Another option would be to re-work once again the mptcp xmit path to
> > avoid using sk->sk_tx_skb_cache.
> >
>
> Hmmm, I actually wrote a revert of this feature but forgot to submit
> it last year.
>
> commit c36cfbd791f62c0f7c6b32132af59dfdbe6be21b (HEAD -> listener_scale4)
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Wed May 20 06:38:38 2020 -0700
>
>     tcp: remove sk_{tr}x_skb_cache
>
>     This reverts the following patches :
>
>     2e05fcae83c41eb2df10558338dc600dc783af47 ("tcp: fix compile error if !CONFIG_SYSCTL")
>     4f661542a40217713f2cee0bb6678fbb30d9d367 ("tcp: fix zerocopy and notsent_lowat issues")
>     472c2e07eef045145bc1493cc94a01c87140780a ("tcp: add one skb cache for tx")
>     8b27dae5a2e89a61c46c6dbc76c040c0e6d0ed4c ("tcp: add one skb cache for rx")
>
>     Having a cache of one skb (in each direction) per TCP socket is fragile,
>     since it can cause a significant increase of memory needs,
>     and not good enough for high speed flows anyway where more than one skb
>     is needed.
>
>     We want instead to add a generic infrastructure, with more flexible per-cpu
>     caches, for alien NUMA nodes.
>
>     Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> I will update this commit to also remove the part in MPTCP.
>
> Let's remove this feature and replace it with something less costly.

Paolo, can you work on the MPTCP side, so that my revert can then be applied?

Thanks!

From c36cfbd791f62c0f7c6b32132af59dfdbe6be21b Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 20 May 2020 06:38:38 -0700
Subject: [PATCH net-next] tcp: remove sk_{tr}x_skb_cache

This reverts the following patches :

2e05fcae83c41eb2df10558338dc600dc783af47 ("tcp: fix compile error if !CONFIG_SYSCTL")
4f661542a40217713f2cee0bb6678fbb30d9d367 ("tcp: fix zerocopy and notsent_lowat issues")
472c2e07eef045145bc1493cc94a01c87140780a ("tcp: add one skb cache for tx")
8b27dae5a2e89a61c46c6dbc76c040c0e6d0ed4c ("tcp: add one skb cache for rx")

Having a cache of one skb (in each direction) per TCP socket is fragile,
since it can cause a significant increase of memory needs,
and not good enough for high speed flows anyway where more than one skb
is needed.

We want instead to add a generic infrastructure, with more flexible per-cpu
caches, for alien NUMA nodes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/networking/ip-sysctl.rst |  8 --------
 include/net/sock.h                     | 19 -------------------
 net/ipv4/af_inet.c                     |  4 ----
 net/ipv4/sysctl_net_ipv4.c             | 12 ------------
 net/ipv4/tcp.c                         | 26 --------------------------
 net/ipv4/tcp_ipv4.c                    |  6 ------
 net/ipv6/tcp_ipv6.c                    |  6 ------
 7 files changed, 81 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index d91ab28718d493bf5196f2b08cb59ec2c138e8e4..16b8bf72feaf42753121c368530a4f2e864a658f 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -989,14 +989,6 @@ tcp_challenge_ack_limit - INTEGER
 	in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
 	Default: 1000

-tcp_rx_skb_cache - BOOLEAN
-	Controls a per TCP socket cache of one skb, that might help
-	performance of some workloads. This might be dangerous
-	on systems with a lot of TCP sockets, since it increases
-	memory usage.
-
-	Default: 0 (disabled)
-
 UDP variables
 =============

diff --git a/include/net/sock.h b/include/net/sock.h
index 66a9a90f9558e4c739e01192b3f8cee9fa271ae9..708b9de3cdbb116eb2d1e63ecbc8672d556f0951 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -262,7 +262,6 @@ struct bpf_local_storage;
  *	@sk_dst_cache: destination cache
  *	@sk_dst_pending_confirm: need to confirm neighbour
  *	@sk_policy: flow policy
- *	@sk_rx_skb_cache: cache copy of recently accessed RX skb
  *	@sk_receive_queue: incoming packets
  *	@sk_wmem_alloc: transmit queue bytes committed
  *	@sk_tsq_flags: TCP Small Queues flags
@@ -328,7 +327,6 @@ struct bpf_local_storage;
  *	@sk_peek_off: current peek_offset value
  *	@sk_send_head: front of stuff to transmit
  *	@tcp_rtx_queue: TCP re-transmit queue [union with @sk_send_head]
- *	@sk_tx_skb_cache: cache copy of recently accessed TX skb
  *	@sk_security: used by security modules
  *	@sk_mark: generic packet mark
  *	@sk_cgrp_data: cgroup data for this cgroup
@@ -393,7 +391,6 @@ struct sock {
 	atomic_t		sk_drops;
 	int			sk_rcvlowat;
 	struct sk_buff_head	sk_error_queue;
-	struct sk_buff		*sk_rx_skb_cache;
 	struct sk_buff_head	sk_receive_queue;
 	/*
 	 * The backlog queue is special, it is always used with
@@ -442,7 +439,6 @@ struct sock {
 		struct sk_buff	*sk_send_head;
 		struct rb_root	tcp_rtx_queue;
 	};
-	struct sk_buff		*sk_tx_skb_cache;
 	struct sk_buff_head	sk_write_queue;
 	__s32			sk_peek_off;
 	int			sk_write_pending;
@@ -1555,18 +1551,10 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
 		__sk_mem_reclaim(sk, 1 << 20);
 }

-DECLARE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 {
 	sk_wmem_queued_add(sk, -skb->truesize);
 	sk_mem_uncharge(sk, skb->truesize);
-	if (static_branch_unlikely(&tcp_tx_skb_cache_key) &&
-	    !sk->sk_tx_skb_cache && !skb_cloned(skb)) {
-		skb_ext_reset(skb);
-		skb_zcopy_clear(skb, true);
-		sk->sk_tx_skb_cache = skb;
-		return;
-	}
 	__kfree_skb(skb);
 }

@@ -2575,7 +2563,6 @@ static inline void skb_setup_tx_timestamp(struct sk_buff *skb, __u16 tsflags)
 			       &skb_shinfo(skb)->tskey);
 }

-DECLARE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);
 /**
  * sk_eat_skb - Release a skb if it is no longer needed
  * @sk: socket to eat this skb from
@@ -2587,12 +2574,6 @@ DECLARE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);
 static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
-	if (static_branch_unlikely(&tcp_rx_skb_cache_key) &&
-	    !sk->sk_rx_skb_cache) {
-		sk->sk_rx_skb_cache = skb;
-		skb_orphan(skb);
-		return;
-	}
 	__kfree_skb(skb);
 }

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1d816a5fd3eb914e0e3a071a7ae159d7311073f8..40558033f857c0ca7d98b778f70487e194f3d066 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -133,10 +133,6 @@ void inet_sock_destruct(struct sock *sk)
 	struct inet_sock *inet = inet_sk(sk);

 	__skb_queue_purge(&sk->sk_receive_queue);
-	if (sk->sk_rx_skb_cache) {
-		__kfree_skb(sk->sk_rx_skb_cache);
-		sk->sk_rx_skb_cache = NULL;
-	}
 	__skb_queue_purge(&sk->sk_error_queue);

 	sk_mem_reclaim(sk);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 6f1e64d492328777bf8f167b56c01bc96b182e98..6eb43dc91218cea14134b66dfd2232ff2659babb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -594,18 +594,6 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &sysctl_fib_sync_mem_min,
 		.extra2		= &sysctl_fib_sync_mem_max,
 	},
-	{
-		.procname	= "tcp_rx_skb_cache",
-		.data		= &tcp_rx_skb_cache_key.key,
-		.mode		= 0644,
-		.proc_handler	= proc_do_static_key,
-	},
-	{
-		.procname	= "tcp_tx_skb_cache",
-		.data		= &tcp_tx_skb_cache_key.key,
-		.mode		= 0644,
-		.proc_handler	= proc_do_static_key,
-	},
 	{ }
 };

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e8b48df73c852a48e51754ea98b1e08bf024bb9e..4a37500f4bd4422bda2609de520b83056c2cb01a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -325,11 +325,6 @@ struct tcp_splice_state {
 unsigned long tcp_memory_pressure __read_mostly;
 EXPORT_SYMBOL_GPL(tcp_memory_pressure);

-DEFINE_STATIC_KEY_FALSE(tcp_rx_skb_cache_key);
-EXPORT_SYMBOL(tcp_rx_skb_cache_key);
-
-DEFINE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
-
 void tcp_enter_memory_pressure(struct sock *sk)
 {
 	unsigned long val;
@@ -866,18 +861,6 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 {
 	struct sk_buff *skb;

-	if (likely(!size)) {
-		skb = sk->sk_tx_skb_cache;
-		if (skb) {
-			skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
-			sk->sk_tx_skb_cache = NULL;
-			pskb_trim(skb, 0);
-			INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
-			skb_shinfo(skb)->tx_flags = 0;
-			memset(TCP_SKB_CB(skb), 0, sizeof(struct tcp_skb_cb));
-			return skb;
-		}
-	}
 	/* The TCP header must be at least 32-bit aligned.  */
 	size = ALIGN(size, 4);

@@ -2920,11 +2903,6 @@ void tcp_write_queue_purge(struct sock *sk)
 		sk_wmem_free_skb(sk, skb);
 	}
 	tcp_rtx_queue_purge(sk);
-	skb = sk->sk_tx_skb_cache;
-	if (skb) {
-		__kfree_skb(skb);
-		sk->sk_tx_skb_cache = NULL;
-	}
 	INIT_LIST_HEAD(&tcp_sk(sk)->tsorted_sent_queue);
 	sk_mem_reclaim(sk);
 	tcp_clear_all_retrans_hints(tcp_sk(sk));
@@ -2961,10 +2939,6 @@ int tcp_disconnect(struct sock *sk, int flags)

 	tcp_clear_xmit_timers(sk);
 	__skb_queue_purge(&sk->sk_receive_queue);
-	if (sk->sk_rx_skb_cache) {
-		__kfree_skb(sk->sk_rx_skb_cache);
-		sk->sk_rx_skb_cache = NULL;
-	}
 	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 	tp->urg_data = 0;
 	tcp_write_queue_purge(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2e62e0d6373a6ee5d98756fb1967bff9743ffa02..29a57bd159f0aa99e892bac56b75961c107f803a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1941,7 +1941,6 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 int tcp_v4_rcv(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
-	struct sk_buff *skb_to_free;
 	int sdif = inet_sdif(skb);
 	int dif = inet_iif(skb);
 	const struct iphdr *iph;
@@ -2082,17 +2081,12 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
-		skb_to_free = sk->sk_rx_skb_cache;
-		sk->sk_rx_skb_cache = NULL;
 		ret = tcp_v4_do_rcv(sk, skb);
 	} else {
 		if (tcp_add_backlog(sk, skb))
 			goto discard_and_relse;
-		skb_to_free = NULL;
 	}
 	bh_unlock_sock(sk);
-	if (skb_to_free)
-		__kfree_skb(skb_to_free);

 put_and_return:
 	if (refcounted)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0ce52d46e4f81b221a6acd4a0dd7b0d462dbac7a..8cf5ff2e95043ec1a2b27661aae884eb13dcf9eb 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1618,7 +1618,6 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 {
-	struct sk_buff *skb_to_free;
 	int sdif = inet6_sdif(skb);
 	int dif = inet6_iif(skb);
 	const struct tcphdr *th;
@@ -1754,17 +1753,12 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
-		skb_to_free = sk->sk_rx_skb_cache;
-		sk->sk_rx_skb_cache = NULL;
 		ret = tcp_v6_do_rcv(sk, skb);
 	} else {
 		if (tcp_add_backlog(sk, skb))
 			goto discard_and_relse;
-		skb_to_free = NULL;
 	}
 	bh_unlock_sock(sk);
-	if (skb_to_free)
-		__kfree_skb(skb_to_free);

 put_and_return:
 	if (refcounted)
 		sock_put(sk);
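For readers unfamiliar with the idea, the "more flexible per-cpu caches" mentioned in the changelog could, in principle, look something like the sketch below. This is purely illustrative and not part of any posted patch; the names skb_pcpu_cache and SKB_PCPU_CACHE_MAX are hypothetical, and each per-cpu queue would still need skb_queue_head_init() at boot (omitted here).

/* Hypothetical sketch only: one small skb free list per CPU, shared by
 * all sockets on that CPU, instead of one cached skb per socket.
 */
#include <linux/skbuff.h>
#include <linux/percpu.h>

#define SKB_PCPU_CACHE_MAX	8	/* arbitrary bound for the sketch */

static DEFINE_PER_CPU(struct sk_buff_head, skb_pcpu_cache);

/* Try to recycle an skb into the local CPU cache instead of freeing it. */
static void skb_pcpu_cache_put(struct sk_buff *skb)
{
	struct sk_buff_head *cache;

	local_bh_disable();		/* stay on this CPU, exclude softirq */
	cache = this_cpu_ptr(&skb_pcpu_cache);
	if (skb_queue_len(cache) < SKB_PCPU_CACHE_MAX && !skb_cloned(skb))
		__skb_queue_head(cache, skb);
	else
		__kfree_skb(skb);
	local_bh_enable();
}

/* Take a cached skb for the local CPU, or NULL if the cache is empty. */
static struct sk_buff *skb_pcpu_cache_get(void)
{
	struct sk_buff *skb;

	local_bh_disable();
	skb = __skb_dequeue(this_cpu_ptr(&skb_pcpu_cache));
	local_bh_enable();
	return skb;
}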
On Wed, 2021-09-01 at 08:16 -0700, Eric Dumazet wrote:
> On Wed, Sep 1, 2021 at 8:06 AM Eric Dumazet <edumazet@google.com> wrote:
> > On Wed, Sep 1, 2021 at 3:52 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > On Wed, 2021-09-01 at 18:39 +0800, Yunsheng Lin wrote:
> > > > Since tcp_tx_skb_cache is disabled by default in:
> > > > commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
> > > >
> > > > Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
> > > > avoid possible branch-misses.
> > > >
> > > > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> > >
> > > Note that MPTCP is currently exploiting sk->sk_tx_skb_cache. If this
> > > patch goes in as-is, it will break mptcp.
> > >
> > > One possible solution would be to let mptcp usage enable
> > > sk->sk_tx_skb_cache, but that has relevant side effects on plain TCP.
> > >
> > > Another option would be to re-work once again the mptcp xmit path to
> > > avoid using sk->sk_tx_skb_cache.
> > >
> >
> > Hmmm, I actually wrote a revert of this feature but forgot to submit
> > it last year.
> >
> > commit c36cfbd791f62c0f7c6b32132af59dfdbe6be21b (HEAD -> listener_scale4)
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Wed May 20 06:38:38 2020 -0700
> >
> >     tcp: remove sk_{tr}x_skb_cache
> >
> >     This reverts the following patches :
> >
> >     2e05fcae83c41eb2df10558338dc600dc783af47 ("tcp: fix compile error if !CONFIG_SYSCTL")
> >     4f661542a40217713f2cee0bb6678fbb30d9d367 ("tcp: fix zerocopy and notsent_lowat issues")
> >     472c2e07eef045145bc1493cc94a01c87140780a ("tcp: add one skb cache for tx")
> >     8b27dae5a2e89a61c46c6dbc76c040c0e6d0ed4c ("tcp: add one skb cache for rx")
> >
> >     Having a cache of one skb (in each direction) per TCP socket is fragile,
> >     since it can cause a significant increase of memory needs,
> >     and not good enough for high speed flows anyway where more than one skb
> >     is needed.
> >
> >     We want instead to add a generic infrastructure, with more flexible per-cpu
> >     caches, for alien NUMA nodes.
> >
> >     Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > I will update this commit to also remove the part in MPTCP.
> >
> > Let's remove this feature and replace it with something less costly.
>
> Paolo, can you work on the MPTCP side, so that my revert can then be applied?

You are way too fast, I was still replying to your previous email, asking
if I could help :)

I'll have a look ASAP. Please, allow for some latency: I'm way slower!

Cheers,

Paolo
On Wed, Sep 1, 2021 at 9:01 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> I think the easiest way and the one with less code duplication will
> require accessing the tcp_mark_push() and skb_entail() helpers from the
> MPTCP code, making them not static and exposing them e.g. in net/tcp.h.
> Would that be acceptable or should I look for other options?
>

I think this is fine, really.
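For readers following along, "exposing them e.g. in net/tcp.h" would amount to something along these lines. This is only a sketch of the idea Paolo describes, not the change that was eventually merged; the exact declarations and placement are assumptions.

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@
-static void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb)
+void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb)
 {
 	TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
 	tp->pushed_seq = tp->write_seq;
 }

--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@
+/* Sketch only: shared with MPTCP, previously static in net/ipv4/tcp.c. */
+void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb);
+void skb_entail(struct sock *sk, struct sk_buff *skb);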
On 2021/9/1 18:39, Yunsheng Lin wrote:
> Since tcp_tx_skb_cache is disabled by default in:
> commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
>
> Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
> avoid possible branch-misses.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> Also, the sk->sk_tx_skb_cache may be both changed by allocation
> and freeing side, I assume there may be some implicit protection
> here too, such as the NAPI protection for rx?

Hi, Eric
Is there any implicit protection for sk->sk_tx_skb_cache?
In my understanding, sk_stream_alloc_skb() seems to be protected
by lock_sock(), while sk_wmem_free_skb() seems to mostly happen
in NAPI polling for TCP (when an ACK packet is received) without
lock_sock(), so it seems there is no protection here?

>
On Wed, Sep 1, 2021 at 5:47 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2021/9/1 18:39, Yunsheng Lin wrote:
> > Since tcp_tx_skb_cache is disabled by default in:
> > commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
> >
> > Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
> > avoid possible branch-misses.
> >
> > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> > ---
> > Also, the sk->sk_tx_skb_cache may be both changed by allocation
> > and freeing side, I assume there may be some implicit protection
> > here too, such as the NAPI protection for rx?
>
> Hi, Eric
> Is there any implicit protection for sk->sk_tx_skb_cache?
> In my understanding, sk_stream_alloc_skb() seems to be protected
> by lock_sock(), while sk_wmem_free_skb() seems to mostly happen
> in NAPI polling for TCP (when an ACK packet is received) without
> lock_sock(), so it seems there is no protection here?
>

Please look again.
This is protected by the socket lock, of course.
Otherwise sk_mem_uncharge() would be very broken; sk->sk_forward_alloc
is not an atomic field.

The TCP stack has no direct relation with NAPI.
It can run over the loopback interface, where there is no NAPI.
On 2021/9/2 9:13, Eric Dumazet wrote:
> On Wed, Sep 1, 2021 at 5:47 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2021/9/1 18:39, Yunsheng Lin wrote:
>>> Since tcp_tx_skb_cache is disabled by default in:
>>> commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")
>>>
>>> Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
>>> avoid possible branch-misses.
>>>
>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>> ---
>>> Also, the sk->sk_tx_skb_cache may be both changed by allocation
>>> and freeing side, I assume there may be some implicit protection
>>> here too, such as the NAPI protection for rx?
>>
>> Hi, Eric
>> Is there any implicit protection for sk->sk_tx_skb_cache?
>> In my understanding, sk_stream_alloc_skb() seems to be protected
>> by lock_sock(), while sk_wmem_free_skb() seems to mostly happen
>> in NAPI polling for TCP (when an ACK packet is received) without
>> lock_sock(), so it seems there is no protection here?
>>
>
> Please look again.
> This is protected by the socket lock, of course.
> Otherwise sk_mem_uncharge() would be very broken; sk->sk_forward_alloc
> is not an atomic field.

Thanks for clarifying.

I have been looking for a point to implement the socket's pp_alloc_cache
for page pool, and sk_wmem_free_skb() seems like the place to avoid the
scalability problem of ptr_ring in page pool.

The protection for sk_wmem_free_skb() is in tcp_v4_rcv(), right?
https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_ipv4.c#L2081

>
> The TCP stack has no direct relation with NAPI.
> It can run over the loopback interface, where there is no NAPI.
> .
>
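To make the locking rules discussed above concrete, here is a condensed sketch of the two contexts involved. It is abridged from the patterns visible in the diffs earlier in this thread, not a verbatim copy of the kernel sources.

/* Process context (e.g. tcp_sendmsg): the caller owns the socket via
 * lock_sock(), so non-atomic fields such as sk->sk_forward_alloc (and the
 * old sk->sk_tx_skb_cache) can be modified safely.
 */
lock_sock(sk);
skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, true);
/* ... queue data; sk_wmem_free_skb() may run on error/purge paths ... */
release_sock(sk);

/* Softirq receive context (tcp_v4_rcv()/tcp_v6_rcv()): bh_lock_sock() plus
 * sock_owned_by_user() decide whether to process the skb now or defer it
 * to the backlog, which the lock_sock() owner drains in release_sock().
 */
bh_lock_sock_nested(sk);
if (!sock_owned_by_user(sk))
	ret = tcp_v4_do_rcv(sk, skb);	/* socket not owned: process here */
else if (tcp_add_backlog(sk, skb))
	goto discard_and_relse;		/* owned: queue to the backlog */
bh_unlock_sock(sk);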
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e8b48df..226ddef 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -866,7 +866,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 {
 	struct sk_buff *skb;

-	if (likely(!size)) {
+	if (static_branch_unlikely(&tcp_tx_skb_cache_key) && likely(!size)) {
 		skb = sk->sk_tx_skb_cache;
 		if (skb) {
 			skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
Since tcp_tx_skb_cache is disabled by default in:
commit 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl")

Add tcp_tx_skb_cache_key checking in sk_stream_alloc_skb() to
avoid possible branch-misses.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
Also, the sk->sk_tx_skb_cache may be both changed by allocation
and freeing side, I assume there may be some implicit protection
here too, such as the NAPI protection for rx?
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
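The patch above relies on the kernel's static-key (jump label) mechanism: when the key is disabled, which is the default for tcp_tx_skb_cache, static_branch_unlikely() is patched into a straight-line no-op and the cold path is skipped with no runtime conditional branch. A minimal generic sketch of that pattern, with a hypothetical key name rather than the TCP one, looks like this:

/* Simplified sketch of the static-key pattern used by the patch above. */
#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(example_feature_key);	/* hypothetical key name */

void hot_path(void)
{
	if (static_branch_unlikely(&example_feature_key)) {
		/* rarely-enabled feature work goes here; with the key
		 * disabled this block is jumped over at no branch cost.
		 */
	}
	/* common fast path continues */
}

/* Flipped at run time, e.g. from a sysctl handler such as
 * proc_do_static_key for the tcp_tx_skb_cache knob.
 */
void enable_example_feature(void)
{
	static_branch_enable(&example_feature_key);
}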