Message ID: 20210719214834.125484-3-john.fastabend@gmail.com
State:      Superseded
Series:     [bpf,1/3] bpf, sockmap: zap ingress queues after stopping strparser
Hi John,

url:      https://github.com/0day-ci/linux/commits/John-Fastabend/sockmap-fixes-picked-up-by-stress-tests/20210720-144138
base:     https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
config:   i386-randconfig-m021-20210720 (attached as .config)
compiler: gcc-10 (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
net/core/skmsg.c:627 sk_psock_backlog() error: uninitialized symbol 'skb'.
net/core/skmsg.c:639 sk_psock_backlog() error: uninitialized symbol 'off'.
net/core/skmsg.c:640 sk_psock_backlog() error: uninitialized symbol 'len'.

vim +/skb +627 net/core/skmsg.c

604326b41a6fb9 Daniel Borkmann 2018-10-13  609  static void sk_psock_backlog(struct work_struct *work)
604326b41a6fb9 Daniel Borkmann 2018-10-13  610  {
604326b41a6fb9 Daniel Borkmann 2018-10-13  611  	struct sk_psock *psock = container_of(work, struct sk_psock, work);
604326b41a6fb9 Daniel Borkmann 2018-10-13  612  	struct sk_psock_work_state *state = &psock->work_state;
604326b41a6fb9 Daniel Borkmann 2018-10-13  613  	struct sk_buff *skb;
604326b41a6fb9 Daniel Borkmann 2018-10-13  614  	bool ingress;
604326b41a6fb9 Daniel Borkmann 2018-10-13  615  	u32 len, off;
604326b41a6fb9 Daniel Borkmann 2018-10-13  616  	int ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  617  
799aa7f98d53e0 Cong Wang       2021-03-30  618  	mutex_lock(&psock->work_mutex);
d1f6b1c794e27f John Fastabend  2021-07-19  619  	if (unlikely(state->skb)) {
d1f6b1c794e27f John Fastabend  2021-07-19  620  		spin_lock_bh(&psock->ingress_lock);
604326b41a6fb9 Daniel Borkmann 2018-10-13  621  		skb = state->skb;
604326b41a6fb9 Daniel Borkmann 2018-10-13  622  		len = state->len;
604326b41a6fb9 Daniel Borkmann 2018-10-13  623  		off = state->off;
604326b41a6fb9 Daniel Borkmann 2018-10-13  624  		state->skb = NULL;
d1f6b1c794e27f John Fastabend  2021-07-19  625  		spin_unlock_bh(&psock->ingress_lock);
604326b41a6fb9 Daniel Borkmann 2018-10-13  626  	}

skb uninitialized on else path.

d1f6b1c794e27f John Fastabend  2021-07-19 @627  	if (skb)
d1f6b1c794e27f John Fastabend  2021-07-19  628  		goto start;
604326b41a6fb9 Daniel Borkmann 2018-10-13  629  
604326b41a6fb9 Daniel Borkmann 2018-10-13  630  	while ((skb = skb_dequeue(&psock->ingress_skb))) {
604326b41a6fb9 Daniel Borkmann 2018-10-13  631  		len = skb->len;
604326b41a6fb9 Daniel Borkmann 2018-10-13  632  		off = 0;
604326b41a6fb9 Daniel Borkmann 2018-10-13  633  start:
e3526bb92a2084 Cong Wang       2021-02-23  634  		ingress = skb_bpf_ingress(skb);
e3526bb92a2084 Cong Wang       2021-02-23  635  		skb_bpf_redirect_clear(skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  636  		do {
604326b41a6fb9 Daniel Borkmann 2018-10-13  637  			ret = -EIO;
799aa7f98d53e0 Cong Wang       2021-03-30  638  			if (!sock_flag(psock->sk, SOCK_DEAD))
604326b41a6fb9 Daniel Borkmann 2018-10-13 @639  				ret = sk_psock_handle_skb(psock, skb, off,
604326b41a6fb9 Daniel Borkmann 2018-10-13 @640  							  len, ingress);
604326b41a6fb9 Daniel Borkmann 2018-10-13  641  			if (ret <= 0) {
604326b41a6fb9 Daniel Borkmann 2018-10-13  642  				if (ret == -EAGAIN) {
d1f6b1c794e27f John Fastabend  2021-07-19  643  					sk_psock_skb_state(psock, state, skb,
d1f6b1c794e27f John Fastabend  2021-07-19  644  							   len, off);
604326b41a6fb9 Daniel Borkmann 2018-10-13  645  					goto end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  646  				}
604326b41a6fb9 Daniel Borkmann 2018-10-13  647  				/* Hard errors break pipe and stop xmit. */
604326b41a6fb9 Daniel Borkmann 2018-10-13  648  				sk_psock_report_error(psock, ret ? -ret : EPIPE);
604326b41a6fb9 Daniel Borkmann 2018-10-13  649  				sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
781dd0431eb549 Cong Wang       2021-06-14  650  				sock_drop(psock->sk, skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  651  				goto end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  652  			}
604326b41a6fb9 Daniel Borkmann 2018-10-13  653  			off += ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  654  			len -= ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  655  		} while (len);
604326b41a6fb9 Daniel Borkmann 2018-10-13  656  
604326b41a6fb9 Daniel Borkmann 2018-10-13  657  		if (!ingress)
604326b41a6fb9 Daniel Borkmann 2018-10-13  658  			kfree_skb(skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  659  	}
604326b41a6fb9 Daniel Borkmann 2018-10-13  660  end:
799aa7f98d53e0 Cong Wang       2021-03-30  661  	mutex_unlock(&psock->work_mutex);
604326b41a6fb9 Daniel Borkmann 2018-10-13  662  }

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
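Dan's annotation above points at the problem: when state->skb is NULL, the if (skb) test at line 627 reads an uninitialized local, and if that garbage value happens to be non-NULL the uninitialized len and off are then used at lines 639-640. A minimal way to make the else path well defined, assuming the respin keeps the current structure (a sketch only, not necessarily how the final fix will look), is to initialize the locals at their declaration:

	/* net/core/skmsg.c, sk_psock_backlog() locals, illustrative only */
-	struct sk_buff *skb;
+	struct sk_buff *skb = NULL;
 	bool ingress;
-	u32 len, off;
+	u32 len = 0, off = 0;
 	int ret;

With skb initialized to NULL, the if (skb) check only fires when a cached skb (and its matching len/off) was actually taken out of work_state, which also quiets all three smatch warnings.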
On Mon, Jul 19, 2021 at 11:48 PM CEST, John Fastabend wrote:
> It's possible that if a socket is closed and the receive thread is under
> memory pressure, it may have cached a skb. We need to ensure these skbs
> are free'd along with the normal ingress_skb queue.
>
> Before 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()"),
> tear down and backlog processing both held sock_lock for the common case
> of socket close or unhash. So it was not possible to have both running
> in parallel and all we would need is the kfree in those kernels.
>
> But, latest kernels include commit 799aa7f98d53 and this requires a bit
> more work. Without the ingress_lock guarding reads and writes of
> state->skb, it's possible the tear down could run before the state
> update, causing it to leak memory. Worse, if the backlog reads the state
> while running interleaved with the tear down, we might end up free'ing
> state->skb from the tear down side while the backlog side still holds a
> reference to it. To resolve such races we wrap the accesses in
> ingress_lock on both sides, serializing the tear down and backlog cases.
> In both cases this only happens after an EAGAIN error, so having an
> extra lock in place is likely fine. The normal path will skip the locks.
>
> Note, we check state->skb before grabbing the lock. This works because
> we can only enqueue while holding the mutex we already hold, avoiding a
> race on adding state->skb after the check. And if the tear down path is
> running, that is also fine: if the tear down path removes state->skb we
> will simply see skb=NULL and the subsequent goto is skipped. This slight
> complication avoids locking in the normal case.
>
> With this fix we no longer see the following warning splat from the tcp
> side on socket close when we hit the above case with redirect to ingress
> self.
>
> [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
> [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
> [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
> [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
> [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
> [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
> [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
> [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
> [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
> [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
> [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
> [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
> [224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
> [224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
> [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [224913.936007] Call Trace:
> [224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
> [224913.936033]  __tcp_close+0x620/0x790
> [224913.936047]  tcp_close+0x20/0x80
> [224913.936056]  inet_release+0x8f/0xf0
> [224913.936070]  __sock_release+0x72/0x120
> [224913.936083]  sock_close+0x14/0x20
>
> Reported-by: Jussi Maki <joamaki@gmail.com>
> Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

This looks fine to me.

I've played around with the idea of wrapping read and write access to
psock->work_state in helpers with set/steal semantics, building on an
example from net/core/fib_rules.c where nla_get_kuid_range() returns a
(u32, u32) pair. But I'm not sure if what I ended up with is actually
nicer. Leaving it for your consideration, if you want to use any of it.
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 96f319099744..ed067986a7b5 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -71,12 +71,16 @@ struct sk_psock_link {
 	void *link_raw;
 };
 
-struct sk_psock_work_state {
-	struct sk_buff *skb;
+struct sk_psock_skb_slice {
 	u32 len;
 	u32 off;
 };
 
+struct sk_psock_work_state {
+	struct sk_buff *skb;
+	struct sk_psock_skb_slice slice;
+};
+
 struct sk_psock {
 	struct sock *sk;
 	struct sock *sk_redir;
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 15d71288e741..da0542074c24 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -590,40 +590,82 @@ static void sock_drop(struct sock *sk, struct sk_buff *skb)
 	kfree_skb(skb);
 }
 
+static void __sk_psock_skb_state_set(struct sk_psock *psock,
+				     struct sk_buff *skb,
+				     const struct sk_psock_skb_slice *slice)
+{
+	struct sk_psock_work_state *state = &psock->work_state;
+
+	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+		state->skb = skb;
+		state->slice = *slice;
+	} else {
+		sock_drop(psock->sk, skb);
+	}
+}
+
+static void sk_psock_skb_state_set(struct sk_psock *psock,
+				   struct sk_buff *skb,
+				   const struct sk_psock_skb_slice *slice)
+{
+	spin_lock_bh(&psock->ingress_lock);
+	__sk_psock_skb_state_set(psock, skb, slice);
+	spin_unlock_bh(&psock->ingress_lock);
+}
+
+static struct sk_psock_skb_slice __sk_psock_skb_state_steal(struct sk_psock *psock,
+							     struct sk_buff **pskb)
+{
+	struct sk_psock_work_state *state = &psock->work_state;
+
+	*pskb = state->skb;
+	state->skb = NULL;
+
+	return state->slice;
+}
+
+static struct sk_psock_skb_slice sk_psock_skb_state_steal(struct sk_psock *psock,
+							   struct sk_buff **pskb)
+{
+	struct sk_psock_skb_slice ret;
+
+	spin_lock_bh(&psock->ingress_lock);
+	ret = __sk_psock_skb_state_steal(psock, pskb);
+	spin_unlock_bh(&psock->ingress_lock);
+
+	return ret;
+}
+
 static void sk_psock_backlog(struct work_struct *work)
 {
 	struct sk_psock *psock = container_of(work, struct sk_psock, work);
 	struct sk_psock_work_state *state = &psock->work_state;
-	struct sk_buff *skb;
+	struct sk_psock_skb_slice slice = {};
+	struct sk_buff *skb = NULL;
 	bool ingress;
-	u32 len, off;
 	int ret;
 
 	mutex_lock(&psock->work_mutex);
 	if (state->skb) {
-		skb = state->skb;
-		len = state->len;
-		off = state->off;
-		state->skb = NULL;
+		slice = sk_psock_skb_state_steal(psock, &skb);
 		goto start;
 	}
 
 	while ((skb = skb_dequeue(&psock->ingress_skb))) {
-		len = skb->len;
-		off = 0;
+		slice.len = skb->len;
+		slice.off = 0;
 start:
 		ingress = skb_bpf_ingress(skb);
 		skb_bpf_redirect_clear(skb);
 		do {
 			ret = -EIO;
 			if (!sock_flag(psock->sk, SOCK_DEAD))
-				ret = sk_psock_handle_skb(psock, skb, off,
-							  len, ingress);
+				ret = sk_psock_handle_skb(psock, skb, slice.off,
+							  slice.len, ingress);
 			if (ret <= 0) {
 				if (ret == -EAGAIN) {
-					state->skb = skb;
-					state->len = len;
-					state->off = off;
+					sk_psock_skb_state_set(psock, skb,
+							       &slice);
 					goto end;
 				}
 				/* Hard errors break pipe and stop xmit. */
@@ -632,9 +674,9 @@ static void sk_psock_backlog(struct work_struct *work)
 				sock_drop(psock->sk, skb);
 				goto end;
 			}
-			off += ret;
-			len -= ret;
-		} while (len);
+			slice.off += ret;
+			slice.len -= ret;
+		} while (slice.len);
 
 		if (!ingress)
 			kfree_skb(skb);
@@ -723,6 +765,12 @@ static void __sk_psock_zap_ingress(struct sk_psock *psock)
 		sock_drop(psock->sk, skb);
 	}
 	__sk_psock_purge_ingress_msg(psock);
+
+	/* We steal the skb here to ensure that calls to sk_psock_backlog
+	 * do not pick up the free'd skb.
+	 */
+	__sk_psock_skb_state_steal(psock, &skb);
+	kfree_skb(skb);
 }
 
 static void sk_psock_link_destroy(struct sk_psock *psock)
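The helpers above boil down to returning a small struct by value while clearing the cached pointer under the lock, so exactly one side ends up owning the skb. Stripped to the bare idiom, with hypothetical names and in plain C rather than kernel code, the steal side looks roughly like this:

struct slice {
	unsigned int len;
	unsigned int off;
};

struct cache {
	void *obj;		/* cached object, NULL when empty */
	struct slice slice;	/* progress recorded for the cached object */
};

/* Hand the cached object and its slice to the caller and clear the
 * cache, so a concurrent teardown cannot free the same object twice.
 */
static struct slice cache_steal(struct cache *c, void **pobj)
{
	*pobj = c->obj;
	c->obj = NULL;
	return c->slice;
}

Whether returning the (len, off) pair by value reads better than filling output parameters is mostly a matter of taste; the locking is the same either way.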
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 28115ef742e8..5d956e91d05a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -590,6 +590,22 @@ static void sock_drop(struct sock *sk, struct sk_buff *skb)
 	kfree_skb(skb);
 }
 
+static void sk_psock_skb_state(struct sk_psock *psock,
+			       struct sk_psock_work_state *state,
+			       struct sk_buff *skb,
+			       int len, int off)
+{
+	spin_lock_bh(&psock->ingress_lock);
+	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+		state->skb = skb;
+		state->len = len;
+		state->off = off;
+	} else {
+		sock_drop(psock->sk, skb);
+	}
+	spin_unlock_bh(&psock->ingress_lock);
+}
+
 static void sk_psock_backlog(struct work_struct *work)
 {
 	struct sk_psock *psock = container_of(work, struct sk_psock, work);
@@ -600,13 +616,16 @@ static void sk_psock_backlog(struct work_struct *work)
 	int ret;
 
 	mutex_lock(&psock->work_mutex);
-	if (state->skb) {
+	if (unlikely(state->skb)) {
+		spin_lock_bh(&psock->ingress_lock);
 		skb = state->skb;
 		len = state->len;
 		off = state->off;
 		state->skb = NULL;
-		goto start;
+		spin_unlock_bh(&psock->ingress_lock);
 	}
+	if (skb)
+		goto start;
 
 	while ((skb = skb_dequeue(&psock->ingress_skb))) {
 		len = skb->len;
@@ -621,9 +640,8 @@ static void sk_psock_backlog(struct work_struct *work)
 						   len, ingress);
 			if (ret <= 0) {
 				if (ret == -EAGAIN) {
-					state->skb = skb;
-					state->len = len;
-					state->off = off;
+					sk_psock_skb_state(psock, state, skb,
+							   len, off);
 					goto end;
 				}
 				/* Hard errors break pipe and stop xmit. */
@@ -722,6 +740,11 @@ static void __sk_psock_zap_ingress(struct sk_psock *psock)
 		skb_bpf_redirect_clear(skb);
 		sock_drop(psock->sk, skb);
 	}
+	kfree_skb(psock->work_state.skb);
+	/* We null the skb here to ensure that calls to sk_psock_backlog
+	 * do not pick up the free'd skb.
+	 */
+	psock->work_state.skb = NULL;
 	__sk_psock_purge_ingress_msg(psock);
 }
It's possible that if a socket is closed and the receive thread is under
memory pressure, it may have cached a skb. We need to ensure these skbs
are free'd along with the normal ingress_skb queue.

Before 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()"),
tear down and backlog processing both held sock_lock for the common case
of socket close or unhash. So it was not possible to have both running
in parallel and all we would need is the kfree in those kernels.

But, latest kernels include commit 799aa7f98d53 and this requires a bit
more work. Without the ingress_lock guarding reads and writes of
state->skb, it's possible the tear down could run before the state
update, causing it to leak memory. Worse, if the backlog reads the state
while running interleaved with the tear down, we might end up free'ing
state->skb from the tear down side while the backlog side still holds a
reference to it. To resolve such races we wrap the accesses in
ingress_lock on both sides, serializing the tear down and backlog cases.
In both cases this only happens after an EAGAIN error, so having an
extra lock in place is likely fine. The normal path will skip the locks.

Note, we check state->skb before grabbing the lock. This works because
we can only enqueue while holding the mutex we already hold, avoiding a
race on adding state->skb after the check. And if the tear down path is
running, that is also fine: if the tear down path removes state->skb we
will simply see skb=NULL and the subsequent goto is skipped. This slight
complication avoids locking in the normal case.

With this fix we no longer see the following warning splat from the tcp
side on socket close when we hit the above case with redirect to ingress
self.

[224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
[224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
[224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
[224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
[224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
[224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
[224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
[224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
[224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
[224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
[224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
[224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
[224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
[224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[224913.936007] Call Trace:
[224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
[224913.936033]  __tcp_close+0x620/0x790
[224913.936047]  tcp_close+0x20/0x80
[224913.936056]  inet_release+0x8f/0xf0
[224913.936070]  __sock_release+0x72/0x120
[224913.936083]  sock_close+0x14/0x20

Reported-by: Jussi Maki <joamaki@gmail.com>
Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)