Message ID | tencent_8D9918CF808DC6A5FF8DCB12055DECDDF107@qq.com |
---|---|
State | Accepted |
Commit | e3203b17771757fdcd259d6378673f1590e36694 |
Headers | show |
Series | [v2] bluetooth/l2cap: sync sock recv cb and release | expand |
Hello: This patch was applied to bluetooth/bluetooth-next.git (master) by Luiz Augusto von Dentz <luiz.von.dentz@intel.com>: On Sat, 15 Jun 2024 09:45:54 +0800 you wrote: > The problem occurs between the system call to close the sock and hci_rx_work, > where the former releases the sock and the latter accesses it without lock protection. > > CPU0 CPU1 > ---- ---- > sock_close hci_rx_work > l2cap_sock_release hci_acldata_packet > l2cap_sock_kill l2cap_recv_frame > sk_free l2cap_conless_channel > l2cap_sock_recv_cb > > [...] Here is the summary with links: - [v2] bluetooth/l2cap: sync sock recv cb and release https://git.kernel.org/bluetooth/bluetooth-next/c/e3203b177717 You are awesome, thank you!
Hi Edward, On Fri, Jun 14, 2024 at 9:46 PM Edward Adam Davis <eadavis@qq.com> wrote: > > The problem occurs between the system call to close the sock and hci_rx_work, > where the former releases the sock and the latter accesses it without lock protection. > > CPU0 CPU1 > ---- ---- > sock_close hci_rx_work > l2cap_sock_release hci_acldata_packet > l2cap_sock_kill l2cap_recv_frame > sk_free l2cap_conless_channel > l2cap_sock_recv_cb > > If hci_rx_work processes the data that needs to be received before the sock is > closed, then everything is normal; Otherwise, the work thread may access the > released sock when receiving data. > > Add a chan mutex in the rx callback of the sock to achieve synchronization between > the sock release and recv cb. > > Sock is dead, so set chan data to NULL, avoid others use invalid sock pointer. > > Reported-and-tested-by: syzbot+b7f6f8c9303466e16c8a@syzkaller.appspotmail.com > Signed-off-by: Edward Adam Davis <eadavis@qq.com> > --- > net/bluetooth/l2cap_sock.c | 25 ++++++++++++++++++++++--- > 1 file changed, 22 insertions(+), 3 deletions(-) > > diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c > index 6db60946c627..f45cdf9bc985 100644 > --- a/net/bluetooth/l2cap_sock.c > +++ b/net/bluetooth/l2cap_sock.c > @@ -1239,6 +1239,10 @@ static void l2cap_sock_kill(struct sock *sk) > > BT_DBG("sk %p state %s", sk, state_to_string(sk->sk_state)); > > + /* Sock is dead, so set chan data to NULL, avoid other task use invalid > + * sock pointer. > + */ > + l2cap_pi(sk)->chan->data = NULL; > /* Kill poor orphan */ > > l2cap_chan_put(l2cap_pi(sk)->chan); > @@ -1481,12 +1485,25 @@ static struct l2cap_chan *l2cap_sock_new_connection_cb(struct l2cap_chan *chan) > > static int l2cap_sock_recv_cb(struct l2cap_chan *chan, struct sk_buff *skb) > { > - struct sock *sk = chan->data; > - struct l2cap_pinfo *pi = l2cap_pi(sk); > + struct sock *sk; > + struct l2cap_pinfo *pi; > int err; > > - lock_sock(sk); > + /* To avoid race with sock_release, a chan lock needs to be added here > + * to synchronize the sock. > + */ > + l2cap_chan_hold(chan); > + l2cap_chan_lock(chan); > + sk = chan->data; > > + if (!sk) { > + l2cap_chan_unlock(chan); > + l2cap_chan_put(chan); > + return -ENXIO; > + } > + > + pi = l2cap_pi(sk); > + lock_sock(sk); > if (chan->mode == L2CAP_MODE_ERTM && !list_empty(&pi->rx_busy)) { > err = -ENOMEM; > goto done; > @@ -1535,6 +1552,8 @@ static int l2cap_sock_recv_cb(struct l2cap_chan *chan, struct sk_buff *skb) > > done: > release_sock(sk); > + l2cap_chan_unlock(chan); > + l2cap_chan_put(chan); > > return err; > } > -- > 2.43.0 Looks like this was never really tested properly: ============================================ WARNING: possible recursive locking detected 6.10.0-rc3-g4029dba6b6f1 #6823 Not tainted -------------------------------------------- kworker/u5:0/35 is trying to acquire lock: ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: l2cap_sock_recv_cb+0x44/0x1e0 but task is already holding lock: ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: l2cap_get_chan_by_scid+0xaf/0xd0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&chan->lock#2/1); lock(&chan->lock#2/1); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by kworker/u5:0/35: #0: ffff888002b8a940 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work+0x750/0x930 #1: ffff888002c67dd0 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_one_work+0x44e/0x930 #2: ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: l2cap_get_chan_by_scid+0xaf/0xd0 l2cap_sock_recv_cb is assumed to be called with the chan_lock held so perhaps we can just do: sk = chan->data; if (!sk) return -ENXIO;
Hi Luiz Augusto von Dentz, On Thu, 20 Jun 2024 12:53:19 -0400, Luiz Augusto von Dentz wrote: > > release_sock(sk); > > + l2cap_chan_unlock(chan); > > + l2cap_chan_put(chan); > > > > return err; > > } > > -- > > 2.43.0 > > Looks like this was never really tested properly: > > ============================================ > WARNING: possible recursive locking detected > 6.10.0-rc3-g4029dba6b6f1 #6823 Not tainted > -------------------------------------------- > kworker/u5:0/35 is trying to acquire lock: > ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: > l2cap_sock_recv_cb+0x44/0x1e0 > > but task is already holding lock: > ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: > l2cap_get_chan_by_scid+0xaf/0xd0 > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock(&chan->lock#2/1); > lock(&chan->lock#2/1); > > *** DEADLOCK *** > > May be due to missing lock nesting notation > > 3 locks held by kworker/u5:0/35: > #0: ffff888002b8a940 ((wq_completion)hci0#2){+.+.}-{0:0}, at: > process_one_work+0x750/0x930 > #1: ffff888002c67dd0 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, > at: process_one_work+0x44e/0x930 > #2: ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: > l2cap_get_chan_by_scid+0xaf/0xd0 > > l2cap_sock_recv_cb is assumed to be called with the chan_lock held so > perhaps we can just do: > > sk = chan->data; > if (!sk) > return -ENXIO; If the release occurs after this judgment, the same problem will still occur. Recv and release must be synchronized using locks, which can be solved by adding new lock. Can you provide a reproduction program for the AA lock mentioned above? -- Edward
Hi, pe, 2024-06-21 kello 22:45 +0800, Edward Adam Davis kirjoitti: > Hi Luiz Augusto von Dentz, > > On Thu, 20 Jun 2024 12:53:19 -0400, Luiz Augusto von Dentz wrote: > > > release_sock(sk); > > > + l2cap_chan_unlock(chan); > > > + l2cap_chan_put(chan); > > > > > > return err; > > > } > > > -- > > > 2.43.0 > > > > Looks like this was never really tested properly: > > > > ============================================ > > WARNING: possible recursive locking detected > > 6.10.0-rc3-g4029dba6b6f1 #6823 Not tainted > > -------------------------------------------- > > kworker/u5:0/35 is trying to acquire lock: > > ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: > > l2cap_sock_recv_cb+0x44/0x1e0 > > > > but task is already holding lock: > > ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: > > l2cap_get_chan_by_scid+0xaf/0xd0 > > > > other info that might help us debug this: > > Possible unsafe locking scenario: > > > > CPU0 > > ---- > > lock(&chan->lock#2/1); > > lock(&chan->lock#2/1); > > > > *** DEADLOCK *** > > > > May be due to missing lock nesting notation > > > > 3 locks held by kworker/u5:0/35: > > #0: ffff888002b8a940 ((wq_completion)hci0#2){+.+.}-{0:0}, at: > > process_one_work+0x750/0x930 > > #1: ffff888002c67dd0 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, > > at: process_one_work+0x44e/0x930 > > #2: ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: > > l2cap_get_chan_by_scid+0xaf/0xd0 > > > > l2cap_sock_recv_cb is assumed to be called with the chan_lock held so > > perhaps we can just do: > > > > sk = chan->data; > > if (!sk) > > return -ENXIO; > > If the release occurs after this judgment, the same problem will still occur. > Recv and release must be synchronized using locks, which can be solved by > adding new lock. > > Can you provide a reproduction program for the AA lock mentioned above? eg. l2cap_data_channel() calls l2cap_sock_recv_cb with chan->lock (and conn->chan_lock) held. All callsites eg. l2cap_raw_recv() don't seem to hold chan->lock, so you'd probably need to check the locking on all of them, if you want to use chan->lock for this synchronization.
Hi Luiz Augusto von Dentz, On Thu, 20 Jun 2024 12:53:19 -0400, Luiz Augusto von Dentz wrote: > > release_sock(sk); > > + l2cap_chan_unlock(chan); > > + l2cap_chan_put(chan); > > > > return err; > > } > > -- > > 2.43.0 > > Looks like this was never really tested properly: > > ============================================ > WARNING: possible recursive locking detected > 6.10.0-rc3-g4029dba6b6f1 #6823 Not tainted > -------------------------------------------- > kworker/u5:0/35 is trying to acquire lock: > ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: > l2cap_sock_recv_cb+0x44/0x1e0 > > but task is already holding lock: > ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: > l2cap_get_chan_by_scid+0xaf/0xd0 > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock(&chan->lock#2/1); > lock(&chan->lock#2/1); > > *** DEADLOCK *** > > May be due to missing lock nesting notation > > 3 locks held by kworker/u5:0/35: > #0: ffff888002b8a940 ((wq_completion)hci0#2){+.+.}-{0:0}, at: > process_one_work+0x750/0x930 > #1: ffff888002c67dd0 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, > at: process_one_work+0x44e/0x930 > #2: ffff888002ec2510 (&chan->lock#2/1){+.+.}-{3:3}, at: > l2cap_get_chan_by_scid+0xaf/0xd0 > > l2cap_sock_recv_cb is assumed to be called with the chan_lock held so > perhaps we can just do: > > sk = chan->data; > if (!sk) > return -ENXIO; If the release occurs after this judgment, the same problem will still occur. Recv and release must be synchronized using locks, which can be solved by adding new lock. Please use the new patch https://syzkaller.appspot.com/text?tag=Patch&x=15d2c48e980000, I have tested in https://syzkaller.appspot.com/bug?extid=b7f6f8c9303466e16c8a -- Edward
diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c index 6db60946c627..f45cdf9bc985 100644 --- a/net/bluetooth/l2cap_sock.c +++ b/net/bluetooth/l2cap_sock.c @@ -1239,6 +1239,10 @@ static void l2cap_sock_kill(struct sock *sk) BT_DBG("sk %p state %s", sk, state_to_string(sk->sk_state)); + /* Sock is dead, so set chan data to NULL, avoid other task use invalid + * sock pointer. + */ + l2cap_pi(sk)->chan->data = NULL; /* Kill poor orphan */ l2cap_chan_put(l2cap_pi(sk)->chan); @@ -1481,12 +1485,25 @@ static struct l2cap_chan *l2cap_sock_new_connection_cb(struct l2cap_chan *chan) static int l2cap_sock_recv_cb(struct l2cap_chan *chan, struct sk_buff *skb) { - struct sock *sk = chan->data; - struct l2cap_pinfo *pi = l2cap_pi(sk); + struct sock *sk; + struct l2cap_pinfo *pi; int err; - lock_sock(sk); + /* To avoid race with sock_release, a chan lock needs to be added here + * to synchronize the sock. + */ + l2cap_chan_hold(chan); + l2cap_chan_lock(chan); + sk = chan->data; + if (!sk) { + l2cap_chan_unlock(chan); + l2cap_chan_put(chan); + return -ENXIO; + } + + pi = l2cap_pi(sk); + lock_sock(sk); if (chan->mode == L2CAP_MODE_ERTM && !list_empty(&pi->rx_busy)) { err = -ENOMEM; goto done; @@ -1535,6 +1552,8 @@ static int l2cap_sock_recv_cb(struct l2cap_chan *chan, struct sk_buff *skb) done: release_sock(sk); + l2cap_chan_unlock(chan); + l2cap_chan_put(chan); return err; }
The problem occurs between the system call to close the sock and hci_rx_work, where the former releases the sock and the latter accesses it without lock protection. CPU0 CPU1 ---- ---- sock_close hci_rx_work l2cap_sock_release hci_acldata_packet l2cap_sock_kill l2cap_recv_frame sk_free l2cap_conless_channel l2cap_sock_recv_cb If hci_rx_work processes the data that needs to be received before the sock is closed, then everything is normal; Otherwise, the work thread may access the released sock when receiving data. Add a chan mutex in the rx callback of the sock to achieve synchronization between the sock release and recv cb. Sock is dead, so set chan data to NULL, avoid others use invalid sock pointer. Reported-and-tested-by: syzbot+b7f6f8c9303466e16c8a@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> --- net/bluetooth/l2cap_sock.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-)