mbox series

[RFC,0/7] Fix a memory leak in Bluetooth L2CAP

Message ID 20230904221158.35425-1-olivier.lheureux@mind.be
Headers show
Series Fix a memory leak in Bluetooth L2CAP | expand

Message

Olivier L'Heureux Sept. 4, 2023, 10:11 p.m. UTC
Request for Comments

Summary
=======

We have found a memory leak in the Bluetooth L2CAP subsystem and fixed
it, but this opened the doors to use-after-free problems, which are
not completely fixed yet. This patch series present a way to reproduce
it, the proposed fix and the status. There is more detailed
documentation in [1].

Memory Leak Overview
====================

We have found a memory leak in the L2CAP layer of Linux's Bluetooth
Subsystem, the same as reported by syzbot in "[syzbot] [bluetooth?]
memory leak in hci_conn_add (2)" (2023-09-02 23:25:00 -0700) [19].

We can reproduce it. When, in a loop, a user-mode program:

 1. Opens a Bluetooth socket at the L2CAP layer:

          sockfd = socket(AF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_L2CAP);

 2. Set a timeout on the socket:

          timeout.tv_sec  = 2;
          timeout.tv_usec = 0;

          err = setsockopt(sockfd, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));

 3. Connect to a specific Bluetooth address:

          struct sockaddr_l2 ble_addr;

          memset(&ble_addr, 0, sizeof(ble_addr));
          ble_addr.l2_family = AF_BLUETOOTH;
          ble_addr.l2_psm = htobs(0x80 /* L2CAP_PSM_LE_DYN_START */);
          // l2_bdaddr=00:0a:07:a3:22:00
          ble_addr.l2_bdaddr = *(bdaddr_t*)("\x00\x0a\x07\xa3\x22\x00");
          ble_addr.l2_cid = htobs(0);
          ble_addr.l2_bdaddr_type = BDADDR_LE_PUBLIC;

          err = connect(sockfd, (struct sockaddr*)&ble_addr, sizeof(ble_addr));

and when the user program does those three steps in a loop, then the
kernel leaks the "struct l2cap_conn" [5] and "struct hci_conn" [6]
objects allocated in "l2cap_conn_add()" [2] and "hci_conn_add()" [3].
Those objects are never freed.

The "ble-memleak-repro" program reproduces the memory leak, if the
kernel is not patched. Its source is in
"package/ble-memleak-repro/ble-memleak-repro.c" [18].

Memory Leak Fix
===============

We have fixed the memory leak, see the patch series in
"patches/linux/":

 1. The first patch
    "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
    enables Bluetooth on the DK2.
 2. The second patch
    "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
    adds many dynamic debug traces that help understand the kernel
    behaviour and freeing problems.
 3. Patches
    "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
    and
    "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
    fix the memory leak.
 4. Patches
    "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
    "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
    and
    "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
    fixes the use-after-free that appears when the "struct l2cap_conn"
    [5] and "struct hci_conn" [6] objects are freed.

The commit messages explain why the commit is needed, which problem
the commit solves, and how.

The first and second patches are present for the memory leak
reproduction only, they should not be part of a final fix.

Patch Status
============

As of writing, the memory leak is fixed. The fix opened the door to
other problems, especially use-after-free, sometimes followed by
crashes due to NULL dereferences. We think there are weak references
(i.e. pointers that don't increment the refcounter) previously
pointing to memory that was never freed, but now is freed. On kernels
v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
use-after-free and NULL dereferences, but not on kernel v6.5 yet.

Reproducing with Buildroot
==========================

We have reproduced and fixed the memory leak with Buildroot [7] and a
"ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].

The "ble-memleak-repro" repository [1] contains the sources of a
complete external Buildroot customisation [8], with the patches, a
README, and more explanations to reproduce the problem with Buildroot
on a DK2.

References:
===========

- [1]: <https://gitlab.com/essensium-mind/ble-memleak-repro.git>
       "ble-memleak-repro repository"

- [2]: <https://elixir.bootlin.com/linux/v6.5/source/net/bluetooth/l2cap_core.c#L7833>
       "l2cap_conn_add()"

- [3]: <https://elixir.bootlin.com/linux/v6.5/source/net/bluetooth/hci_conn.c#L986>
       "hci_conn_add()"

- [4]: <https://www.st.com/content/st_com/en/products/evaluation-tools/product-evaluation-tools/mcu-mpu-eval-tools/stm32-mcu-mpu-eval-tools/stm32-discovery-kits/stm32mp157c-dk2.html>
       "STM32MP157C-DK2 (DK2)"

- [5]: <https://elixir.bootlin.com/linux/v6.5/source/include/net/bluetooth/l2cap.h#L674>
       "struct l2cap_conn"

- [6]: <https://elixir.bootlin.com/linux/v6.5/source/include/net/bluetooth/hci_core.h#L685>
       "struct hci_conn"

- [7]: <https://buildroot.org/>
       "Buildroot is a tool to generate embedded Linux systems"

- [8]: <https://buildroot.org/downloads/manual/manual.html#outside-br-custom>
       "9.2. Keeping customizations outside of Buildroot"

- [11]: <patches/linux/0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch>
        "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch"

- [12]: <patches/linux/0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch>
        "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch"

- [13]: <patches/linux/0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch>
        "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch"

- [14]: <patches/linux/0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch>
        "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch"

- [15]: <patches/linux/0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch>
        "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch"

- [16]: <patches/linux/0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch>
        "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch"

- [17]: <patches/linux/0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch>
        "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch"

- [18]: <https://gitlab.com/essensium-mind/ble-memleak-repro/-/blob/main/package/ble-memleak-repro/ble-memleak-repro.c?ref_type=heads>
        "ble-memleak-repro.c"

- [19]: <https://lore.kernel.org/linux-bluetooth/0000000000000fd01206046e7410@google.com/T/#u>
        "[syzbot] [bluetooth?] memory leak in hci_conn_add (2)"
        2023-09-02 23:25:00 -0700

---

Christophe Roullier (1):
  ARM: dts: stm32: Add Bluetooth (usart2) feature on stm32mp157x

Olivier L'Heureux (6):
  Bluetooth: add many traces for allocation/free/refcounting
  Bluetooth: L2CAP: use refcount on struct l2cap_chan->conn
  Bluetooth: L2CAP: free the leaking struct l2cap_conn
  Bluetooth: introduce hci_conn_free() for better structure
  Bluetooth: L2CAP: inc refcount if reuse struct l2cap_conn
  Bluetooth: unlink objects to avoid use-after-free

 arch/arm/boot/dts/st/stm32mp157c-dk2.dts | 11 ++++++-
 include/net/bluetooth/hci_core.h         |  7 +++--
 net/bluetooth/hci_conn.c                 | 18 ++++++++++++
 net/bluetooth/hci_sysfs.c                |  8 ++++-
 net/bluetooth/l2cap_core.c               | 37 ++++++++++++++++++++----
 5 files changed, 72 insertions(+), 9 deletions(-)


base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c

Comments

Luiz Augusto von Dentz Sept. 5, 2023, 8:42 p.m. UTC | #1
Hi Olivier,

On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
<olivier.lheureux@mind.be> wrote:
>
> Request for Comments
>
> Summary
> =======
>
> We have found a memory leak in the Bluetooth L2CAP subsystem and fixed
> it, but this opened the doors to use-after-free problems, which are
> not completely fixed yet. This patch series present a way to reproduce
> it, the proposed fix and the status. There is more detailed
> documentation in [1].
>
> Memory Leak Overview
> ====================
>
> We have found a memory leak in the L2CAP layer of Linux's Bluetooth
> Subsystem, the same as reported by syzbot in "[syzbot] [bluetooth?]
> memory leak in hci_conn_add (2)" (2023-09-02 23:25:00 -0700) [19].
>
> We can reproduce it. When, in a loop, a user-mode program:
>
>  1. Opens a Bluetooth socket at the L2CAP layer:
>
>           sockfd = socket(AF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_L2CAP);
>
>  2. Set a timeout on the socket:
>
>           timeout.tv_sec  = 2;
>           timeout.tv_usec = 0;
>
>           err = setsockopt(sockfd, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));
>
>  3. Connect to a specific Bluetooth address:
>
>           struct sockaddr_l2 ble_addr;
>
>           memset(&ble_addr, 0, sizeof(ble_addr));
>           ble_addr.l2_family = AF_BLUETOOTH;
>           ble_addr.l2_psm = htobs(0x80 /* L2CAP_PSM_LE_DYN_START */);
>           // l2_bdaddr=00:0a:07:a3:22:00
>           ble_addr.l2_bdaddr = *(bdaddr_t*)("\x00\x0a\x07\xa3\x22\x00");
>           ble_addr.l2_cid = htobs(0);
>           ble_addr.l2_bdaddr_type = BDADDR_LE_PUBLIC;
>
>           err = connect(sockfd, (struct sockaddr*)&ble_addr, sizeof(ble_addr));
>
> and when the user program does those three steps in a loop, then the
> kernel leaks the "struct l2cap_conn" [5] and "struct hci_conn" [6]
> objects allocated in "l2cap_conn_add()" [2] and "hci_conn_add()" [3].
> Those objects are never freed.
>
> The "ble-memleak-repro" program reproduces the memory leak, if the
> kernel is not patched. Its source is in
> "package/ble-memleak-repro/ble-memleak-repro.c" [18].

We should probably create a test in l2cap-tester using SO_SNDTIMEO
first, so we can make sure CI test such case and we are able to detect
if the problem is reintroduced later:

https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c

> Memory Leak Fix
> ===============
>
> We have fixed the memory leak, see the patch series in
> "patches/linux/":
>
>  1. The first patch
>     "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
>     enables Bluetooth on the DK2.

This above should probably be sent separately.

>  2. The second patch
>     "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
>     adds many dynamic debug traces that help understand the kernel
>     behaviour and freeing problems.

I'm fine with this change, but we better use the likes of bt_dev_dbg
instead of BT_DBG.

>  3. Patches
>     "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
>     and
>     "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
>     fix the memory leak.
>  4. Patches
>     "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
>     "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
>     and
>     "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
>     fixes the use-after-free that appears when the "struct l2cap_conn"
>     [5] and "struct hci_conn" [6] objects are freed.

These I'm not very comfortable applying as they are, I'm afraid there
could be regressions if they are not accompanied with proper tests,
besides I think there are less intrusive ways to cleanup l2cap_conn,
see below.

> The commit messages explain why the commit is needed, which problem
> the commit solves, and how.
>
> The first and second patches are present for the memory leak
> reproduction only, they should not be part of a final fix.
>
> Patch Status
> ============
>
> As of writing, the memory leak is fixed. The fix opened the door to
> other problems, especially use-after-free, sometimes followed by
> crashes due to NULL dereferences. We think there are weak references
> (i.e. pointers that don't increment the refcounter) previously
> pointing to memory that was never freed, but now is freed. On kernels
> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
> use-after-free and NULL dereferences, but not on kernel v6.5 yet.

I think the problem is that the lifetime of l2cap_conn shall be hooked
with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
don't actually trigger l2cap_conn_del, which imo is the culprit here,
because connect_cfm/disconnect_cfm is not called when the connection
procedure has been aborted prematurely, so perhaps we shall get rid of
the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
cleanup with in the following order:
hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
that way we avoid having multiple code paths attempting to cleanup
objects associated with hci_conn/hci_chan.

I'd change hci_chan_create to take a del callback to avoid having
circular dependencies on one another though.

> Reproducing with Buildroot
> ==========================
>
> We have reproduced and fixed the memory leak with Buildroot [7] and a
> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
>
> The "ble-memleak-repro" repository [1] contains the sources of a
> complete external Buildroot customisation [8], with the patches, a
> README, and more explanations to reproduce the problem with Buildroot
> on a DK2.
>
> References:
> ===========
>
> - [1]: <https://gitlab.com/essensium-mind/ble-memleak-repro.git>
>        "ble-memleak-repro repository"
>
> - [2]: <https://elixir.bootlin.com/linux/v6.5/source/net/bluetooth/l2cap_core.c#L7833>
>        "l2cap_conn_add()"
>
> - [3]: <https://elixir.bootlin.com/linux/v6.5/source/net/bluetooth/hci_conn.c#L986>
>        "hci_conn_add()"
>
> - [4]: <https://www.st.com/content/st_com/en/products/evaluation-tools/product-evaluation-tools/mcu-mpu-eval-tools/stm32-mcu-mpu-eval-tools/stm32-discovery-kits/stm32mp157c-dk2.html>
>        "STM32MP157C-DK2 (DK2)"
>
> - [5]: <https://elixir.bootlin.com/linux/v6.5/source/include/net/bluetooth/l2cap.h#L674>
>        "struct l2cap_conn"
>
> - [6]: <https://elixir.bootlin.com/linux/v6.5/source/include/net/bluetooth/hci_core.h#L685>
>        "struct hci_conn"
>
> - [7]: <https://buildroot.org/>
>        "Buildroot is a tool to generate embedded Linux systems"
>
> - [8]: <https://buildroot.org/downloads/manual/manual.html#outside-br-custom>
>        "9.2. Keeping customizations outside of Buildroot"
>
> - [11]: <patches/linux/0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch>
>         "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch"
>
> - [12]: <patches/linux/0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch>
>         "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch"
>
> - [13]: <patches/linux/0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch>
>         "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch"
>
> - [14]: <patches/linux/0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch>
>         "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch"
>
> - [15]: <patches/linux/0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch>
>         "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch"
>
> - [16]: <patches/linux/0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch>
>         "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch"
>
> - [17]: <patches/linux/0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch>
>         "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch"
>
> - [18]: <https://gitlab.com/essensium-mind/ble-memleak-repro/-/blob/main/package/ble-memleak-repro/ble-memleak-repro.c?ref_type=heads>
>         "ble-memleak-repro.c"
>
> - [19]: <https://lore.kernel.org/linux-bluetooth/0000000000000fd01206046e7410@google.com/T/#u>
>         "[syzbot] [bluetooth?] memory leak in hci_conn_add (2)"
>         2023-09-02 23:25:00 -0700
>
> ---
>
> Christophe Roullier (1):
>   ARM: dts: stm32: Add Bluetooth (usart2) feature on stm32mp157x
>
> Olivier L'Heureux (6):
>   Bluetooth: add many traces for allocation/free/refcounting
>   Bluetooth: L2CAP: use refcount on struct l2cap_chan->conn
>   Bluetooth: L2CAP: free the leaking struct l2cap_conn
>   Bluetooth: introduce hci_conn_free() for better structure
>   Bluetooth: L2CAP: inc refcount if reuse struct l2cap_conn
>   Bluetooth: unlink objects to avoid use-after-free
>
>  arch/arm/boot/dts/st/stm32mp157c-dk2.dts | 11 ++++++-
>  include/net/bluetooth/hci_core.h         |  7 +++--
>  net/bluetooth/hci_conn.c                 | 18 ++++++++++++
>  net/bluetooth/hci_sysfs.c                |  8 ++++-
>  net/bluetooth/l2cap_core.c               | 37 ++++++++++++++++++++----
>  5 files changed, 72 insertions(+), 9 deletions(-)
>
>
> base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
> --
> 2.39.2
>
Luiz Augusto von Dentz Sept. 12, 2023, 7:19 p.m. UTC | #2
Hi Olivier,

On Tue, Sep 5, 2023 at 1:42 PM Luiz Augusto von Dentz
<luiz.dentz@gmail.com> wrote:
>
> Hi Olivier,
>
> On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
> <olivier.lheureux@mind.be> wrote:
> >
> > Request for Comments
> >
> > Summary
> > =======
> >
> > We have found a memory leak in the Bluetooth L2CAP subsystem and fixed
> > it, but this opened the doors to use-after-free problems, which are
> > not completely fixed yet. This patch series present a way to reproduce
> > it, the proposed fix and the status. There is more detailed
> > documentation in [1].
> >
> > Memory Leak Overview
> > ====================
> >
> > We have found a memory leak in the L2CAP layer of Linux's Bluetooth
> > Subsystem, the same as reported by syzbot in "[syzbot] [bluetooth?]
> > memory leak in hci_conn_add (2)" (2023-09-02 23:25:00 -0700) [19].
> >
> > We can reproduce it. When, in a loop, a user-mode program:
> >
> >  1. Opens a Bluetooth socket at the L2CAP layer:
> >
> >           sockfd = socket(AF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_L2CAP);
> >
> >  2. Set a timeout on the socket:
> >
> >           timeout.tv_sec  = 2;
> >           timeout.tv_usec = 0;
> >
> >           err = setsockopt(sockfd, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));
> >
> >  3. Connect to a specific Bluetooth address:
> >
> >           struct sockaddr_l2 ble_addr;
> >
> >           memset(&ble_addr, 0, sizeof(ble_addr));
> >           ble_addr.l2_family = AF_BLUETOOTH;
> >           ble_addr.l2_psm = htobs(0x80 /* L2CAP_PSM_LE_DYN_START */);
> >           // l2_bdaddr=00:0a:07:a3:22:00
> >           ble_addr.l2_bdaddr = *(bdaddr_t*)("\x00\x0a\x07\xa3\x22\x00");
> >           ble_addr.l2_cid = htobs(0);
> >           ble_addr.l2_bdaddr_type = BDADDR_LE_PUBLIC;
> >
> >           err = connect(sockfd, (struct sockaddr*)&ble_addr, sizeof(ble_addr));
> >
> > and when the user program does those three steps in a loop, then the
> > kernel leaks the "struct l2cap_conn" [5] and "struct hci_conn" [6]
> > objects allocated in "l2cap_conn_add()" [2] and "hci_conn_add()" [3].
> > Those objects are never freed.
> >
> > The "ble-memleak-repro" program reproduces the memory leak, if the
> > kernel is not patched. Its source is in
> > "package/ble-memleak-repro/ble-memleak-repro.c" [18].
>
> We should probably create a test in l2cap-tester using SO_SNDTIMEO
> first, so we can make sure CI test such case and we are able to detect
> if the problem is reintroduced later:
>
> https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
>
> > Memory Leak Fix
> > ===============
> >
> > We have fixed the memory leak, see the patch series in
> > "patches/linux/":
> >
> >  1. The first patch
> >     "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
> >     enables Bluetooth on the DK2.
>
> This above should probably be sent separately.
>
> >  2. The second patch
> >     "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
> >     adds many dynamic debug traces that help understand the kernel
> >     behaviour and freeing problems.
>
> I'm fine with this change, but we better use the likes of bt_dev_dbg
> instead of BT_DBG.
>
> >  3. Patches
> >     "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
> >     and
> >     "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
> >     fix the memory leak.
> >  4. Patches
> >     "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
> >     "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
> >     and
> >     "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
> >     fixes the use-after-free that appears when the "struct l2cap_conn"
> >     [5] and "struct hci_conn" [6] objects are freed.
>
> These I'm not very comfortable applying as they are, I'm afraid there
> could be regressions if they are not accompanied with proper tests,
> besides I think there are less intrusive ways to cleanup l2cap_conn,
> see below.
>
> > The commit messages explain why the commit is needed, which problem
> > the commit solves, and how.
> >
> > The first and second patches are present for the memory leak
> > reproduction only, they should not be part of a final fix.
> >
> > Patch Status
> > ============
> >
> > As of writing, the memory leak is fixed. The fix opened the door to
> > other problems, especially use-after-free, sometimes followed by
> > crashes due to NULL dereferences. We think there are weak references
> > (i.e. pointers that don't increment the refcounter) previously
> > pointing to memory that was never freed, but now is freed. On kernels
> > v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
> > use-after-free and NULL dereferences, but not on kernel v6.5 yet.
>
> I think the problem is that the lifetime of l2cap_conn shall be hooked
> with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
> don't actually trigger l2cap_conn_del, which imo is the culprit here,
> because connect_cfm/disconnect_cfm is not called when the connection
> procedure has been aborted prematurely, so perhaps we shall get rid of
> the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
> cleanup with in the following order:
> hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
> that way we avoid having multiple code paths attempting to cleanup
> objects associated with hci_conn/hci_chan.
>
> I'd change hci_chan_create to take a del callback to avoid having
> circular dependencies on one another though.

Are you still planning to work on these changes, or perhaps you need
more guidance regarding the suggested modifications?

> > Reproducing with Buildroot
> > ==========================
> >
> > We have reproduced and fixed the memory leak with Buildroot [7] and a
> > "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
> >
> > The "ble-memleak-repro" repository [1] contains the sources of a
> > complete external Buildroot customisation [8], with the patches, a
> > README, and more explanations to reproduce the problem with Buildroot
> > on a DK2.
> >
> > References:
> > ===========
> >
> > - [1]: <https://gitlab.com/essensium-mind/ble-memleak-repro.git>
> >        "ble-memleak-repro repository"
> >
> > - [2]: <https://elixir.bootlin.com/linux/v6.5/source/net/bluetooth/l2cap_core.c#L7833>
> >        "l2cap_conn_add()"
> >
> > - [3]: <https://elixir.bootlin.com/linux/v6.5/source/net/bluetooth/hci_conn.c#L986>
> >        "hci_conn_add()"
> >
> > - [4]: <https://www.st.com/content/st_com/en/products/evaluation-tools/product-evaluation-tools/mcu-mpu-eval-tools/stm32-mcu-mpu-eval-tools/stm32-discovery-kits/stm32mp157c-dk2.html>
> >        "STM32MP157C-DK2 (DK2)"
> >
> > - [5]: <https://elixir.bootlin.com/linux/v6.5/source/include/net/bluetooth/l2cap.h#L674>
> >        "struct l2cap_conn"
> >
> > - [6]: <https://elixir.bootlin.com/linux/v6.5/source/include/net/bluetooth/hci_core.h#L685>
> >        "struct hci_conn"
> >
> > - [7]: <https://buildroot.org/>
> >        "Buildroot is a tool to generate embedded Linux systems"
> >
> > - [8]: <https://buildroot.org/downloads/manual/manual.html#outside-br-custom>
> >        "9.2. Keeping customizations outside of Buildroot"
> >
> > - [11]: <patches/linux/0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch>
> >         "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch"
> >
> > - [12]: <patches/linux/0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch>
> >         "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch"
> >
> > - [13]: <patches/linux/0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch>
> >         "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch"
> >
> > - [14]: <patches/linux/0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch>
> >         "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch"
> >
> > - [15]: <patches/linux/0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch>
> >         "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch"
> >
> > - [16]: <patches/linux/0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch>
> >         "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch"
> >
> > - [17]: <patches/linux/0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch>
> >         "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch"
> >
> > - [18]: <https://gitlab.com/essensium-mind/ble-memleak-repro/-/blob/main/package/ble-memleak-repro/ble-memleak-repro.c?ref_type=heads>
> >         "ble-memleak-repro.c"
> >
> > - [19]: <https://lore.kernel.org/linux-bluetooth/0000000000000fd01206046e7410@google.com/T/#u>
> >         "[syzbot] [bluetooth?] memory leak in hci_conn_add (2)"
> >         2023-09-02 23:25:00 -0700
> >
> > ---
> >
> > Christophe Roullier (1):
> >   ARM: dts: stm32: Add Bluetooth (usart2) feature on stm32mp157x
> >
> > Olivier L'Heureux (6):
> >   Bluetooth: add many traces for allocation/free/refcounting
> >   Bluetooth: L2CAP: use refcount on struct l2cap_chan->conn
> >   Bluetooth: L2CAP: free the leaking struct l2cap_conn
> >   Bluetooth: introduce hci_conn_free() for better structure
> >   Bluetooth: L2CAP: inc refcount if reuse struct l2cap_conn
> >   Bluetooth: unlink objects to avoid use-after-free
> >
> >  arch/arm/boot/dts/st/stm32mp157c-dk2.dts | 11 ++++++-
> >  include/net/bluetooth/hci_core.h         |  7 +++--
> >  net/bluetooth/hci_conn.c                 | 18 ++++++++++++
> >  net/bluetooth/hci_sysfs.c                |  8 ++++-
> >  net/bluetooth/l2cap_core.c               | 37 ++++++++++++++++++++----
> >  5 files changed, 72 insertions(+), 9 deletions(-)
> >
> >
> > base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
> > --
> > 2.39.2
> >
>
>
> --
> Luiz Augusto von Dentz
Olivier L'Heureux Sept. 13, 2023, 10:25 p.m. UTC | #3
Hello Luiz,

Thanks for your review.

On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
> Hi Olivier,
> 
> On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
> <olivier.lheureux@mind.be> wrote:
>>
>> Request for Comments
[...]
>>
>> The "ble-memleak-repro" program reproduces the memory leak, if the
>> kernel is not patched. Its source is in
>> "package/ble-memleak-repro/ble-memleak-repro.c" [18].
> 
> We should probably create a test in l2cap-tester using SO_SNDTIMEO
> first, so we can make sure CI test such case and we are able to detect
> if the problem is reintroduced later:
> 
> https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c

I didn't know about "l2cap-tester.c". Agree, it would be great to
integrate my test app into it. I could try, but I don't know the test
framework yet.

>> Memory Leak Fix
>> ===============
>>
>> We have fixed the memory leak, see the patch series in
>> "patches/linux/":
>>
>>   1. The first patch
>>      "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
>>      enables Bluetooth on the DK2.
> 
> This above should probably be sent separately.
> 
>>   2. The second patch
>>      "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
>>      adds many dynamic debug traces that help understand the kernel
>>      behaviour and freeing problems.
> 
> I'm fine with this change, but we better use the likes of bt_dev_dbg
> instead of BT_DBG.

This commit intended to increase the visibility during debugging, and
I was not intending it for a permanent presence in the kernel.
But if you find it useful, I can submit a patch RFC v2 with
"bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
"bt_dev_dbg()" in "l2cap_core.c" yet.

>>      "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
>>      and
>>      "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
>>      fix the memory leak.
>>   4. Patches
>>      "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
>>      "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
>>      and
>>      "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
>>      fixes the use-after-free that appears when the "struct l2cap_conn"
>>      [5] and "struct hci_conn" [6] objects are freed.
> 
> These I'm not very comfortable applying as they are, I'm afraid there
> could be regressions if they are not accompanied with proper tests,
> besides I think there are less intrusive ways to cleanup l2cap_conn,
> see below.
> 
>> The commit messages explain why the commit is needed, which problem
>> the commit solves, and how.
>>
>> The first and second patches are present for the memory leak
>> reproduction only, they should not be part of a final fix.
>>
>> Patch Status
>> ============
>>
>> As of writing, the memory leak is fixed. The fix opened the door to
>> other problems, especially use-after-free, sometimes followed by
>> crashes due to NULL dereferences. We think there are weak references
>> (i.e. pointers that don't increment the refcounter) previously
>> pointing to memory that was never freed, but now is freed. On kernels
>> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
>> use-after-free and NULL dereferences, but not on kernel v6.5 yet.
> 
> I think the problem is that the lifetime of l2cap_conn shall be hooked
> with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
> don't actually trigger l2cap_conn_del, which imo is the culprit here,
> because connect_cfm/disconnect_cfm is not called when the connection
> procedure has been aborted prematurely, so perhaps we shall get rid of
> the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
> cleanup with in the following order:
> hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
> that way we avoid having multiple code paths attempting to cleanup
> objects associated with hci_conn/hci_chan.

I fully agree with your analysis, which correspond to mine: we should
call "l2cap_conn_del()", it would properly clean the allocations in
"l2cap_conn_add()".
I have tried but it was not obvious to find the right place to call
"l2cap_conn_del()" with the proper locking.
As you write, connect_cfm/disconnect_cfm is not called when the
connection is aborted, and that is the root cause of the memory leak.

Your proposal is most probably the best way to go.

> I'd change hci_chan_create to take a del callback to avoid having
> circular dependencies on one another though.

Interesting, could you explain how you would do it? Perhaps point on
a callback example?

>> Reproducing with Buildroot
>> ==========================
>>
>> We have reproduced and fixed the memory leak with Buildroot [7] and a
>> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
>>
>> The "ble-memleak-repro" repository [1] contains the sources of a
>> complete external Buildroot customisation [8], with the patches, a
>> README, and more explanations to reproduce the problem with Buildroot
>> on a DK2.
>>
>> References:
>> ===========
>>
>> - [1]: <https://gitlab.com/essensium-mind/ble-memleak-repro.git>
>>         "ble-memleak-repro repository"
[...]

--
Olivier L'Heureux
Luiz Augusto von Dentz Sept. 14, 2023, 9:17 p.m. UTC | #4
Hi Olivier,

On Wed, Sep 13, 2023 at 3:25 PM Olivier L'Heureux
<olivier.lheureux@mind.be> wrote:
>
> Hello Luiz,
>
> Thanks for your review.
>
> On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
> > Hi Olivier,
> >
> > On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
> > <olivier.lheureux@mind.be> wrote:
> >>
> >> Request for Comments
> [...]
> >>
> >> The "ble-memleak-repro" program reproduces the memory leak, if the
> >> kernel is not patched. Its source is in
> >> "package/ble-memleak-repro/ble-memleak-repro.c" [18].
> >
> > We should probably create a test in l2cap-tester using SO_SNDTIMEO
> > first, so we can make sure CI test such case and we are able to detect
> > if the problem is reintroduced later:
> >
> > https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
>
> I didn't know about "l2cap-tester.c". Agree, it would be great to
> integrate my test app into it. I could try, but I don't know the test
> framework yet.
>
> >> Memory Leak Fix
> >> ===============
> >>
> >> We have fixed the memory leak, see the patch series in
> >> "patches/linux/":
> >>
> >>   1. The first patch
> >>      "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
> >>      enables Bluetooth on the DK2.
> >
> > This above should probably be sent separately.
> >
> >>   2. The second patch
> >>      "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
> >>      adds many dynamic debug traces that help understand the kernel
> >>      behaviour and freeing problems.
> >
> > I'm fine with this change, but we better use the likes of bt_dev_dbg
> > instead of BT_DBG.
>
> This commit intended to increase the visibility during debugging, and
> I was not intending it for a permanent presence in the kernel.
> But if you find it useful, I can submit a patch RFC v2 with
> "bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
> "bt_dev_dbg()" in "l2cap_core.c" yet.
>
> >>      "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
> >>      and
> >>      "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
> >>      fix the memory leak.
> >>   4. Patches
> >>      "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
> >>      "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
> >>      and
> >>      "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
> >>      fixes the use-after-free that appears when the "struct l2cap_conn"
> >>      [5] and "struct hci_conn" [6] objects are freed.
> >
> > These I'm not very comfortable applying as they are, I'm afraid there
> > could be regressions if they are not accompanied with proper tests,
> > besides I think there are less intrusive ways to cleanup l2cap_conn,
> > see below.
> >
> >> The commit messages explain why the commit is needed, which problem
> >> the commit solves, and how.
> >>
> >> The first and second patches are present for the memory leak
> >> reproduction only, they should not be part of a final fix.
> >>
> >> Patch Status
> >> ============
> >>
> >> As of writing, the memory leak is fixed. The fix opened the door to
> >> other problems, especially use-after-free, sometimes followed by
> >> crashes due to NULL dereferences. We think there are weak references
> >> (i.e. pointers that don't increment the refcounter) previously
> >> pointing to memory that was never freed, but now is freed. On kernels
> >> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
> >> use-after-free and NULL dereferences, but not on kernel v6.5 yet.
> >
> > I think the problem is that the lifetime of l2cap_conn shall be hooked
> > with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
> > don't actually trigger l2cap_conn_del, which imo is the culprit here,
> > because connect_cfm/disconnect_cfm is not called when the connection
> > procedure has been aborted prematurely, so perhaps we shall get rid of
> > the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
> > cleanup with in the following order:
> > hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
> > that way we avoid having multiple code paths attempting to cleanup
> > objects associated with hci_conn/hci_chan.
>
> I fully agree with your analysis, which correspond to mine: we should
> call "l2cap_conn_del()", it would properly clean the allocations in
> "l2cap_conn_add()".
> I have tried but it was not obvious to find the right place to call
> "l2cap_conn_del()" with the proper locking.
> As you write, connect_cfm/disconnect_cfm is not called when the
> connection is aborted, and that is the root cause of the memory leak.

Btw, I was trying to reproduce it with the following test set, but at
least kmemleak was not able to detect leaks of l2cap_conn:

https://patchwork.kernel.org/project/bluetooth/list/?series=784343

> Your proposal is most probably the best way to go.
>
> > I'd change hci_chan_create to take a del callback to avoid having
> > circular dependencies on one another though.
>
> Interesting, could you explain how you would do it? Perhaps point on
> a callback example?
>
> >> Reproducing with Buildroot
> >> ==========================
> >>
> >> We have reproduced and fixed the memory leak with Buildroot [7] and a
> >> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
> >>
> >> The "ble-memleak-repro" repository [1] contains the sources of a
> >> complete external Buildroot customisation [8], with the patches, a
> >> README, and more explanations to reproduce the problem with Buildroot
> >> on a DK2.
> >>
> >> References:
> >> ===========
> >>
> >> - [1]: <https://gitlab.com/essensium-mind/ble-memleak-repro.git>
> >>         "ble-memleak-repro repository"
> [...]
>
> --
> Olivier L'Heureux
Marleen Vos Nov. 7, 2023, 8:46 a.m. UTC | #5
Hi Luiz,

Because Olivier has been and still is being swamped with other work, I'm 
kind of trying to take over these patches. So far I can reproduce the 
memleak on a recent kernel without the patches.

Olivier told me you added tests to check for the memleak in 
l2cap-tester. Can you point me towards more info on how exactly you run 
these tests, as the info I find on the web is rather minimal.

 From what I read in the thread, it looks like your tests don't catch 
the memleak?

Kind regards,

Marleen


On 14/09/2023 23:17, Luiz Augusto von Dentz wrote:
> Hi Olivier,
>
> On Wed, Sep 13, 2023 at 3:25 PM Olivier L'Heureux
> <olivier.lheureux@mind.be>  wrote:
>> Hello Luiz,
>>
>> Thanks for your review.
>>
>> On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
>>> Hi Olivier,
>>>
>>> On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
>>> <olivier.lheureux@mind.be>  wrote:
>>>> Request for Comments
>> [...]
>>>> The "ble-memleak-repro" program reproduces the memory leak, if the
>>>> kernel is not patched. Its source is in
>>>> "package/ble-memleak-repro/ble-memleak-repro.c" [18].
>>> We should probably create a test in l2cap-tester using SO_SNDTIMEO
>>> first, so we can make sure CI test such case and we are able to detect
>>> if the problem is reintroduced later:
>>>
>>> https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
>> I didn't know about "l2cap-tester.c". Agree, it would be great to
>> integrate my test app into it. I could try, but I don't know the test
>> framework yet.
>>
>>>> Memory Leak Fix
>>>> ===============
>>>>
>>>> We have fixed the memory leak, see the patch series in
>>>> "patches/linux/":
>>>>
>>>>    1. The first patch
>>>>       "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
>>>>       enables Bluetooth on the DK2.
>>> This above should probably be sent separately.
>>>
>>>>    2. The second patch
>>>>       "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
>>>>       adds many dynamic debug traces that help understand the kernel
>>>>       behaviour and freeing problems.
>>> I'm fine with this change, but we better use the likes of bt_dev_dbg
>>> instead of BT_DBG.
>> This commit intended to increase the visibility during debugging, and
>> I was not intending it for a permanent presence in the kernel.
>> But if you find it useful, I can submit a patch RFC v2 with
>> "bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
>> "bt_dev_dbg()" in "l2cap_core.c" yet.
>>
>>>>       "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
>>>>       and
>>>>       "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
>>>>       fix the memory leak.
>>>>    4. Patches
>>>>       "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
>>>>       "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
>>>>       and
>>>>       "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
>>>>       fixes the use-after-free that appears when the "struct l2cap_conn"
>>>>       [5] and "struct hci_conn" [6] objects are freed.
>>> These I'm not very comfortable applying as they are, I'm afraid there
>>> could be regressions if they are not accompanied with proper tests,
>>> besides I think there are less intrusive ways to cleanup l2cap_conn,
>>> see below.
>>>
>>>> The commit messages explain why the commit is needed, which problem
>>>> the commit solves, and how.
>>>>
>>>> The first and second patches are present for the memory leak
>>>> reproduction only, they should not be part of a final fix.
>>>>
>>>> Patch Status
>>>> ============
>>>>
>>>> As of writing, the memory leak is fixed. The fix opened the door to
>>>> other problems, especially use-after-free, sometimes followed by
>>>> crashes due to NULL dereferences. We think there are weak references
>>>> (i.e. pointers that don't increment the refcounter) previously
>>>> pointing to memory that was never freed, but now is freed. On kernels
>>>> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
>>>> use-after-free and NULL dereferences, but not on kernel v6.5 yet.
>>> I think the problem is that the lifetime of l2cap_conn shall be hooked
>>> with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
>>> don't actually trigger l2cap_conn_del, which imo is the culprit here,
>>> because connect_cfm/disconnect_cfm is not called when the connection
>>> procedure has been aborted prematurely, so perhaps we shall get rid of
>>> the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
>>> cleanup with in the following order:
>>> hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
>>> that way we avoid having multiple code paths attempting to cleanup
>>> objects associated with hci_conn/hci_chan.
>> I fully agree with your analysis, which correspond to mine: we should
>> call "l2cap_conn_del()", it would properly clean the allocations in
>> "l2cap_conn_add()".
>> I have tried but it was not obvious to find the right place to call
>> "l2cap_conn_del()" with the proper locking.
>> As you write, connect_cfm/disconnect_cfm is not called when the
>> connection is aborted, and that is the root cause of the memory leak.
> Btw, I was trying to reproduce it with the following test set, but at
> least kmemleak was not able to detect leaks of l2cap_conn:
>
> https://patchwork.kernel.org/project/bluetooth/list/?series=784343
>
>> Your proposal is most probably the best way to go.
>>
>>> I'd change hci_chan_create to take a del callback to avoid having
>>> circular dependencies on one another though.
>> Interesting, could you explain how you would do it? Perhaps point on
>> a callback example?
>>
>>>> Reproducing with Buildroot
>>>> ==========================
>>>>
>>>> We have reproduced and fixed the memory leak with Buildroot [7] and a
>>>> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
>>>>
>>>> The "ble-memleak-repro" repository [1] contains the sources of a
>>>> complete external Buildroot customisation [8], with the patches, a
>>>> README, and more explanations to reproduce the problem with Buildroot
>>>> on a DK2.
>>>>
>>>> References:
>>>> ===========
>>>>
>>>> - [1]:<https://gitlab.com/essensium-mind/ble-memleak-repro.git>
>>>>          "ble-memleak-repro repository"
>> [...]
>>
>> --
>> Olivier L'Heureux
Luiz Augusto von Dentz Nov. 7, 2023, 3:10 p.m. UTC | #6
Hi Marleen,

On Tue, Nov 7, 2023 at 3:46 AM Marleen Vos <marleen.vos@essensium.com> wrote:
>
> Hi Luiz,
>
> Because Olivier has been and still is being swamped with other work, I'm
> kind of trying to take over these patches. So far I can reproduce the
> memleak on a recent kernel without the patches.
>
> Olivier told me you added tests to check for the memleak in
> l2cap-tester. Can you point me towards more info on how exactly you run
> these tests, as the info I find on the web is rather minimal.
>
>  From what I read in the thread, it looks like your tests don't catch
> the memleak?

I'm currently on vacation, I should be back next week, try checking
l2cap-tester to emulate the leak as you said it was still
reproducible.

> Kind regards,
>
> Marleen
>
>
> On 14/09/2023 23:17, Luiz Augusto von Dentz wrote:
> > Hi Olivier,
> >
> > On Wed, Sep 13, 2023 at 3:25 PM Olivier L'Heureux
> > <olivier.lheureux@mind.be>  wrote:
> >> Hello Luiz,
> >>
> >> Thanks for your review.
> >>
> >> On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
> >>> Hi Olivier,
> >>>
> >>> On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
> >>> <olivier.lheureux@mind.be>  wrote:
> >>>> Request for Comments
> >> [...]
> >>>> The "ble-memleak-repro" program reproduces the memory leak, if the
> >>>> kernel is not patched. Its source is in
> >>>> "package/ble-memleak-repro/ble-memleak-repro.c" [18].
> >>> We should probably create a test in l2cap-tester using SO_SNDTIMEO
> >>> first, so we can make sure CI test such case and we are able to detect
> >>> if the problem is reintroduced later:
> >>>
> >>> https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
> >> I didn't know about "l2cap-tester.c". Agree, it would be great to
> >> integrate my test app into it. I could try, but I don't know the test
> >> framework yet.
> >>
> >>>> Memory Leak Fix
> >>>> ===============
> >>>>
> >>>> We have fixed the memory leak, see the patch series in
> >>>> "patches/linux/":
> >>>>
> >>>>    1. The first patch
> >>>>       "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
> >>>>       enables Bluetooth on the DK2.
> >>> This above should probably be sent separately.
> >>>
> >>>>    2. The second patch
> >>>>       "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
> >>>>       adds many dynamic debug traces that help understand the kernel
> >>>>       behaviour and freeing problems.
> >>> I'm fine with this change, but we better use the likes of bt_dev_dbg
> >>> instead of BT_DBG.
> >> This commit intended to increase the visibility during debugging, and
> >> I was not intending it for a permanent presence in the kernel.
> >> But if you find it useful, I can submit a patch RFC v2 with
> >> "bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
> >> "bt_dev_dbg()" in "l2cap_core.c" yet.
> >>
> >>>>       "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
> >>>>       and
> >>>>       "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
> >>>>       fix the memory leak.
> >>>>    4. Patches
> >>>>       "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
> >>>>       "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
> >>>>       and
> >>>>       "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
> >>>>       fixes the use-after-free that appears when the "struct l2cap_conn"
> >>>>       [5] and "struct hci_conn" [6] objects are freed.
> >>> These I'm not very comfortable applying as they are, I'm afraid there
> >>> could be regressions if they are not accompanied with proper tests,
> >>> besides I think there are less intrusive ways to cleanup l2cap_conn,
> >>> see below.
> >>>
> >>>> The commit messages explain why the commit is needed, which problem
> >>>> the commit solves, and how.
> >>>>
> >>>> The first and second patches are present for the memory leak
> >>>> reproduction only, they should not be part of a final fix.
> >>>>
> >>>> Patch Status
> >>>> ============
> >>>>
> >>>> As of writing, the memory leak is fixed. The fix opened the door to
> >>>> other problems, especially use-after-free, sometimes followed by
> >>>> crashes due to NULL dereferences. We think there are weak references
> >>>> (i.e. pointers that don't increment the refcounter) previously
> >>>> pointing to memory that was never freed, but now is freed. On kernels
> >>>> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
> >>>> use-after-free and NULL dereferences, but not on kernel v6.5 yet.
> >>> I think the problem is that the lifetime of l2cap_conn shall be hooked
> >>> with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
> >>> don't actually trigger l2cap_conn_del, which imo is the culprit here,
> >>> because connect_cfm/disconnect_cfm is not called when the connection
> >>> procedure has been aborted prematurely, so perhaps we shall get rid of
> >>> the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
> >>> cleanup with in the following order:
> >>> hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
> >>> that way we avoid having multiple code paths attempting to cleanup
> >>> objects associated with hci_conn/hci_chan.
> >> I fully agree with your analysis, which correspond to mine: we should
> >> call "l2cap_conn_del()", it would properly clean the allocations in
> >> "l2cap_conn_add()".
> >> I have tried but it was not obvious to find the right place to call
> >> "l2cap_conn_del()" with the proper locking.
> >> As you write, connect_cfm/disconnect_cfm is not called when the
> >> connection is aborted, and that is the root cause of the memory leak.
> > Btw, I was trying to reproduce it with the following test set, but at
> > least kmemleak was not able to detect leaks of l2cap_conn:
> >
> > https://patchwork.kernel.org/project/bluetooth/list/?series=784343
> >
> >> Your proposal is most probably the best way to go.
> >>
> >>> I'd change hci_chan_create to take a del callback to avoid having
> >>> circular dependencies on one another though.
> >> Interesting, could you explain how you would do it? Perhaps point on
> >> a callback example?
> >>
> >>>> Reproducing with Buildroot
> >>>> ==========================
> >>>>
> >>>> We have reproduced and fixed the memory leak with Buildroot [7] and a
> >>>> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
> >>>>
> >>>> The "ble-memleak-repro" repository [1] contains the sources of a
> >>>> complete external Buildroot customisation [8], with the patches, a
> >>>> README, and more explanations to reproduce the problem with Buildroot
> >>>> on a DK2.
> >>>>
> >>>> References:
> >>>> ===========
> >>>>
> >>>> - [1]:<https://gitlab.com/essensium-mind/ble-memleak-repro.git>
> >>>>          "ble-memleak-repro repository"
> >> [...]
> >>
> >> --
> >> Olivier L'Heureux
Luiz Augusto von Dentz Nov. 13, 2023, 2:58 p.m. UTC | #7
Hi Marleen,

On Mon, Nov 13, 2023 at 9:23 AM Marleen Vos <marleen.vos@essensium.com> wrote:
>
> Hi Luiz,
>
> On Tue, Nov 7, 2023 at 4:10 PM Luiz Augusto von Dentz <luiz.dentz@gmail.com> wrote:
>>
>> Hi Marleen,
>>
>> On Tue, Nov 7, 2023 at 3:46 AM Marleen Vos <marleen.vos@essensium.com> wrote:
>> >
>> > Hi Luiz,
>> >
>> > Because Olivier has been and still is being swamped with other work, I'm
>> > kind of trying to take over these patches. So far I can reproduce the
>> > memleak on a recent kernel without the patches.
>> >
>> > Olivier told me you added tests to check for the memleak in
>> > l2cap-tester. Can you point me towards more info on how exactly you run
>> > these tests, as the info I find on the web is rather minimal.
>> >
>> >  From what I read in the thread, it looks like your tests don't catch
>> > the memleak?
>>
>> I'm currently on vacation, I should be back next week, try checking
>> l2cap-tester to emulate the leak as you said it was still
>> reproducible.
>
>
> I hope you had a great vacation!
> Using git bisect, I found that commit 4ab81f16c68a602b2b69e333ae08d8748a9398de is the one where the leaks stop being reproducable by the method we use. At first sight the changes you made seem unrelated to the memleak, so the main question for us is now: has the resolution been accidental or intentional?

So it is not reproducible anymore?

> The commit message says:
> commit 4ab81f16c68a602b2b69e333ae08d8748a9398de
> Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
> Date:   Mon Jun 26 17:25:06 2023 -0700
>
>     Bluetooth: hci_conn: Consolidate code for aborting connections
>
>     [ Upstream commit a13f316e90fdb1fb6df6582e845aa9b3270f3581 ]
>
>     This consolidates code for aborting connections using
>     hci_cmd_sync_queue so it is synchronized with other threads, but
>     because of the fact that some commands may block the cmd_sync_queue
>     while waiting specific events this attempt to cancel those requests by
>     using hci_cmd_sync_cancel.
>
>     Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
>     Stable-dep-of: 94d9ba9f9888 ("Bluetooth: hci_sync: Fix UAF in hci_disconnect_all_sync")
>     Signed-off-by: Sasha Levin <sashal@kernel.org>
>
> Maybe you can shed some light on why this commit seems to fix the leaks?

Well I suppose it is due to the cleanup process freeing l2cap_conn in
the process, while perhaps the previous code didn't.

> Kind regards,
> Marleen
>
>>
>>
>> > Kind regards,
>> >
>> > Marleen
>> >
>> >
>> > On 14/09/2023 23:17, Luiz Augusto von Dentz wrote:
>> > > Hi Olivier,
>> > >
>> > > On Wed, Sep 13, 2023 at 3:25 PM Olivier L'Heureux
>> > > <olivier.lheureux@mind.be>  wrote:
>> > >> Hello Luiz,
>> > >>
>> > >> Thanks for your review.
>> > >>
>> > >> On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
>> > >>> Hi Olivier,
>> > >>>
>> > >>> On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
>> > >>> <olivier.lheureux@mind.be>  wrote:
>> > >>>> Request for Comments
>> > >> [...]
>> > >>>> The "ble-memleak-repro" program reproduces the memory leak, if the
>> > >>>> kernel is not patched. Its source is in
>> > >>>> "package/ble-memleak-repro/ble-memleak-repro.c" [18].
>> > >>> We should probably create a test in l2cap-tester using SO_SNDTIMEO
>> > >>> first, so we can make sure CI test such case and we are able to detect
>> > >>> if the problem is reintroduced later:
>> > >>>
>> > >>> https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
>> > >> I didn't know about "l2cap-tester.c". Agree, it would be great to
>> > >> integrate my test app into it. I could try, but I don't know the test
>> > >> framework yet.
>> > >>
>> > >>>> Memory Leak Fix
>> > >>>> ===============
>> > >>>>
>> > >>>> We have fixed the memory leak, see the patch series in
>> > >>>> "patches/linux/":
>> > >>>>
>> > >>>>    1. The first patch
>> > >>>>       "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
>> > >>>>       enables Bluetooth on the DK2.
>> > >>> This above should probably be sent separately.
>> > >>>
>> > >>>>    2. The second patch
>> > >>>>       "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
>> > >>>>       adds many dynamic debug traces that help understand the kernel
>> > >>>>       behaviour and freeing problems.
>> > >>> I'm fine with this change, but we better use the likes of bt_dev_dbg
>> > >>> instead of BT_DBG.
>> > >> This commit intended to increase the visibility during debugging, and
>> > >> I was not intending it for a permanent presence in the kernel.
>> > >> But if you find it useful, I can submit a patch RFC v2 with
>> > >> "bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
>> > >> "bt_dev_dbg()" in "l2cap_core.c" yet.
>> > >>
>> > >>>>       "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
>> > >>>>       and
>> > >>>>       "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
>> > >>>>       fix the memory leak.
>> > >>>>    4. Patches
>> > >>>>       "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
>> > >>>>       "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
>> > >>>>       and
>> > >>>>       "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
>> > >>>>       fixes the use-after-free that appears when the "struct l2cap_conn"
>> > >>>>       [5] and "struct hci_conn" [6] objects are freed.
>> > >>> These I'm not very comfortable applying as they are, I'm afraid there
>> > >>> could be regressions if they are not accompanied with proper tests,
>> > >>> besides I think there are less intrusive ways to cleanup l2cap_conn,
>> > >>> see below.
>> > >>>
>> > >>>> The commit messages explain why the commit is needed, which problem
>> > >>>> the commit solves, and how.
>> > >>>>
>> > >>>> The first and second patches are present for the memory leak
>> > >>>> reproduction only, they should not be part of a final fix.
>> > >>>>
>> > >>>> Patch Status
>> > >>>> ============
>> > >>>>
>> > >>>> As of writing, the memory leak is fixed. The fix opened the door to
>> > >>>> other problems, especially use-after-free, sometimes followed by
>> > >>>> crashes due to NULL dereferences. We think there are weak references
>> > >>>> (i.e. pointers that don't increment the refcounter) previously
>> > >>>> pointing to memory that was never freed, but now is freed. On kernels
>> > >>>> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
>> > >>>> use-after-free and NULL dereferences, but not on kernel v6.5 yet.
>> > >>> I think the problem is that the lifetime of l2cap_conn shall be hooked
>> > >>> with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
>> > >>> don't actually trigger l2cap_conn_del, which imo is the culprit here,
>> > >>> because connect_cfm/disconnect_cfm is not called when the connection
>> > >>> procedure has been aborted prematurely, so perhaps we shall get rid of
>> > >>> the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
>> > >>> cleanup with in the following order:
>> > >>> hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
>> > >>> that way we avoid having multiple code paths attempting to cleanup
>> > >>> objects associated with hci_conn/hci_chan.
>> > >> I fully agree with your analysis, which correspond to mine: we should
>> > >> call "l2cap_conn_del()", it would properly clean the allocations in
>> > >> "l2cap_conn_add()".
>> > >> I have tried but it was not obvious to find the right place to call
>> > >> "l2cap_conn_del()" with the proper locking.
>> > >> As you write, connect_cfm/disconnect_cfm is not called when the
>> > >> connection is aborted, and that is the root cause of the memory leak.
>> > > Btw, I was trying to reproduce it with the following test set, but at
>> > > least kmemleak was not able to detect leaks of l2cap_conn:
>> > >
>> > > https://patchwork.kernel.org/project/bluetooth/list/?series=784343
>> > >
>> > >> Your proposal is most probably the best way to go.
>> > >>
>> > >>> I'd change hci_chan_create to take a del callback to avoid having
>> > >>> circular dependencies on one another though.
>> > >> Interesting, could you explain how you would do it? Perhaps point on
>> > >> a callback example?
>> > >>
>> > >>>> Reproducing with Buildroot
>> > >>>> ==========================
>> > >>>>
>> > >>>> We have reproduced and fixed the memory leak with Buildroot [7] and a
>> > >>>> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
>> > >>>>
>> > >>>> The "ble-memleak-repro" repository [1] contains the sources of a
>> > >>>> complete external Buildroot customisation [8], with the patches, a
>> > >>>> README, and more explanations to reproduce the problem with Buildroot
>> > >>>> on a DK2.
>> > >>>>
>> > >>>> References:
>> > >>>> ===========
>> > >>>>
>> > >>>> - [1]:<https://gitlab.com/essensium-mind/ble-memleak-repro.git>
>> > >>>>          "ble-memleak-repro repository"
>> > >> [...]
>> > >>
>> > >> --
>> > >> Olivier L'Heureux
>>
>>
>>
>> --
>> Luiz Augusto von Dentz
Marleen Vos Nov. 13, 2023, 3:01 p.m. UTC | #8
Hi Luiz,

On Mon, Nov 13, 2023 at 3:59 PM Luiz Augusto von Dentz
<luiz.dentz@gmail.com> wrote:
>
> Hi Marleen,
>
> On Mon, Nov 13, 2023 at 9:23 AM Marleen Vos <marleen.vos@essensium.com> wrote:
> >
> > Hi Luiz,
> >
> > On Tue, Nov 7, 2023 at 4:10 PM Luiz Augusto von Dentz <luiz.dentz@gmail.com> wrote:
> >>
> >> Hi Marleen,
> >>
> >> On Tue, Nov 7, 2023 at 3:46 AM Marleen Vos <marleen.vos@essensium.com> wrote:
> >> >
> >> > Hi Luiz,
> >> >
> >> > Because Olivier has been and still is being swamped with other work, I'm
> >> > kind of trying to take over these patches. So far I can reproduce the
> >> > memleak on a recent kernel without the patches.
> >> >
> >> > Olivier told me you added tests to check for the memleak in
> >> > l2cap-tester. Can you point me towards more info on how exactly you run
> >> > these tests, as the info I find on the web is rather minimal.
> >> >
> >> >  From what I read in the thread, it looks like your tests don't catch
> >> > the memleak?
> >>
> >> I'm currently on vacation, I should be back next week, try checking
> >> l2cap-tester to emulate the leak as you said it was still
> >> reproducible.
> >
> >
> > I hope you had a great vacation!
> > Using git bisect, I found that commit 4ab81f16c68a602b2b69e333ae08d8748a9398de is the one where the leaks stop being reproducable by the method we use. At first sight the changes you made seem unrelated to the memleak, so the main question for us is now: has the resolution been accidental or intentional?
>
> So it is not reproducible anymore?

No, starting from the mentioned commit, we can no longer reproduce the leak.

>
> > The commit message says:
> > commit 4ab81f16c68a602b2b69e333ae08d8748a9398de
> > Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
> > Date:   Mon Jun 26 17:25:06 2023 -0700
> >
> >     Bluetooth: hci_conn: Consolidate code for aborting connections
> >
> >     [ Upstream commit a13f316e90fdb1fb6df6582e845aa9b3270f3581 ]
> >
> >     This consolidates code for aborting connections using
> >     hci_cmd_sync_queue so it is synchronized with other threads, but
> >     because of the fact that some commands may block the cmd_sync_queue
> >     while waiting specific events this attempt to cancel those requests by
> >     using hci_cmd_sync_cancel.
> >
> >     Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
> >     Stable-dep-of: 94d9ba9f9888 ("Bluetooth: hci_sync: Fix UAF in hci_disconnect_all_sync")
> >     Signed-off-by: Sasha Levin <sashal@kernel.org>
> >
> > Maybe you can shed some light on why this commit seems to fix the leaks?
>
> Well I suppose it is due to the cleanup process freeing l2cap_conn in
> the process, while perhaps the previous code didn't.
>
> > Kind regards,
> > Marleen
> >
> >>
> >>
> >> > Kind regards,
> >> >
> >> > Marleen
> >> >
> >> >
> >> > On 14/09/2023 23:17, Luiz Augusto von Dentz wrote:
> >> > > Hi Olivier,
> >> > >
> >> > > On Wed, Sep 13, 2023 at 3:25 PM Olivier L'Heureux
> >> > > <olivier.lheureux@mind.be>  wrote:
> >> > >> Hello Luiz,
> >> > >>
> >> > >> Thanks for your review.
> >> > >>
> >> > >> On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
> >> > >>> Hi Olivier,
> >> > >>>
> >> > >>> On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
> >> > >>> <olivier.lheureux@mind.be>  wrote:
> >> > >>>> Request for Comments
> >> > >> [...]
> >> > >>>> The "ble-memleak-repro" program reproduces the memory leak, if the
> >> > >>>> kernel is not patched. Its source is in
> >> > >>>> "package/ble-memleak-repro/ble-memleak-repro.c" [18].
> >> > >>> We should probably create a test in l2cap-tester using SO_SNDTIMEO
> >> > >>> first, so we can make sure CI test such case and we are able to detect
> >> > >>> if the problem is reintroduced later:
> >> > >>>
> >> > >>> https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
> >> > >> I didn't know about "l2cap-tester.c". Agree, it would be great to
> >> > >> integrate my test app into it. I could try, but I don't know the test
> >> > >> framework yet.
> >> > >>
> >> > >>>> Memory Leak Fix
> >> > >>>> ===============
> >> > >>>>
> >> > >>>> We have fixed the memory leak, see the patch series in
> >> > >>>> "patches/linux/":
> >> > >>>>
> >> > >>>>    1. The first patch
> >> > >>>>       "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
> >> > >>>>       enables Bluetooth on the DK2.
> >> > >>> This above should probably be sent separately.
> >> > >>>
> >> > >>>>    2. The second patch
> >> > >>>>       "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
> >> > >>>>       adds many dynamic debug traces that help understand the kernel
> >> > >>>>       behaviour and freeing problems.
> >> > >>> I'm fine with this change, but we better use the likes of bt_dev_dbg
> >> > >>> instead of BT_DBG.
> >> > >> This commit intended to increase the visibility during debugging, and
> >> > >> I was not intending it for a permanent presence in the kernel.
> >> > >> But if you find it useful, I can submit a patch RFC v2 with
> >> > >> "bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
> >> > >> "bt_dev_dbg()" in "l2cap_core.c" yet.
> >> > >>
> >> > >>>>       "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
> >> > >>>>       and
> >> > >>>>       "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
> >> > >>>>       fix the memory leak.
> >> > >>>>    4. Patches
> >> > >>>>       "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
> >> > >>>>       "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
> >> > >>>>       and
> >> > >>>>       "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
> >> > >>>>       fixes the use-after-free that appears when the "struct l2cap_conn"
> >> > >>>>       [5] and "struct hci_conn" [6] objects are freed.
> >> > >>> These I'm not very comfortable applying as they are, I'm afraid there
> >> > >>> could be regressions if they are not accompanied with proper tests,
> >> > >>> besides I think there are less intrusive ways to cleanup l2cap_conn,
> >> > >>> see below.
> >> > >>>
> >> > >>>> The commit messages explain why the commit is needed, which problem
> >> > >>>> the commit solves, and how.
> >> > >>>>
> >> > >>>> The first and second patches are present for the memory leak
> >> > >>>> reproduction only, they should not be part of a final fix.
> >> > >>>>
> >> > >>>> Patch Status
> >> > >>>> ============
> >> > >>>>
> >> > >>>> As of writing, the memory leak is fixed. The fix opened the door to
> >> > >>>> other problems, especially use-after-free, sometimes followed by
> >> > >>>> crashes due to NULL dereferences. We think there are weak references
> >> > >>>> (i.e. pointers that don't increment the refcounter) previously
> >> > >>>> pointing to memory that was never freed, but now is freed. On kernels
> >> > >>>> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
> >> > >>>> use-after-free and NULL dereferences, but not on kernel v6.5 yet.
> >> > >>> I think the problem is that the lifetime of l2cap_conn shall be hooked
> >> > >>> with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
> >> > >>> don't actually trigger l2cap_conn_del, which imo is the culprit here,
> >> > >>> because connect_cfm/disconnect_cfm is not called when the connection
> >> > >>> procedure has been aborted prematurely, so perhaps we shall get rid of
> >> > >>> the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
> >> > >>> cleanup with in the following order:
> >> > >>> hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
> >> > >>> that way we avoid having multiple code paths attempting to cleanup
> >> > >>> objects associated with hci_conn/hci_chan.
> >> > >> I fully agree with your analysis, which correspond to mine: we should
> >> > >> call "l2cap_conn_del()", it would properly clean the allocations in
> >> > >> "l2cap_conn_add()".
> >> > >> I have tried but it was not obvious to find the right place to call
> >> > >> "l2cap_conn_del()" with the proper locking.
> >> > >> As you write, connect_cfm/disconnect_cfm is not called when the
> >> > >> connection is aborted, and that is the root cause of the memory leak.
> >> > > Btw, I was trying to reproduce it with the following test set, but at
> >> > > least kmemleak was not able to detect leaks of l2cap_conn:
> >> > >
> >> > > https://patchwork.kernel.org/project/bluetooth/list/?series=784343
> >> > >
> >> > >> Your proposal is most probably the best way to go.
> >> > >>
> >> > >>> I'd change hci_chan_create to take a del callback to avoid having
> >> > >>> circular dependencies on one another though.
> >> > >> Interesting, could you explain how you would do it? Perhaps point on
> >> > >> a callback example?
> >> > >>
> >> > >>>> Reproducing with Buildroot
> >> > >>>> ==========================
> >> > >>>>
> >> > >>>> We have reproduced and fixed the memory leak with Buildroot [7] and a
> >> > >>>> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
> >> > >>>>
> >> > >>>> The "ble-memleak-repro" repository [1] contains the sources of a
> >> > >>>> complete external Buildroot customisation [8], with the patches, a
> >> > >>>> README, and more explanations to reproduce the problem with Buildroot
> >> > >>>> on a DK2.
> >> > >>>>
> >> > >>>> References:
> >> > >>>> ===========
> >> > >>>>
> >> > >>>> - [1]:<https://gitlab.com/essensium-mind/ble-memleak-repro.git>
> >> > >>>>          "ble-memleak-repro repository"
> >> > >> [...]
> >> > >>
> >> > >> --
> >> > >> Olivier L'Heureux
> >>
> >>
> >>
> >> --
> >> Luiz Augusto von Dentz
>
>
>
> --
> Luiz Augusto von Dentz