diff mbox series

[v2] Bluetooth: qca: Fix BT enable failure again for QCA6390 after warm reboot

Message ID 1715866294-1549-1-git-send-email-quic_zijuhu@quicinc.com
State Accepted
Commit d2118673f3ae27f667c0a8690d297cbb7f917e6e
Headers show
Series [v2] Bluetooth: qca: Fix BT enable failure again for QCA6390 after warm reboot | expand

Commit Message

quic_zijuhu May 16, 2024, 1:31 p.m. UTC
Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
serdev") will cause below regression issue:

BT can't be enabled after below steps:
cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
if property enable-gpios is not configured within DT|ACPI for QCA6390.

The commit is to fix a use-after-free issue within qca_serdev_shutdown()
by adding condition to avoid the serdev is flushed or wrote after closed
but also introduces this regression issue regarding above steps since the
VSC is not sent to reset controller during warm reboot.

Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
once BT was ever enabled, and the use-after-free issue is also fixed by
this change since the serdev is still opened before it is flushed or wrote.

Verified by the reported machine Dell XPS 13 9310 laptop over below two
kernel commits:
commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
implementation for QCA") of bluetooth-next tree.
commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
implementation for QCA") of linus mainline tree.

Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
Cc: stable@vger.kernel.org
Reported-by: Wren Turkal <wt@penguintechs.org>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Tested-by: Wren Turkal <wt@penguintechs.org>
---
V1 -> V2: Add comments and more commit messages

V1 discussion link:
https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#t

 drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

Comments

Lk Sii May 21, 2024, 2:50 p.m. UTC | #1
On 2024/5/16 23:55, Luiz Augusto von Dentz wrote:
> Hi,
> 
> On Thu, May 16, 2024 at 10:57 AM Lk Sii <lk_sii@163.com> wrote:
>>
>>
>>
>> On 2024/5/16 21:31, Zijun Hu wrote:
>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>> serdev") will cause below regression issue:
>>>
>>> BT can't be enabled after below steps:
>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>
>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>> but also introduces this regression issue regarding above steps since the
>>> VSC is not sent to reset controller during warm reboot.
>>>
>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>> this change since the serdev is still opened before it is flushed or wrote.
>>>
>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>> kernel commits:
>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>> implementation for QCA") of bluetooth-next tree.
>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>> implementation for QCA") of linus mainline tree.
>>>
>>> Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
>>> Cc: stable@vger.kernel.org
>>> Reported-by: Wren Turkal <wt@penguintechs.org>
>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
>>> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
>>> Tested-by: Wren Turkal <wt@penguintechs.org>
>>> ---
>>> V1 -> V2: Add comments and more commit messages
>>>
>>> V1 discussion link:
>>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#t
>>>
>>>  drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
>>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
>>> index 0c9c9ee56592..9a0bc86f9aac 100644
>>> --- a/drivers/bluetooth/hci_qca.c
>>> +++ b/drivers/bluetooth/hci_qca.c
>>> @@ -2450,15 +2450,27 @@ static void qca_serdev_shutdown(struct device *dev)
>>>       struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
>>>       struct hci_uart *hu = &qcadev->serdev_hu;
>>>       struct hci_dev *hdev = hu->hdev;
>>> -     struct qca_data *qca = hu->priv;
>>>       const u8 ibs_wake_cmd[] = { 0xFD };
>>>       const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
>>>
>>>       if (qcadev->btsoc_type == QCA_QCA6390) {
>>> -             if (test_bit(QCA_BT_OFF, &qca->flags) ||
>>> -                 !test_bit(HCI_RUNNING, &hdev->flags))
>>> +             /* The purpose of sending the VSC is to reset SOC into a initial
>>> +              * state and the state will ensure next hdev->setup() success.
>>> +              * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
>>> +              * hdev->setup() can do its job regardless of SoC state, so
>>> +              * don't need to send the VSC.
>>> +              * if HCI_SETUP is set, it means that hdev->setup() was never
>>> +              * invoked and the SOC is already in the initial state, so
>>> +              * don't also need to send the VSC.
>>> +              */
>>> +             if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
>>> +                 hci_dev_test_flag(hdev, HCI_SETUP))
>>>                       return;
The main purpose of above checking is NOT to make sure the serdev within
open state as its comments explained.
>>>
>>> +             /* The serdev must be in open state when conrol logic arrives
>>> +              * here, so also fix the use-after-free issue caused by that
>>> +              * the serdev is flushed or wrote after it is closed.
>>> +              */
>>>               serdev_device_write_flush(serdev);
>>>               ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
>>>                                             sizeof(ibs_wake_cmd));
>> i believe Zijun's change is able to fix both below issues and don't
>> introduce new issue.
>>
>> regression issue A:  BT enable failure after warm reboot.
>> issue B:  use-after-free issue, namely, kernel crash.
>>
>>
>> For issue B, i have more findings related to below commits ordered by time.
>>
>> Commit A: 7e7bbddd029b ("Bluetooth: hci_qca: Fix qca6390 enable failure
>> after warm reboot")
>>
>> Commit B: de8892df72be ("Bluetooth: hci_serdev: Close UART port if
>> NON_PERSISTENT_SETUP is set")
>> this commit introduces issue B, it is also not suitable to associate
>> protocol state with state of lower level transport type such as serdev
>> or uart, in my opinion, protocol state should be independent with
>> transport type state, flag HCI_UART_PROTO_READY is for protocol state,
>> it means if protocol hu->proto is initialized and if we can invoke its
>> interfaces.it is common for various kinds of transport types. perhaps,
>> this is the reason why Zijun's change doesn't use flag HCI_UART_PROTO_READY.
> 
> Don't really follow you here, if HCI_UART_PROTO_READY indicates the
> protocol state they is even _more_ important to use before invoking
> serdev APIs, so checking for the quirk sound like a problem because:
> 
> [1] hci_uart_close
>      /* When QUIRK HCI_QUIRK_NON_PERSISTENT_SETUP is set by driver,
>      * BT SOC is completely powered OFF during BT OFF, holding port
>      * open may drain the battery.
>      */
>     if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks)) {
>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>         serdev_device_close(hu->serdev);
>     }
> 
> [2] hci_uart_unregister_device
>     if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>         serdev_device_close(hu->serdev);
>     }
>both case 1 and case 2 were introduced by Commit B in question which
uses protocol state flag HCI_UART_PROTO_READY to track lower level
transport type state, i don't think it is perfect.

for common files hci_serdev.c and hci_ldisc.c, as you saw, the purpose
of checking HCI_UART_PROTO_READY is to call protocol relevant
interfaces, moreover, these protocol relevant interfaces do not deal
with lower transport state. you maybe even notice below present function
within which lower level serdev is flushed before HCI_UART_PROTO_READY
is checked:

static int hci_uart_flush(struct hci_dev *hdev)
{
......
        /* Flush any pending characters in the driver and discipline. */
        serdev_device_write_flush(hu->serdev);

        if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
                hu->proto->flush(hu);

        return 0;
}

in my opinion, that is why qca_serdev_shutdown() does not check
HCI_UART_PROTO_READY for later lower level serdev operations.
> So only in case 1 checking the quirk is equivalent to
> HCI_UART_PROTO_READY on case 2 it does actually check the quirk and
> will proceed to call serdev_device_close, now perhaps the code is
> assuming that shutdown won't be called after that, but it looks it
> does since:
> 
qca_serdev_shutdown() will never be called after case 2 as explained
in the end.
> static void serdev_drv_remove(struct device *dev)
> {
>     const struct serdev_device_driver *sdrv =
> to_serdev_device_driver(dev->driver);
>     if (sdrv->remove)
>         sdrv->remove(to_serdev_device(dev));
> 
>     dev_pm_domain_detach(dev, true);
> }
> 
> dev_pm_domain_detach says it will power off so I assume that means
> that shutdown will be called _after_ remove, so not I'm not really
> convinced that we can avoid using HCI_UART_PROTO_READY, in fact the
> following sequence might always be triggering:
> 
dev_pm_domain_detach() should be irrelevant with qca_serdev_shutdown(),
should not trigger call of qca_serdev_shutdown() as explained in the end
> serdev_drv_remove -> qca_serdev_remove -> hci_uart_unregister_device
> -> serdev_device_close -> qca_close -> kfree(qca)
> dev_pm_domain_detach -> ??? -> qca_serdev_shutdown
> 
> If this sequence is correct then qca_serdev_shutdown accessing
> qca_data will always result in a UAF problem.
> 
above sequence should not correct as explained below.

serdev and its driver should also follow below generic device and driver
design.

1)
driver->shutdown() will be called during shut-down time at this time
driver->remove() should not have been called.

2)
driver->shutdown() is impossible to be called once driver->remove()
was called.

3) for serdev, driver->remove() does not trigger call of
driver->shutdown() since PM relevant poweroff is irrelevant with
driver->shutdown() and i also don't find any PM relevant interfaces will
call driver->shutdown().

i would like to explain issue B based on comments Zijun posted by public
as below:

issue B actually happens during reboot and let me look at these steps
boot -> enable BT -> disable BT -> reboot.

1) step boot will call driver->probe() to register hdev and the serdev
is opened after boot.

2) step enable will call hdev->open() and the serdev will still in open
state

3) step disable will call hdev->close() and the serdev will be closed
after hdev->close() for machine with config which results in
HCI_QUIRK_NON_PERSISTENT_SETUP is set.

4) step reboot will call qca_serdev_shutdown() which will flush and
write the serdev which are closed by above step disable, so cause the
UAF issue, namely, kernel crash issue.

so this issue is caused by commit B which close the serdev during
hdev->close().

driver->remove() even is not triggered during above steps.
>> Commit C: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on
>> closed serdev")
>> this commit is to fix issue B which is actually caused by Commit B, but
>> it has Fixes tag for Commit A. and it also introduces the regression
>> issue A.
>>
> 
>
Luiz Augusto von Dentz May 21, 2024, 3:48 p.m. UTC | #2
Hi,

On Tue, May 21, 2024 at 10:52 AM Lk Sii <lk_sii@163.com> wrote:
>
>
>
> On 2024/5/16 23:55, Luiz Augusto von Dentz wrote:
> > Hi,
> >
> > On Thu, May 16, 2024 at 10:57 AM Lk Sii <lk_sii@163.com> wrote:
> >>
> >>
> >>
> >> On 2024/5/16 21:31, Zijun Hu wrote:
> >>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
> >>> serdev") will cause below regression issue:
> >>>
> >>> BT can't be enabled after below steps:
> >>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
> >>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
> >>>
> >>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
> >>> by adding condition to avoid the serdev is flushed or wrote after closed
> >>> but also introduces this regression issue regarding above steps since the
> >>> VSC is not sent to reset controller during warm reboot.
> >>>
> >>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
> >>> once BT was ever enabled, and the use-after-free issue is also fixed by
> >>> this change since the serdev is still opened before it is flushed or wrote.
> >>>
> >>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
> >>> kernel commits:
> >>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
> >>> implementation for QCA") of bluetooth-next tree.
> >>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
> >>> implementation for QCA") of linus mainline tree.
> >>>
> >>> Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
> >>> Cc: stable@vger.kernel.org
> >>> Reported-by: Wren Turkal <wt@penguintechs.org>
> >>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
> >>> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
> >>> Tested-by: Wren Turkal <wt@penguintechs.org>
> >>> ---
> >>> V1 -> V2: Add comments and more commit messages
> >>>
> >>> V1 discussion link:
> >>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#t
> >>>
> >>>  drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
> >>>  1 file changed, 15 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
> >>> index 0c9c9ee56592..9a0bc86f9aac 100644
> >>> --- a/drivers/bluetooth/hci_qca.c
> >>> +++ b/drivers/bluetooth/hci_qca.c
> >>> @@ -2450,15 +2450,27 @@ static void qca_serdev_shutdown(struct device *dev)
> >>>       struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
> >>>       struct hci_uart *hu = &qcadev->serdev_hu;
> >>>       struct hci_dev *hdev = hu->hdev;
> >>> -     struct qca_data *qca = hu->priv;
> >>>       const u8 ibs_wake_cmd[] = { 0xFD };
> >>>       const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
> >>>
> >>>       if (qcadev->btsoc_type == QCA_QCA6390) {
> >>> -             if (test_bit(QCA_BT_OFF, &qca->flags) ||
> >>> -                 !test_bit(HCI_RUNNING, &hdev->flags))
> >>> +             /* The purpose of sending the VSC is to reset SOC into a initial
> >>> +              * state and the state will ensure next hdev->setup() success.
> >>> +              * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
> >>> +              * hdev->setup() can do its job regardless of SoC state, so
> >>> +              * don't need to send the VSC.
> >>> +              * if HCI_SETUP is set, it means that hdev->setup() was never
> >>> +              * invoked and the SOC is already in the initial state, so
> >>> +              * don't also need to send the VSC.
> >>> +              */
> >>> +             if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
> >>> +                 hci_dev_test_flag(hdev, HCI_SETUP))
> >>>                       return;
> The main purpose of above checking is NOT to make sure the serdev within
> open state as its comments explained.
> >>>
> >>> +             /* The serdev must be in open state when conrol logic arrives
> >>> +              * here, so also fix the use-after-free issue caused by that
> >>> +              * the serdev is flushed or wrote after it is closed.
> >>> +              */
> >>>               serdev_device_write_flush(serdev);
> >>>               ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
> >>>                                             sizeof(ibs_wake_cmd));
> >> i believe Zijun's change is able to fix both below issues and don't
> >> introduce new issue.
> >>
> >> regression issue A:  BT enable failure after warm reboot.
> >> issue B:  use-after-free issue, namely, kernel crash.
> >>
> >>
> >> For issue B, i have more findings related to below commits ordered by time.
> >>
> >> Commit A: 7e7bbddd029b ("Bluetooth: hci_qca: Fix qca6390 enable failure
> >> after warm reboot")
> >>
> >> Commit B: de8892df72be ("Bluetooth: hci_serdev: Close UART port if
> >> NON_PERSISTENT_SETUP is set")
> >> this commit introduces issue B, it is also not suitable to associate
> >> protocol state with state of lower level transport type such as serdev
> >> or uart, in my opinion, protocol state should be independent with
> >> transport type state, flag HCI_UART_PROTO_READY is for protocol state,
> >> it means if protocol hu->proto is initialized and if we can invoke its
> >> interfaces.it is common for various kinds of transport types. perhaps,
> >> this is the reason why Zijun's change doesn't use flag HCI_UART_PROTO_READY.
> >
> > Don't really follow you here, if HCI_UART_PROTO_READY indicates the
> > protocol state they is even _more_ important to use before invoking
> > serdev APIs, so checking for the quirk sound like a problem because:
> >
> > [1] hci_uart_close
> >      /* When QUIRK HCI_QUIRK_NON_PERSISTENT_SETUP is set by driver,
> >      * BT SOC is completely powered OFF during BT OFF, holding port
> >      * open may drain the battery.
> >      */
> >     if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks)) {
> >         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
> >         serdev_device_close(hu->serdev);
> >     }
> >
> > [2] hci_uart_unregister_device
> >     if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
> >         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
> >         serdev_device_close(hu->serdev);
> >     }
> >both case 1 and case 2 were introduced by Commit B in question which
> uses protocol state flag HCI_UART_PROTO_READY to track lower level
> transport type state, i don't think it is perfect.
>
> for common files hci_serdev.c and hci_ldisc.c, as you saw, the purpose
> of checking HCI_UART_PROTO_READY is to call protocol relevant
> interfaces, moreover, these protocol relevant interfaces do not deal
> with lower transport state. you maybe even notice below present function
> within which lower level serdev is flushed before HCI_UART_PROTO_READY
> is checked:
>
> static int hci_uart_flush(struct hci_dev *hdev)
> {
> ......
>         /* Flush any pending characters in the driver and discipline. */
>         serdev_device_write_flush(hu->serdev);
>
>         if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
>                 hu->proto->flush(hu);
>
>         return 0;
> }
>
> in my opinion, that is why qca_serdev_shutdown() does not check
> HCI_UART_PROTO_READY for later lower level serdev operations.
> > So only in case 1 checking the quirk is equivalent to
> > HCI_UART_PROTO_READY on case 2 it does actually check the quirk and
> > will proceed to call serdev_device_close, now perhaps the code is
> > assuming that shutdown won't be called after that, but it looks it
> > does since:
> >
> qca_serdev_shutdown() will never be called after case 2 as explained
> in the end.
> > static void serdev_drv_remove(struct device *dev)
> > {
> >     const struct serdev_device_driver *sdrv =
> > to_serdev_device_driver(dev->driver);
> >     if (sdrv->remove)
> >         sdrv->remove(to_serdev_device(dev));
> >
> >     dev_pm_domain_detach(dev, true);
> > }
> >
> > dev_pm_domain_detach says it will power off so I assume that means
> > that shutdown will be called _after_ remove, so not I'm not really
> > convinced that we can avoid using HCI_UART_PROTO_READY, in fact the
> > following sequence might always be triggering:
> >
> dev_pm_domain_detach() should be irrelevant with qca_serdev_shutdown(),
> should not trigger call of qca_serdev_shutdown() as explained in the end
> > serdev_drv_remove -> qca_serdev_remove -> hci_uart_unregister_device
> > -> serdev_device_close -> qca_close -> kfree(qca)
> > dev_pm_domain_detach -> ??? -> qca_serdev_shutdown
> >
> > If this sequence is correct then qca_serdev_shutdown accessing
> > qca_data will always result in a UAF problem.
> >
> above sequence should not correct as explained below.
>
> serdev and its driver should also follow below generic device and driver
> design.
>
> 1)
> driver->shutdown() will be called during shut-down time at this time
> driver->remove() should not have been called.
>
> 2)
> driver->shutdown() is impossible to be called once driver->remove()
> was called.
>
> 3) for serdev, driver->remove() does not trigger call of
> driver->shutdown() since PM relevant poweroff is irrelevant with
> driver->shutdown() and i also don't find any PM relevant interfaces will
> call driver->shutdown().
>
> i would like to explain issue B based on comments Zijun posted by public
> as below:
>
> issue B actually happens during reboot and let me look at these steps
> boot -> enable BT -> disable BT -> reboot.
>
> 1) step boot will call driver->probe() to register hdev and the serdev
> is opened after boot.
>
> 2) step enable will call hdev->open() and the serdev will still in open
> state
>
> 3) step disable will call hdev->close() and the serdev will be closed
> after hdev->close() for machine with config which results in
> HCI_QUIRK_NON_PERSISTENT_SETUP is set.
>
> 4) step reboot will call qca_serdev_shutdown() which will flush and
> write the serdev which are closed by above step disable, so cause the
> UAF issue, namely, kernel crash issue.
>
> so this issue is caused by commit B which close the serdev during
> hdev->close().
>
> driver->remove() even is not triggered during above steps.
> >> Commit C: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on
> >> closed serdev")
> >> this commit is to fix issue B which is actually caused by Commit B, but
> >> it has Fixes tag for Commit A. and it also introduces the regression
> >> issue A.
> >>
> >
> >

Reading again the commit message for the UAF fix it sounds like a
different problem:

    The driver shutdown callback (which sends EDL_SOC_RESET to the device
    over serdev) should not be invoked when HCI device is not open (e.g. if
    hci_dev_open_sync() failed), because the serdev and its TTY are not open
    either.  Also skip this step if device is powered off
    (qca_power_shutdown()).

So if hci_dev_open_sync has failed it says serdev and its TTY will not
be open either, so I guess that's why HCI_SETUP was added as a
condition to bail out? So it seems correct to do that although I'd
change the comments.

@Krzysztof Kozlowski do you still have a test setup for 272970be3dab
("Bluetooth: hci_qca: Fix driver shutdown on closed serdev"), can you
try with these changes?
Krzysztof Kozlowski May 21, 2024, 4 p.m. UTC | #3
On 21/05/2024 17:48, Luiz Augusto von Dentz wrote:
>> driver->remove() even is not triggered during above steps.
>>>> Commit C: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on
>>>> closed serdev")
>>>> this commit is to fix issue B which is actually caused by Commit B, but
>>>> it has Fixes tag for Commit A. and it also introduces the regression
>>>> issue A.
>>>>
>>>
>>>
> 
> Reading again the commit message for the UAF fix it sounds like a
> different problem:
> 
>     The driver shutdown callback (which sends EDL_SOC_RESET to the device
>     over serdev) should not be invoked when HCI device is not open (e.g. if
>     hci_dev_open_sync() failed), because the serdev and its TTY are not open
>     either.  Also skip this step if device is powered off
>     (qca_power_shutdown()).
> 
> So if hci_dev_open_sync has failed it says serdev and its TTY will not
> be open either, so I guess that's why HCI_SETUP was added as a
> condition to bail out? So it seems correct to do that although I'd
> change the comments.
> 
> @Krzysztof Kozlowski do you still have a test setup for 272970be3dab
> ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev"), can you
> try with these changes?

Unfortunately not at the moment, because mainline never had a proper
support for a variant of this Bluetooth/WiFi on our boards, so it was
working with few out of tree patches. I think Bartosz is working on
fixing it via power sequence, but that's not in the mainline yet.

Best regards,
Krzysztof
Krzysztof Kozlowski May 21, 2024, 4:02 p.m. UTC | #4
On 16/05/2024 15:31, Zijun Hu wrote:
> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
> serdev") will cause below regression issue:
> 
> BT can't be enabled after below steps:
> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
> if property enable-gpios is not configured within DT|ACPI for QCA6390.
> 
> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
> by adding condition to avoid the serdev is flushed or wrote after closed
> but also introduces this regression issue regarding above steps since the
> VSC is not sent to reset controller during warm reboot.
> 
> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
> once BT was ever enabled, and the use-after-free issue is also fixed by
> this change since the serdev is still opened before it is flushed or wrote.
> 
> Verified by the reported machine Dell XPS 13 9310 laptop over below two
> kernel commits:

I don't understand how does it solve my question. I asked you: on which
hardware did you, not the reporter, test?

> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
> implementation for QCA") of bluetooth-next tree.
> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
> implementation for QCA") of linus mainline tree.

? Same commit with different hashes? No, it looks like you are working
on some downstream tree with cherry picks.

No, test it on mainline and answer finally, after *five* tries, which
kernel and which hardware did you use for testing this.



Best regards,
Krzysztof
Dmitry Baryshkov May 21, 2024, 7:26 p.m. UTC | #5
On Thu, May 16, 2024 at 09:31:34PM +0800, Zijun Hu wrote:
> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
> serdev") will cause below regression issue:
> 
> BT can't be enabled after below steps:
> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
> if property enable-gpios is not configured within DT|ACPI for QCA6390.

Please provide a trace or a detailed description and analysis, why it is
failing in such a case. Just claiming that it doesn't work or regresses
doesn't help us to understand whether the fix is correct or not. E.g.
the message - expected response - the actual response.

Next. I can only consider it to be hardware bug if there is no sensible
way to reset the BT controller. What if the running OS crashes with the
BT being enabled? Resetting the controller should be done at the powerup
time, if it fails to respond correctly. Which might include sending the
magic command. There is little point in resetting it at the shutdown
time just for the sake of the next boot.

> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
> by adding condition to avoid the serdev is flushed or wrote after closed
> but also introduces this regression issue regarding above steps since the
> VSC is not sent to reset controller during warm reboot.

Is this commit already merged or not? If it is, then the tense is
incorrect. It introduced the regression. My first impression (as of the
non-native speaker) was to consider that this patch introduces a
regression, which left me puzzled for quite a while.

> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
> once BT was ever enabled, and the use-after-free issue is also fixed by
> this change since the serdev is still opened before it is flushed or wrote.

Please see Documentation/process/submitting-patches.txt for a suggested
language.


> Verified by the reported machine Dell XPS 13 9310 laptop over below two
> kernel commits:
> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
> implementation for QCA") of bluetooth-next tree.
> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
> implementation for QCA") of linus mainline tree.

I couldn't parse this phrase. Not to mention that only one of these
commits is a part of the Linus's tree. Maybe you had some other commit
in mind?

> Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
> Cc: stable@vger.kernel.org
> Reported-by: Wren Turkal <wt@penguintechs.org>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
> Tested-by: Wren Turkal <wt@penguintechs.org>
> ---
> V1 -> V2: Add comments and more commit messages
> 
> V1 discussion link:
> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#t
> 
>  drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
> index 0c9c9ee56592..9a0bc86f9aac 100644
> --- a/drivers/bluetooth/hci_qca.c
> +++ b/drivers/bluetooth/hci_qca.c
> @@ -2450,15 +2450,27 @@ static void qca_serdev_shutdown(struct device *dev)
>  	struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
>  	struct hci_uart *hu = &qcadev->serdev_hu;
>  	struct hci_dev *hdev = hu->hdev;
> -	struct qca_data *qca = hu->priv;
>  	const u8 ibs_wake_cmd[] = { 0xFD };
>  	const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
>  
>  	if (qcadev->btsoc_type == QCA_QCA6390) {
> -		if (test_bit(QCA_BT_OFF, &qca->flags) ||
> -		    !test_bit(HCI_RUNNING, &hdev->flags))
> +		/* The purpose of sending the VSC is to reset SOC into a initial
> +		 * state and the state will ensure next hdev->setup() success.
> +		 * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
> +		 * hdev->setup() can do its job regardless of SoC state, so
> +		 * don't need to send the VSC.

As I wrote earlier, sending anything on shutdown has a limited
usability. Send the commands on init instead and drop the shutdown
quirk, unless it is there to prevent BT from draining power while the
system is in shutdown state.

> +		 * if HCI_SETUP is set, it means that hdev->setup() was never
> +		 * invoked and the SOC is already in the initial state, so
> +		 * don't also need to send the VSC.
> +		 */
> +		if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
> +		    hci_dev_test_flag(hdev, HCI_SETUP))
>  			return;
>  
> +		/* The serdev must be in open state when conrol logic arrives
> +		 * here, so also fix the use-after-free issue caused by that
> +		 * the serdev is flushed or wrote after it is closed.
> +		 */
>  		serdev_device_write_flush(serdev);
>  		ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
>  					      sizeof(ibs_wake_cmd));
> -- 
> 2.7.4
>
Lk Sii May 22, 2024, 1:31 p.m. UTC | #6
On 2024/5/21 23:48, Luiz Augusto von Dentz wrote:
> Hi,
> 
> On Tue, May 21, 2024 at 10:52 AM Lk Sii <lk_sii@163.com> wrote:
>>
>>
>>
>> On 2024/5/16 23:55, Luiz Augusto von Dentz wrote:
>>> Hi,
>>>
>>> On Thu, May 16, 2024 at 10:57 AM Lk Sii <lk_sii@163.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/5/16 21:31, Zijun Hu wrote:
>>>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>>>> serdev") will cause below regression issue:
>>>>>
>>>>> BT can't be enabled after below steps:
>>>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>>>
>>>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>>>> but also introduces this regression issue regarding above steps since the
>>>>> VSC is not sent to reset controller during warm reboot.
>>>>>
>>>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>>>> this change since the serdev is still opened before it is flushed or wrote.
>>>>>
>>>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>>>> kernel commits:
>>>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>>>> implementation for QCA") of bluetooth-next tree.
>>>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>>>> implementation for QCA") of linus mainline tree.
>>>>>
>>>>> Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
>>>>> Cc: stable@vger.kernel.org
>>>>> Reported-by: Wren Turkal <wt@penguintechs.org>
>>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
>>>>> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
>>>>> Tested-by: Wren Turkal <wt@penguintechs.org>
>>>>> ---
>>>>> V1 -> V2: Add comments and more commit messages
>>>>>
>>>>> V1 discussion link:
>>>>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#t
>>>>>
>>>>>  drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
>>>>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
>>>>> index 0c9c9ee56592..9a0bc86f9aac 100644
>>>>> --- a/drivers/bluetooth/hci_qca.c
>>>>> +++ b/drivers/bluetooth/hci_qca.c
>>>>> @@ -2450,15 +2450,27 @@ static void qca_serdev_shutdown(struct device *dev)
>>>>>       struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
>>>>>       struct hci_uart *hu = &qcadev->serdev_hu;
>>>>>       struct hci_dev *hdev = hu->hdev;
>>>>> -     struct qca_data *qca = hu->priv;
>>>>>       const u8 ibs_wake_cmd[] = { 0xFD };
>>>>>       const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
>>>>>
>>>>>       if (qcadev->btsoc_type == QCA_QCA6390) {
>>>>> -             if (test_bit(QCA_BT_OFF, &qca->flags) ||
>>>>> -                 !test_bit(HCI_RUNNING, &hdev->flags))
>>>>> +             /* The purpose of sending the VSC is to reset SOC into a initial
>>>>> +              * state and the state will ensure next hdev->setup() success.
>>>>> +              * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
>>>>> +              * hdev->setup() can do its job regardless of SoC state, so
>>>>> +              * don't need to send the VSC.
>>>>> +              * if HCI_SETUP is set, it means that hdev->setup() was never
>>>>> +              * invoked and the SOC is already in the initial state, so
>>>>> +              * don't also need to send the VSC.
>>>>> +              */
>>>>> +             if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
>>>>> +                 hci_dev_test_flag(hdev, HCI_SETUP))
>>>>>                       return;
>> The main purpose of above checking is NOT to make sure the serdev within
>> open state as its comments explained.
>>>>>
>>>>> +             /* The serdev must be in open state when conrol logic arrives
>>>>> +              * here, so also fix the use-after-free issue caused by that
>>>>> +              * the serdev is flushed or wrote after it is closed.
>>>>> +              */
>>>>>               serdev_device_write_flush(serdev);
>>>>>               ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
>>>>>                                             sizeof(ibs_wake_cmd));
>>>> i believe Zijun's change is able to fix both below issues and don't
>>>> introduce new issue.
>>>>
>>>> regression issue A:  BT enable failure after warm reboot.
>>>> issue B:  use-after-free issue, namely, kernel crash.
>>>>
>>>>
>>>> For issue B, i have more findings related to below commits ordered by time.
>>>>
>>>> Commit A: 7e7bbddd029b ("Bluetooth: hci_qca: Fix qca6390 enable failure
>>>> after warm reboot")
>>>>
>>>> Commit B: de8892df72be ("Bluetooth: hci_serdev: Close UART port if
>>>> NON_PERSISTENT_SETUP is set")
>>>> this commit introduces issue B, it is also not suitable to associate
>>>> protocol state with state of lower level transport type such as serdev
>>>> or uart, in my opinion, protocol state should be independent with
>>>> transport type state, flag HCI_UART_PROTO_READY is for protocol state,
>>>> it means if protocol hu->proto is initialized and if we can invoke its
>>>> interfaces.it is common for various kinds of transport types. perhaps,
>>>> this is the reason why Zijun's change doesn't use flag HCI_UART_PROTO_READY.
>>>
>>> Don't really follow you here, if HCI_UART_PROTO_READY indicates the
>>> protocol state they is even _more_ important to use before invoking
>>> serdev APIs, so checking for the quirk sound like a problem because:
>>>
>>> [1] hci_uart_close
>>>      /* When QUIRK HCI_QUIRK_NON_PERSISTENT_SETUP is set by driver,
>>>      * BT SOC is completely powered OFF during BT OFF, holding port
>>>      * open may drain the battery.
>>>      */
>>>     if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks)) {
>>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>>>         serdev_device_close(hu->serdev);
>>>     }
>>>
>>> [2] hci_uart_unregister_device
>>>     if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
>>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>>>         serdev_device_close(hu->serdev);
>>>     }
>>> both case 1 and case 2 were introduced by Commit B in question which
>> uses protocol state flag HCI_UART_PROTO_READY to track lower level
>> transport type state, i don't think it is perfect.
>>
>> for common files hci_serdev.c and hci_ldisc.c, as you saw, the purpose
>> of checking HCI_UART_PROTO_READY is to call protocol relevant
>> interfaces, moreover, these protocol relevant interfaces do not deal
>> with lower transport state. you maybe even notice below present function
>> within which lower level serdev is flushed before HCI_UART_PROTO_READY
>> is checked:
>>
>> static int hci_uart_flush(struct hci_dev *hdev)
>> {
>> ......
>>         /* Flush any pending characters in the driver and discipline. */
>>         serdev_device_write_flush(hu->serdev);
>>
>>         if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
>>                 hu->proto->flush(hu);
>>
>>         return 0;
>> }
>>
>> in my opinion, that is why qca_serdev_shutdown() does not check
>> HCI_UART_PROTO_READY for later lower level serdev operations.
>>> So only in case 1 checking the quirk is equivalent to
>>> HCI_UART_PROTO_READY on case 2 it does actually check the quirk and
>>> will proceed to call serdev_device_close, now perhaps the code is
>>> assuming that shutdown won't be called after that, but it looks it
>>> does since:
>>>
>> qca_serdev_shutdown() will never be called after case 2 as explained
>> in the end.
>>> static void serdev_drv_remove(struct device *dev)
>>> {
>>>     const struct serdev_device_driver *sdrv =
>>> to_serdev_device_driver(dev->driver);
>>>     if (sdrv->remove)
>>>         sdrv->remove(to_serdev_device(dev));
>>>
>>>     dev_pm_domain_detach(dev, true);
>>> }
>>>
>>> dev_pm_domain_detach says it will power off so I assume that means
>>> that shutdown will be called _after_ remove, so not I'm not really
>>> convinced that we can avoid using HCI_UART_PROTO_READY, in fact the
>>> following sequence might always be triggering:
>>>
>> dev_pm_domain_detach() should be irrelevant with qca_serdev_shutdown(),
>> should not trigger call of qca_serdev_shutdown() as explained in the end
>>> serdev_drv_remove -> qca_serdev_remove -> hci_uart_unregister_device
>>> -> serdev_device_close -> qca_close -> kfree(qca)
>>> dev_pm_domain_detach -> ??? -> qca_serdev_shutdown
>>>
>>> If this sequence is correct then qca_serdev_shutdown accessing
>>> qca_data will always result in a UAF problem.
>>>
>> above sequence should not correct as explained below.
>>
>> serdev and its driver should also follow below generic device and driver
>> design.
>>
>> 1)
>> driver->shutdown() will be called during shut-down time at this time
>> driver->remove() should not have been called.
>>
>> 2)
>> driver->shutdown() is impossible to be called once driver->remove()
>> was called.
>>
>> 3) for serdev, driver->remove() does not trigger call of
>> driver->shutdown() since PM relevant poweroff is irrelevant with
>> driver->shutdown() and i also don't find any PM relevant interfaces will
>> call driver->shutdown().
>>
>> i would like to explain issue B based on comments Zijun posted by public
>> as below:
>>
>> issue B actually happens during reboot and let me look at these steps
>> boot -> enable BT -> disable BT -> reboot.
>>
>> 1) step boot will call driver->probe() to register hdev and the serdev
>> is opened after boot.
>>
>> 2) step enable will call hdev->open() and the serdev will still in open
>> state
>>
>> 3) step disable will call hdev->close() and the serdev will be closed
>> after hdev->close() for machine with config which results in
>> HCI_QUIRK_NON_PERSISTENT_SETUP is set.
>>
>> 4) step reboot will call qca_serdev_shutdown() which will flush and
>> write the serdev which are closed by above step disable, so cause the
>> UAF issue, namely, kernel crash issue.
>>
>> so this issue is caused by commit B which close the serdev during
>> hdev->close().
>>
>> driver->remove() even is not triggered during above steps.
>>>> Commit C: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on
>>>> closed serdev")
>>>> this commit is to fix issue B which is actually caused by Commit B, but
>>>> it has Fixes tag for Commit A. and it also introduces the regression
>>>> issue A.
>>>>
>>>
>>>
> 
> Reading again the commit message for the UAF fix it sounds like a
> different problem:
> 
no, the UAF issue commit C fixes should be the same issue descripted by
me previously as explained below:

the UAF issue happened with machine "qualcomm Technologies, Inc.
Robotics RB5 (DT)", the machine uses qca6390 and have property
enable-gpios configured, which will results in that quirk
HCI_QUIRK_NON_PERSISTENT_SETUP is set, so must meet the UAF issue
for normal operation sequences "boot -> enable BT -> disable BT -> reboot".

Actually, only machines which uses QCA6390 and have property
enable-gpios configured will meet the UAF issue as commented by Zijun
with below link
https://lore.kernel.org/linux-bluetooth/9ac11453-b7cf-43f3-8e46-f96e41ef190d@quicinc.com/

>     The driver shutdown callback (which sends EDL_SOC_RESET to the device
>     over serdev) should not be invoked when HCI device is not open (e.g. if
>     hci_dev_open_sync() failed), because the serdev and its TTY are not open
>     either.  Also skip this step if device is powered off
>     (qca_power_shutdown()).
> 
> So if hci_dev_open_sync has failed it says serdev and its TTY will not
> be open either, so I guess that's why HCI_SETUP was added as a
> condition to bail out? So it seems correct to do that although I'd
> change the comments.
> 
i believe hci_dev_open_sync failure should not really happens with the
machine Robotics RB5, the purpose that it is mentioned with commit
message is to illustrate that the serdev in closed state is operated and
causes the UAF issue.

let us assume that hci_dev_open_sync failure -> serdev is not opened ->
UAF issue happens within qca_serdev_shutdown(), then BT will not be
working at all and the commit C is actually a workaroud instead of a fix
since the right approach is to solve the hci_dev_open_sync failure which
happens firstly.


Frankly, only checking quirk HCI_QUIRK_NON_PERSISTENT_SETUP is enough to
fix the UAF issue caused by either "normal operation sequences" or
"hci_dev_open_sync failure".
> @Krzysztof Kozlowski do you still have a test setup for 272970be3dab
> ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev"), can you
> try with these changes?
>
Luiz Augusto von Dentz May 22, 2024, 2:25 p.m. UTC | #7
Hi,

On Wed, May 22, 2024 at 9:33 AM Lk Sii <lk_sii@163.com> wrote:
>
>
>
> On 2024/5/21 23:48, Luiz Augusto von Dentz wrote:
> > Hi,
> >
> > On Tue, May 21, 2024 at 10:52 AM Lk Sii <lk_sii@163.com> wrote:
> >>
> >>
> >>
> >> On 2024/5/16 23:55, Luiz Augusto von Dentz wrote:
> >>> Hi,
> >>>
> >>> On Thu, May 16, 2024 at 10:57 AM Lk Sii <lk_sii@163.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2024/5/16 21:31, Zijun Hu wrote:
> >>>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
> >>>>> serdev") will cause below regression issue:
> >>>>>
> >>>>> BT can't be enabled after below steps:
> >>>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
> >>>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
> >>>>>
> >>>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
> >>>>> by adding condition to avoid the serdev is flushed or wrote after closed
> >>>>> but also introduces this regression issue regarding above steps since the
> >>>>> VSC is not sent to reset controller during warm reboot.
> >>>>>
> >>>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
> >>>>> once BT was ever enabled, and the use-after-free issue is also fixed by
> >>>>> this change since the serdev is still opened before it is flushed or wrote.
> >>>>>
> >>>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
> >>>>> kernel commits:
> >>>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
> >>>>> implementation for QCA") of bluetooth-next tree.
> >>>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
> >>>>> implementation for QCA") of linus mainline tree.
> >>>>>
> >>>>> Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
> >>>>> Cc: stable@vger.kernel.org
> >>>>> Reported-by: Wren Turkal <wt@penguintechs.org>
> >>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
> >>>>> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
> >>>>> Tested-by: Wren Turkal <wt@penguintechs.org>
> >>>>> ---
> >>>>> V1 -> V2: Add comments and more commit messages
> >>>>>
> >>>>> V1 discussion link:
> >>>>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#t
> >>>>>
> >>>>>  drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
> >>>>>  1 file changed, 15 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
> >>>>> index 0c9c9ee56592..9a0bc86f9aac 100644
> >>>>> --- a/drivers/bluetooth/hci_qca.c
> >>>>> +++ b/drivers/bluetooth/hci_qca.c
> >>>>> @@ -2450,15 +2450,27 @@ static void qca_serdev_shutdown(struct device *dev)
> >>>>>       struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
> >>>>>       struct hci_uart *hu = &qcadev->serdev_hu;
> >>>>>       struct hci_dev *hdev = hu->hdev;
> >>>>> -     struct qca_data *qca = hu->priv;
> >>>>>       const u8 ibs_wake_cmd[] = { 0xFD };
> >>>>>       const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
> >>>>>
> >>>>>       if (qcadev->btsoc_type == QCA_QCA6390) {
> >>>>> -             if (test_bit(QCA_BT_OFF, &qca->flags) ||
> >>>>> -                 !test_bit(HCI_RUNNING, &hdev->flags))
> >>>>> +             /* The purpose of sending the VSC is to reset SOC into a initial
> >>>>> +              * state and the state will ensure next hdev->setup() success.
> >>>>> +              * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
> >>>>> +              * hdev->setup() can do its job regardless of SoC state, so
> >>>>> +              * don't need to send the VSC.
> >>>>> +              * if HCI_SETUP is set, it means that hdev->setup() was never
> >>>>> +              * invoked and the SOC is already in the initial state, so
> >>>>> +              * don't also need to send the VSC.
> >>>>> +              */
> >>>>> +             if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
> >>>>> +                 hci_dev_test_flag(hdev, HCI_SETUP))
> >>>>>                       return;
> >> The main purpose of above checking is NOT to make sure the serdev within
> >> open state as its comments explained.
> >>>>>
> >>>>> +             /* The serdev must be in open state when conrol logic arrives
> >>>>> +              * here, so also fix the use-after-free issue caused by that
> >>>>> +              * the serdev is flushed or wrote after it is closed.
> >>>>> +              */
> >>>>>               serdev_device_write_flush(serdev);
> >>>>>               ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
> >>>>>                                             sizeof(ibs_wake_cmd));
> >>>> i believe Zijun's change is able to fix both below issues and don't
> >>>> introduce new issue.
> >>>>
> >>>> regression issue A:  BT enable failure after warm reboot.
> >>>> issue B:  use-after-free issue, namely, kernel crash.
> >>>>
> >>>>
> >>>> For issue B, i have more findings related to below commits ordered by time.
> >>>>
> >>>> Commit A: 7e7bbddd029b ("Bluetooth: hci_qca: Fix qca6390 enable failure
> >>>> after warm reboot")
> >>>>
> >>>> Commit B: de8892df72be ("Bluetooth: hci_serdev: Close UART port if
> >>>> NON_PERSISTENT_SETUP is set")
> >>>> this commit introduces issue B, it is also not suitable to associate
> >>>> protocol state with state of lower level transport type such as serdev
> >>>> or uart, in my opinion, protocol state should be independent with
> >>>> transport type state, flag HCI_UART_PROTO_READY is for protocol state,
> >>>> it means if protocol hu->proto is initialized and if we can invoke its
> >>>> interfaces.it is common for various kinds of transport types. perhaps,
> >>>> this is the reason why Zijun's change doesn't use flag HCI_UART_PROTO_READY.
> >>>
> >>> Don't really follow you here, if HCI_UART_PROTO_READY indicates the
> >>> protocol state they is even _more_ important to use before invoking
> >>> serdev APIs, so checking for the quirk sound like a problem because:
> >>>
> >>> [1] hci_uart_close
> >>>      /* When QUIRK HCI_QUIRK_NON_PERSISTENT_SETUP is set by driver,
> >>>      * BT SOC is completely powered OFF during BT OFF, holding port
> >>>      * open may drain the battery.
> >>>      */
> >>>     if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks)) {
> >>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
> >>>         serdev_device_close(hu->serdev);
> >>>     }
> >>>
> >>> [2] hci_uart_unregister_device
> >>>     if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
> >>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
> >>>         serdev_device_close(hu->serdev);
> >>>     }
> >>> both case 1 and case 2 were introduced by Commit B in question which
> >> uses protocol state flag HCI_UART_PROTO_READY to track lower level
> >> transport type state, i don't think it is perfect.
> >>
> >> for common files hci_serdev.c and hci_ldisc.c, as you saw, the purpose
> >> of checking HCI_UART_PROTO_READY is to call protocol relevant
> >> interfaces, moreover, these protocol relevant interfaces do not deal
> >> with lower transport state. you maybe even notice below present function
> >> within which lower level serdev is flushed before HCI_UART_PROTO_READY
> >> is checked:
> >>
> >> static int hci_uart_flush(struct hci_dev *hdev)
> >> {
> >> ......
> >>         /* Flush any pending characters in the driver and discipline. */
> >>         serdev_device_write_flush(hu->serdev);
> >>
> >>         if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
> >>                 hu->proto->flush(hu);
> >>
> >>         return 0;
> >> }
> >>
> >> in my opinion, that is why qca_serdev_shutdown() does not check
> >> HCI_UART_PROTO_READY for later lower level serdev operations.
> >>> So only in case 1 checking the quirk is equivalent to
> >>> HCI_UART_PROTO_READY on case 2 it does actually check the quirk and
> >>> will proceed to call serdev_device_close, now perhaps the code is
> >>> assuming that shutdown won't be called after that, but it looks it
> >>> does since:
> >>>
> >> qca_serdev_shutdown() will never be called after case 2 as explained
> >> in the end.
> >>> static void serdev_drv_remove(struct device *dev)
> >>> {
> >>>     const struct serdev_device_driver *sdrv =
> >>> to_serdev_device_driver(dev->driver);
> >>>     if (sdrv->remove)
> >>>         sdrv->remove(to_serdev_device(dev));
> >>>
> >>>     dev_pm_domain_detach(dev, true);
> >>> }
> >>>
> >>> dev_pm_domain_detach says it will power off so I assume that means
> >>> that shutdown will be called _after_ remove, so not I'm not really
> >>> convinced that we can avoid using HCI_UART_PROTO_READY, in fact the
> >>> following sequence might always be triggering:
> >>>
> >> dev_pm_domain_detach() should be irrelevant with qca_serdev_shutdown(),
> >> should not trigger call of qca_serdev_shutdown() as explained in the end
> >>> serdev_drv_remove -> qca_serdev_remove -> hci_uart_unregister_device
> >>> -> serdev_device_close -> qca_close -> kfree(qca)
> >>> dev_pm_domain_detach -> ??? -> qca_serdev_shutdown
> >>>
> >>> If this sequence is correct then qca_serdev_shutdown accessing
> >>> qca_data will always result in a UAF problem.
> >>>
> >> above sequence should not correct as explained below.
> >>
> >> serdev and its driver should also follow below generic device and driver
> >> design.
> >>
> >> 1)
> >> driver->shutdown() will be called during shut-down time at this time
> >> driver->remove() should not have been called.
> >>
> >> 2)
> >> driver->shutdown() is impossible to be called once driver->remove()
> >> was called.
> >>
> >> 3) for serdev, driver->remove() does not trigger call of
> >> driver->shutdown() since PM relevant poweroff is irrelevant with
> >> driver->shutdown() and i also don't find any PM relevant interfaces will
> >> call driver->shutdown().
> >>
> >> i would like to explain issue B based on comments Zijun posted by public
> >> as below:
> >>
> >> issue B actually happens during reboot and let me look at these steps
> >> boot -> enable BT -> disable BT -> reboot.
> >>
> >> 1) step boot will call driver->probe() to register hdev and the serdev
> >> is opened after boot.
> >>
> >> 2) step enable will call hdev->open() and the serdev will still in open
> >> state
> >>
> >> 3) step disable will call hdev->close() and the serdev will be closed
> >> after hdev->close() for machine with config which results in
> >> HCI_QUIRK_NON_PERSISTENT_SETUP is set.
> >>
> >> 4) step reboot will call qca_serdev_shutdown() which will flush and
> >> write the serdev which are closed by above step disable, so cause the
> >> UAF issue, namely, kernel crash issue.
> >>
> >> so this issue is caused by commit B which close the serdev during
> >> hdev->close().
> >>
> >> driver->remove() even is not triggered during above steps.
> >>>> Commit C: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on
> >>>> closed serdev")
> >>>> this commit is to fix issue B which is actually caused by Commit B, but
> >>>> it has Fixes tag for Commit A. and it also introduces the regression
> >>>> issue A.
> >>>>
> >>>
> >>>
> >
> > Reading again the commit message for the UAF fix it sounds like a
> > different problem:
> >
> no, the UAF issue commit C fixes should be the same issue descripted by
> me previously as explained below:
>
> the UAF issue happened with machine "qualcomm Technologies, Inc.
> Robotics RB5 (DT)", the machine uses qca6390 and have property
> enable-gpios configured, which will results in that quirk
> HCI_QUIRK_NON_PERSISTENT_SETUP is set, so must meet the UAF issue
> for normal operation sequences "boot -> enable BT -> disable BT -> reboot".

Wait, are you telling me that UAF was wrongly described? Or perhaps
they are just different trees and you don't see the problem because
you are not actually running with the mainline code?

> Actually, only machines which uses QCA6390 and have property
> enable-gpios configured will meet the UAF issue as commented by Zijun
> with below link
> https://lore.kernel.org/linux-bluetooth/9ac11453-b7cf-43f3-8e46-f96e41ef190d@quicinc.com/
>
> >     The driver shutdown callback (which sends EDL_SOC_RESET to the device
> >     over serdev) should not be invoked when HCI device is not open (e.g. if
> >     hci_dev_open_sync() failed), because the serdev and its TTY are not open
> >     either.  Also skip this step if device is powered off
> >     (qca_power_shutdown()).
> >
> > So if hci_dev_open_sync has failed it says serdev and its TTY will not
> > be open either, so I guess that's why HCI_SETUP was added as a
> > condition to bail out? So it seems correct to do that although I'd
> > change the comments.
> >
> i believe hci_dev_open_sync failure should not really happens with the
> machine Robotics RB5, the purpose that it is mentioned with commit
> message is to illustrate that the serdev in closed state is operated and
> causes the UAF issue.

Sorry but Im not really following, there seems to be instances where
qca driver will fail on hci_dev_open_sync, now the shutdown sequence
is only done for a specific model, not a machine like Robotics RB5,
btw you are not really contributing to solving the problem if you are
throwing us back in different directions like this.

> let us assume that hci_dev_open_sync failure -> serdev is not opened ->
> UAF issue happens within qca_serdev_shutdown(), then BT will not be
> working at all and the commit C is actually a workaroud instead of a fix
> since the right approach is to solve the hci_dev_open_sync failure which
> happens firstly.
>
>
> Frankly, only checking quirk HCI_QUIRK_NON_PERSISTENT_SETUP is enough to
> fix the UAF issue caused by either "normal operation sequences" or
> "hci_dev_open_sync failure".

Ok, just stop contributing to this thread, you don't know what you are
talking about, HCI_QUIRK_NON_PERSISTENT_SETUP has nothing to do with
serdev state.

> > @Krzysztof Kozlowski do you still have a test setup for 272970be3dab
> > ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev"), can you
> > try with these changes?
> >
>
Lk Sii May 22, 2024, 11:25 p.m. UTC | #8
On 2024/5/22 22:25, Luiz Augusto von Dentz wrote:
> Hi,
> 
> On Wed, May 22, 2024 at 9:33 AM Lk Sii <lk_sii@163.com> wrote:
>>
>>
>>
>> On 2024/5/21 23:48, Luiz Augusto von Dentz wrote:
>>> Hi,
>>>
>>> On Tue, May 21, 2024 at 10:52 AM Lk Sii <lk_sii@163.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/5/16 23:55, Luiz Augusto von Dentz wrote:
>>>>> Hi,
>>>>>
>>>>> On Thu, May 16, 2024 at 10:57 AM Lk Sii <lk_sii@163.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/5/16 21:31, Zijun Hu wrote:
>>>>>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>>>>>> serdev") will cause below regression issue:
>>>>>>>
>>>>>>> BT can't be enabled after below steps:
>>>>>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>>>>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>>>>>
>>>>>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>>>>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>>>>>> but also introduces this regression issue regarding above steps since the
>>>>>>> VSC is not sent to reset controller during warm reboot.
>>>>>>>
>>>>>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>>>>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>>>>>> this change since the serdev is still opened before it is flushed or wrote.
>>>>>>>
>>>>>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>>>>>> kernel commits:
>>>>>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>>>>>> implementation for QCA") of bluetooth-next tree.
>>>>>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>>>>>> implementation for QCA") of linus mainline tree.
>>>>>>>
>>>>>>> Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
>>>>>>> Cc: stable@vger.kernel.org
>>>>>>> Reported-by: Wren Turkal <wt@penguintechs.org>
>>>>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
>>>>>>> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
>>>>>>> Tested-by: Wren Turkal <wt@penguintechs.org>
>>>>>>> ---
>>>>>>> V1 -> V2: Add comments and more commit messages
>>>>>>>
>>>>>>> V1 discussion link:
>>>>>>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#t
>>>>>>>
>>>>>>>  drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
>>>>>>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
>>>>>>> index 0c9c9ee56592..9a0bc86f9aac 100644
>>>>>>> --- a/drivers/bluetooth/hci_qca.c
>>>>>>> +++ b/drivers/bluetooth/hci_qca.c
>>>>>>> @@ -2450,15 +2450,27 @@ static void qca_serdev_shutdown(struct device *dev)
>>>>>>>       struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
>>>>>>>       struct hci_uart *hu = &qcadev->serdev_hu;
>>>>>>>       struct hci_dev *hdev = hu->hdev;
>>>>>>> -     struct qca_data *qca = hu->priv;
>>>>>>>       const u8 ibs_wake_cmd[] = { 0xFD };
>>>>>>>       const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
>>>>>>>
>>>>>>>       if (qcadev->btsoc_type == QCA_QCA6390) {
>>>>>>> -             if (test_bit(QCA_BT_OFF, &qca->flags) ||
>>>>>>> -                 !test_bit(HCI_RUNNING, &hdev->flags))
>>>>>>> +             /* The purpose of sending the VSC is to reset SOC into a initial
>>>>>>> +              * state and the state will ensure next hdev->setup() success.
>>>>>>> +              * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
>>>>>>> +              * hdev->setup() can do its job regardless of SoC state, so
>>>>>>> +              * don't need to send the VSC.
>>>>>>> +              * if HCI_SETUP is set, it means that hdev->setup() was never
>>>>>>> +              * invoked and the SOC is already in the initial state, so
>>>>>>> +              * don't also need to send the VSC.
>>>>>>> +              */
>>>>>>> +             if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
>>>>>>> +                 hci_dev_test_flag(hdev, HCI_SETUP))
>>>>>>>                       return;
>>>> The main purpose of above checking is NOT to make sure the serdev within
>>>> open state as its comments explained.
>>>>>>>
>>>>>>> +             /* The serdev must be in open state when conrol logic arrives
>>>>>>> +              * here, so also fix the use-after-free issue caused by that
>>>>>>> +              * the serdev is flushed or wrote after it is closed.
>>>>>>> +              */
>>>>>>>               serdev_device_write_flush(serdev);
>>>>>>>               ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
>>>>>>>                                             sizeof(ibs_wake_cmd));
>>>>>> i believe Zijun's change is able to fix both below issues and don't
>>>>>> introduce new issue.
>>>>>>
>>>>>> regression issue A:  BT enable failure after warm reboot.
>>>>>> issue B:  use-after-free issue, namely, kernel crash.
>>>>>>
>>>>>>
>>>>>> For issue B, i have more findings related to below commits ordered by time.
>>>>>>
>>>>>> Commit A: 7e7bbddd029b ("Bluetooth: hci_qca: Fix qca6390 enable failure
>>>>>> after warm reboot")
>>>>>>
>>>>>> Commit B: de8892df72be ("Bluetooth: hci_serdev: Close UART port if
>>>>>> NON_PERSISTENT_SETUP is set")
>>>>>> this commit introduces issue B, it is also not suitable to associate
>>>>>> protocol state with state of lower level transport type such as serdev
>>>>>> or uart, in my opinion, protocol state should be independent with
>>>>>> transport type state, flag HCI_UART_PROTO_READY is for protocol state,
>>>>>> it means if protocol hu->proto is initialized and if we can invoke its
>>>>>> interfaces.it is common for various kinds of transport types. perhaps,
>>>>>> this is the reason why Zijun's change doesn't use flag HCI_UART_PROTO_READY.
>>>>>
>>>>> Don't really follow you here, if HCI_UART_PROTO_READY indicates the
>>>>> protocol state they is even _more_ important to use before invoking
>>>>> serdev APIs, so checking for the quirk sound like a problem because:
>>>>>
>>>>> [1] hci_uart_close
>>>>>      /* When QUIRK HCI_QUIRK_NON_PERSISTENT_SETUP is set by driver,
>>>>>      * BT SOC is completely powered OFF during BT OFF, holding port
>>>>>      * open may drain the battery.
>>>>>      */
>>>>>     if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks)) {
>>>>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>>>>>         serdev_device_close(hu->serdev);
>>>>>     }
>>>>>
>>>>> [2] hci_uart_unregister_device
>>>>>     if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
>>>>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>>>>>         serdev_device_close(hu->serdev);
>>>>>     }
>>>>> both case 1 and case 2 were introduced by Commit B in question which
>>>> uses protocol state flag HCI_UART_PROTO_READY to track lower level
>>>> transport type state, i don't think it is perfect.
>>>>
>>>> for common files hci_serdev.c and hci_ldisc.c, as you saw, the purpose
>>>> of checking HCI_UART_PROTO_READY is to call protocol relevant
>>>> interfaces, moreover, these protocol relevant interfaces do not deal
>>>> with lower transport state. you maybe even notice below present function
>>>> within which lower level serdev is flushed before HCI_UART_PROTO_READY
>>>> is checked:
>>>>
>>>> static int hci_uart_flush(struct hci_dev *hdev)
>>>> {
>>>> ......
>>>>         /* Flush any pending characters in the driver and discipline. */
>>>>         serdev_device_write_flush(hu->serdev);
>>>>
>>>>         if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
>>>>                 hu->proto->flush(hu);
>>>>
>>>>         return 0;
>>>> }
>>>>
>>>> in my opinion, that is why qca_serdev_shutdown() does not check
>>>> HCI_UART_PROTO_READY for later lower level serdev operations.
>>>>> So only in case 1 checking the quirk is equivalent to
>>>>> HCI_UART_PROTO_READY on case 2 it does actually check the quirk and
>>>>> will proceed to call serdev_device_close, now perhaps the code is
>>>>> assuming that shutdown won't be called after that, but it looks it
>>>>> does since:
>>>>>
>>>> qca_serdev_shutdown() will never be called after case 2 as explained
>>>> in the end.
>>>>> static void serdev_drv_remove(struct device *dev)
>>>>> {
>>>>>     const struct serdev_device_driver *sdrv =
>>>>> to_serdev_device_driver(dev->driver);
>>>>>     if (sdrv->remove)
>>>>>         sdrv->remove(to_serdev_device(dev));
>>>>>
>>>>>     dev_pm_domain_detach(dev, true);
>>>>> }
>>>>>
>>>>> dev_pm_domain_detach says it will power off so I assume that means
>>>>> that shutdown will be called _after_ remove, so not I'm not really
>>>>> convinced that we can avoid using HCI_UART_PROTO_READY, in fact the
>>>>> following sequence might always be triggering:
>>>>>
>>>> dev_pm_domain_detach() should be irrelevant with qca_serdev_shutdown(),
>>>> should not trigger call of qca_serdev_shutdown() as explained in the end
>>>>> serdev_drv_remove -> qca_serdev_remove -> hci_uart_unregister_device
>>>>> -> serdev_device_close -> qca_close -> kfree(qca)
>>>>> dev_pm_domain_detach -> ??? -> qca_serdev_shutdown
>>>>>
>>>>> If this sequence is correct then qca_serdev_shutdown accessing
>>>>> qca_data will always result in a UAF problem.
>>>>>
>>>> above sequence should not correct as explained below.
>>>>
>>>> serdev and its driver should also follow below generic device and driver
>>>> design.
>>>>
>>>> 1)
>>>> driver->shutdown() will be called during shut-down time at this time
>>>> driver->remove() should not have been called.
>>>>
>>>> 2)
>>>> driver->shutdown() is impossible to be called once driver->remove()
>>>> was called.
>>>>
>>>> 3) for serdev, driver->remove() does not trigger call of
>>>> driver->shutdown() since PM relevant poweroff is irrelevant with
>>>> driver->shutdown() and i also don't find any PM relevant interfaces will
>>>> call driver->shutdown().
>>>>
>>>> i would like to explain issue B based on comments Zijun posted by public
>>>> as below:
>>>>
>>>> issue B actually happens during reboot and let me look at these steps
>>>> boot -> enable BT -> disable BT -> reboot.
>>>>
>>>> 1) step boot will call driver->probe() to register hdev and the serdev
>>>> is opened after boot.
>>>>
>>>> 2) step enable will call hdev->open() and the serdev will still in open
>>>> state
>>>>
>>>> 3) step disable will call hdev->close() and the serdev will be closed
>>>> after hdev->close() for machine with config which results in
>>>> HCI_QUIRK_NON_PERSISTENT_SETUP is set.
>>>>
>>>> 4) step reboot will call qca_serdev_shutdown() which will flush and
>>>> write the serdev which are closed by above step disable, so cause the
>>>> UAF issue, namely, kernel crash issue.
>>>>
>>>> so this issue is caused by commit B which close the serdev during
>>>> hdev->close().
>>>>
>>>> driver->remove() even is not triggered during above steps.
>>>>>> Commit C: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on
>>>>>> closed serdev")
>>>>>> this commit is to fix issue B which is actually caused by Commit B, but
>>>>>> it has Fixes tag for Commit A. and it also introduces the regression
>>>>>> issue A.
>>>>>>
>>>>>
>>>>>
>>>
>>> Reading again the commit message for the UAF fix it sounds like a
>>> different problem:
>>>
>> no, the UAF issue commit C fixes should be the same issue descripted by
>> me previously as explained below:
>>
>> the UAF issue happened with machine "qualcomm Technologies, Inc.
>> Robotics RB5 (DT)", the machine uses qca6390 and have property
>> enable-gpios configured, which will results in that quirk
>> HCI_QUIRK_NON_PERSISTENT_SETUP is set, so must meet the UAF issue
>> for normal operation sequences "boot -> enable BT -> disable BT -> reboot".
> 
> Wait, are you telling me that UAF was wrongly described? Or perhaps
> they are just different trees and you don't see the problem because
> you are not actually running with the mainline code?
> 
sorry, please ignore my last comment to simply discussion
>> Actually, only machines which uses QCA6390 and have property
>> enable-gpios configured will meet the UAF issue as commented by Zijun
>> with below link
>> https://lore.kernel.org/linux-bluetooth/9ac11453-b7cf-43f3-8e46-f96e41ef190d@quicinc.com/
>>
>>>     The driver shutdown callback (which sends EDL_SOC_RESET to the device
>>>     over serdev) should not be invoked when HCI device is not open (e.g. if
>>>     hci_dev_open_sync() failed), because the serdev and its TTY are not open
>>>     either.  Also skip this step if device is powered off
>>>     (qca_power_shutdown()).
>>>
>>> So if hci_dev_open_sync has failed it says serdev and its TTY will not
>>> be open either, so I guess that's why HCI_SETUP was added as a
>>> condition to bail out? So it seems correct to do that although I'd
>>> change the comments.
>>>
>> i believe hci_dev_open_sync failure should not really happens with the
>> machine Robotics RB5, the purpose that it is mentioned with commit
>> message is to illustrate that the serdev in closed state is operated and
>> causes the UAF issue.
> 
> Sorry but Im not really following, there seems to be instances where
> qca driver will fail on hci_dev_open_sync, now the shutdown sequence
> is only done for a specific model, not a machine like Robotics RB5,
> btw you are not really contributing to solving the problem if you are
> throwing us back in different directions like this.
> 
sorry, please ignore my last comment to simply discussion
>> let us assume that hci_dev_open_sync failure -> serdev is not opened ->
>> UAF issue happens within qca_serdev_shutdown(), then BT will not be
>> working at all and the commit C is actually a workaroud instead of a fix
>> since the right approach is to solve the hci_dev_open_sync failure which
>> happens firstly.
>>
>>
>> Frankly, only checking quirk HCI_QUIRK_NON_PERSISTENT_SETUP is enough to
>> fix the UAF issue caused by either "normal operation sequences" or
>> "hci_dev_open_sync failure".
> 
> Ok, just stop contributing to this thread, you don't know what you are
> talking about, HCI_QUIRK_NON_PERSISTENT_SETUP has nothing to do with
> serdev state.
> 
okay, please ignore my last comment to simply discussion about Zijun's
change.
>>> @Krzysztof Kozlowski do you still have a test setup for 272970be3dab
>>> ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev"), can you
>>> try with these changes?
>>>
>>
> 
>
Lk Sii May 23, 2024, 12:17 a.m. UTC | #9
On 2024/5/21 23:48, Luiz Augusto von Dentz wrote:
> Hi,
> 
> On Tue, May 21, 2024 at 10:52 AM Lk Sii <lk_sii@163.com> wrote:
>>
>>
>>
>> On 2024/5/16 23:55, Luiz Augusto von Dentz wrote:
>>> Hi,
>>>
>>> On Thu, May 16, 2024 at 10:57 AM Lk Sii <lk_sii@163.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/5/16 21:31, Zijun Hu wrote:
>>>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>>>> serdev") will cause below regression issue:
>>>>>
>>>>> BT can't be enabled after below steps:
>>>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>>>
>>>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>>>> but also introduces this regression issue regarding above steps since the
>>>>> VSC is not sent to reset controller during warm reboot.
>>>>>
>>>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>>>> this change since the serdev is still opened before it is flushed or wrote.
>>>>>
>>>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>>>> kernel commits:
>>>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>>>> implementation for QCA") of bluetooth-next tree.
>>>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>>>> implementation for QCA") of linus mainline tree.
>>>>>
>>>>> Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
>>>>> Cc: stable@vger.kernel.org
>>>>> Reported-by: Wren Turkal <wt@penguintechs.org>
>>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
>>>>> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
>>>>> Tested-by: Wren Turkal <wt@penguintechs.org>
>>>>> ---
>>>>> V1 -> V2: Add comments and more commit messages
>>>>>
>>>>> V1 discussion link:
>>>>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#t
>>>>>
>>>>>  drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
>>>>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
>>>>> index 0c9c9ee56592..9a0bc86f9aac 100644
>>>>> --- a/drivers/bluetooth/hci_qca.c
>>>>> +++ b/drivers/bluetooth/hci_qca.c
>>>>> @@ -2450,15 +2450,27 @@ static void qca_serdev_shutdown(struct device *dev)
>>>>>       struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
>>>>>       struct hci_uart *hu = &qcadev->serdev_hu;
>>>>>       struct hci_dev *hdev = hu->hdev;
>>>>> -     struct qca_data *qca = hu->priv;
>>>>>       const u8 ibs_wake_cmd[] = { 0xFD };
>>>>>       const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
>>>>>
>>>>>       if (qcadev->btsoc_type == QCA_QCA6390) {
>>>>> -             if (test_bit(QCA_BT_OFF, &qca->flags) ||
>>>>> -                 !test_bit(HCI_RUNNING, &hdev->flags))
>>>>> +             /* The purpose of sending the VSC is to reset SOC into a initial
>>>>> +              * state and the state will ensure next hdev->setup() success.
>>>>> +              * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
>>>>> +              * hdev->setup() can do its job regardless of SoC state, so
>>>>> +              * don't need to send the VSC.
>>>>> +              * if HCI_SETUP is set, it means that hdev->setup() was never
>>>>> +              * invoked and the SOC is already in the initial state, so
>>>>> +              * don't also need to send the VSC.
>>>>> +              */
>>>>> +             if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
>>>>> +                 hci_dev_test_flag(hdev, HCI_SETUP))
>>>>>                       return;
>> The main purpose of above checking is NOT to make sure the serdev within
>> open state as its comments explained.
>>>>>
>>>>> +             /* The serdev must be in open state when conrol logic arrives
>>>>> +              * here, so also fix the use-after-free issue caused by that
>>>>> +              * the serdev is flushed or wrote after it is closed.
>>>>> +              */
>>>>>               serdev_device_write_flush(serdev);
>>>>>               ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
>>>>>                                             sizeof(ibs_wake_cmd));
>>>> i believe Zijun's change is able to fix both below issues and don't
>>>> introduce new issue.
>>>>
>>>> regression issue A:  BT enable failure after warm reboot.
>>>> issue B:  use-after-free issue, namely, kernel crash.
>>>>
>>>>
>>>> For issue B, i have more findings related to below commits ordered by time.
>>>>
>>>> Commit A: 7e7bbddd029b ("Bluetooth: hci_qca: Fix qca6390 enable failure
>>>> after warm reboot")
>>>>
>>>> Commit B: de8892df72be ("Bluetooth: hci_serdev: Close UART port if
>>>> NON_PERSISTENT_SETUP is set")
>>>> this commit introduces issue B, it is also not suitable to associate
>>>> protocol state with state of lower level transport type such as serdev
>>>> or uart, in my opinion, protocol state should be independent with
>>>> transport type state, flag HCI_UART_PROTO_READY is for protocol state,
>>>> it means if protocol hu->proto is initialized and if we can invoke its
>>>> interfaces.it is common for various kinds of transport types. perhaps,
>>>> this is the reason why Zijun's change doesn't use flag HCI_UART_PROTO_READY.
>>>
>>> Don't really follow you here, if HCI_UART_PROTO_READY indicates the
>>> protocol state they is even _more_ important to use before invoking
>>> serdev APIs, so checking for the quirk sound like a problem because:
>>>
>>> [1] hci_uart_close
>>>      /* When QUIRK HCI_QUIRK_NON_PERSISTENT_SETUP is set by driver,
>>>      * BT SOC is completely powered OFF during BT OFF, holding port
>>>      * open may drain the battery.
>>>      */
>>>     if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks)) {
>>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>>>         serdev_device_close(hu->serdev);
>>>     }
>>>
>>> [2] hci_uart_unregister_device
>>>     if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
>>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>>>         serdev_device_close(hu->serdev);
>>>     }
>>> both case 1 and case 2 were introduced by Commit B in question which
>> uses protocol state flag HCI_UART_PROTO_READY to track lower level
>> transport type state, i don't think it is perfect.
>>
>> for common files hci_serdev.c and hci_ldisc.c, as you saw, the purpose
>> of checking HCI_UART_PROTO_READY is to call protocol relevant
>> interfaces, moreover, these protocol relevant interfaces do not deal
>> with lower transport state. you maybe even notice below present function
>> within which lower level serdev is flushed before HCI_UART_PROTO_READY
>> is checked:
>>
>> static int hci_uart_flush(struct hci_dev *hdev)
>> {
>> ......
>>         /* Flush any pending characters in the driver and discipline. */
>>         serdev_device_write_flush(hu->serdev);
>>
>>         if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
>>                 hu->proto->flush(hu);
>>
>>         return 0;
>> }
>>
>> in my opinion, that is why qca_serdev_shutdown() does not check
>> HCI_UART_PROTO_READY for later lower level serdev operations.
>>> So only in case 1 checking the quirk is equivalent to
>>> HCI_UART_PROTO_READY on case 2 it does actually check the quirk and
>>> will proceed to call serdev_device_close, now perhaps the code is
>>> assuming that shutdown won't be called after that, but it looks it
>>> does since:
>>>
>> qca_serdev_shutdown() will never be called after case 2 as explained
>> in the end.
>>> static void serdev_drv_remove(struct device *dev)
>>> {
>>>     const struct serdev_device_driver *sdrv =
>>> to_serdev_device_driver(dev->driver);
>>>     if (sdrv->remove)
>>>         sdrv->remove(to_serdev_device(dev));
>>>
>>>     dev_pm_domain_detach(dev, true);
>>> }
>>>
>>> dev_pm_domain_detach says it will power off so I assume that means
>>> that shutdown will be called _after_ remove, so not I'm not really
>>> convinced that we can avoid using HCI_UART_PROTO_READY, in fact the
>>> following sequence might always be triggering:
>>>
>> dev_pm_domain_detach() should be irrelevant with qca_serdev_shutdown(),
>> should not trigger call of qca_serdev_shutdown() as explained in the end
>>> serdev_drv_remove -> qca_serdev_remove -> hci_uart_unregister_device
>>> -> serdev_device_close -> qca_close -> kfree(qca)
>>> dev_pm_domain_detach -> ??? -> qca_serdev_shutdown
>>>
>>> If this sequence is correct then qca_serdev_shutdown accessing
>>> qca_data will always result in a UAF problem.
>>>
>> above sequence should not correct as explained below.
>>
>> serdev and its driver should also follow below generic device and driver
>> design.
>>
>> 1)
>> driver->shutdown() will be called during shut-down time at this time
>> driver->remove() should not have been called.
>>
>> 2)
>> driver->shutdown() is impossible to be called once driver->remove()
>> was called.
>>
>> 3) for serdev, driver->remove() does not trigger call of
>> driver->shutdown() since PM relevant poweroff is irrelevant with
>> driver->shutdown() and i also don't find any PM relevant interfaces will
>> call driver->shutdown().
>>
>> i would like to explain issue B based on comments Zijun posted by public
>> as below:
>>
>> issue B actually happens during reboot and let me look at these steps
>> boot -> enable BT -> disable BT -> reboot.
>>
>> 1) step boot will call driver->probe() to register hdev and the serdev
>> is opened after boot.
>>
>> 2) step enable will call hdev->open() and the serdev will still in open
>> state
>>
>> 3) step disable will call hdev->close() and the serdev will be closed
>> after hdev->close() for machine with config which results in
>> HCI_QUIRK_NON_PERSISTENT_SETUP is set.
>>
>> 4) step reboot will call qca_serdev_shutdown() which will flush and
>> write the serdev which are closed by above step disable, so cause the
>> UAF issue, namely, kernel crash issue.
>>
>> so this issue is caused by commit B which close the serdev during
>> hdev->close().
>>
>> driver->remove() even is not triggered during above steps.
>>>> Commit C: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on
>>>> closed serdev")
>>>> this commit is to fix issue B which is actually caused by Commit B, but
>>>> it has Fixes tag for Commit A. and it also introduces the regression
>>>> issue A.
>>>>
>>>
>>>
> 
> Reading again the commit message for the UAF fix it sounds like a
> different problem:
> 
>     The driver shutdown callback (which sends EDL_SOC_RESET to the device
>     over serdev) should not be invoked when HCI device is not open (e.g. if
>     hci_dev_open_sync() failed), because the serdev and its TTY are not open
>     either.  Also skip this step if device is powered off
>     (qca_power_shutdown()).
> 
> So if hci_dev_open_sync has failed it says serdev and its TTY will not
> be open either, so I guess that's why HCI_SETUP was added as a
> condition to bail out? So it seems correct to do that although I'd
> change the comments.
> 
yes, agree with you on these points, Zijun's change is able to fix this
different problem as well.
> @Krzysztof Kozlowski do you still have a test setup for 272970be3dab
> ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev"), can you
> try with these changes?
>
Lk Sii May 30, 2024, 1:33 p.m. UTC | #10
On 2024/5/23 08:17, Lk Sii wrote:
> 
> 
> On 2024/5/21 23:48, Luiz Augusto von Dentz wrote:
>> Hi,
>>
>> On Tue, May 21, 2024 at 10:52 AM Lk Sii <lk_sii@163.com> wrote:
>>>
>>>
>>>
>>> On 2024/5/16 23:55, Luiz Augusto von Dentz wrote:
>>>> Hi,
>>>>
>>>> On Thu, May 16, 2024 at 10:57 AM Lk Sii <lk_sii@163.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/5/16 21:31, Zijun Hu wrote:
>>>>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>>>>> serdev") will cause below regression issue:
>>>>>>
>>>>>> BT can't be enabled after below steps:
>>>>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>>>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>>>>
>>>>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>>>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>>>>> but also introduces this regression issue regarding above steps since the
>>>>>> VSC is not sent to reset controller during warm reboot.
>>>>>>
>>>>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>>>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>>>>> this change since the serdev is still opened before it is flushed or wrote.
>>>>>>
>>>>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>>>>> kernel commits:
>>>>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>>>>> implementation for QCA") of bluetooth-next tree.
>>>>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>>>>> implementation for QCA") of linus mainline tree.
>>>>>>
>>>>>> Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Reported-by: Wren Turkal <wt@penguintechs.org>
>>>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
>>>>>> Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
>>>>>> Tested-by: Wren Turkal <wt@penguintechs.org>
>>>>>> ---
>>>>>> V1 -> V2: Add comments and more commit messages
>>>>>>
>>>>>> V1 discussion link:
>>>>>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#t
>>>>>>
>>>>>>  drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
>>>>>>  1 file changed, 15 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
>>>>>> index 0c9c9ee56592..9a0bc86f9aac 100644
>>>>>> --- a/drivers/bluetooth/hci_qca.c
>>>>>> +++ b/drivers/bluetooth/hci_qca.c
>>>>>> @@ -2450,15 +2450,27 @@ static void qca_serdev_shutdown(struct device *dev)
>>>>>>       struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
>>>>>>       struct hci_uart *hu = &qcadev->serdev_hu;
>>>>>>       struct hci_dev *hdev = hu->hdev;
>>>>>> -     struct qca_data *qca = hu->priv;
>>>>>>       const u8 ibs_wake_cmd[] = { 0xFD };
>>>>>>       const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
>>>>>>
>>>>>>       if (qcadev->btsoc_type == QCA_QCA6390) {
>>>>>> -             if (test_bit(QCA_BT_OFF, &qca->flags) ||
>>>>>> -                 !test_bit(HCI_RUNNING, &hdev->flags))
>>>>>> +             /* The purpose of sending the VSC is to reset SOC into a initial
>>>>>> +              * state and the state will ensure next hdev->setup() success.
>>>>>> +              * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
>>>>>> +              * hdev->setup() can do its job regardless of SoC state, so
>>>>>> +              * don't need to send the VSC.
>>>>>> +              * if HCI_SETUP is set, it means that hdev->setup() was never
>>>>>> +              * invoked and the SOC is already in the initial state, so
>>>>>> +              * don't also need to send the VSC.
>>>>>> +              */
>>>>>> +             if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
>>>>>> +                 hci_dev_test_flag(hdev, HCI_SETUP))
>>>>>>                       return;
>>> The main purpose of above checking is NOT to make sure the serdev within
>>> open state as its comments explained.
>>>>>>
>>>>>> +             /* The serdev must be in open state when conrol logic arrives
>>>>>> +              * here, so also fix the use-after-free issue caused by that
>>>>>> +              * the serdev is flushed or wrote after it is closed.
>>>>>> +              */
>>>>>>               serdev_device_write_flush(serdev);
>>>>>>               ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
>>>>>>                                             sizeof(ibs_wake_cmd));
>>>>> i believe Zijun's change is able to fix both below issues and don't
>>>>> introduce new issue.
>>>>>
>>>>> regression issue A:  BT enable failure after warm reboot.
>>>>> issue B:  use-after-free issue, namely, kernel crash.
>>>>>
>>>>>
>>>>> For issue B, i have more findings related to below commits ordered by time.
>>>>>
>>>>> Commit A: 7e7bbddd029b ("Bluetooth: hci_qca: Fix qca6390 enable failure
>>>>> after warm reboot")
>>>>>
>>>>> Commit B: de8892df72be ("Bluetooth: hci_serdev: Close UART port if
>>>>> NON_PERSISTENT_SETUP is set")
>>>>> this commit introduces issue B, it is also not suitable to associate
>>>>> protocol state with state of lower level transport type such as serdev
>>>>> or uart, in my opinion, protocol state should be independent with
>>>>> transport type state, flag HCI_UART_PROTO_READY is for protocol state,
>>>>> it means if protocol hu->proto is initialized and if we can invoke its
>>>>> interfaces.it is common for various kinds of transport types. perhaps,
>>>>> this is the reason why Zijun's change doesn't use flag HCI_UART_PROTO_READY.
>>>>
>>>> Don't really follow you here, if HCI_UART_PROTO_READY indicates the
>>>> protocol state they is even _more_ important to use before invoking
>>>> serdev APIs, so checking for the quirk sound like a problem because:
>>>>
>>>> [1] hci_uart_close
>>>>      /* When QUIRK HCI_QUIRK_NON_PERSISTENT_SETUP is set by driver,
>>>>      * BT SOC is completely powered OFF during BT OFF, holding port
>>>>      * open may drain the battery.
>>>>      */
>>>>     if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks)) {
>>>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>>>>         serdev_device_close(hu->serdev);
>>>>     }
>>>>
>>>> [2] hci_uart_unregister_device
>>>>     if (test_bit(HCI_UART_PROTO_READY, &hu->flags)) {
>>>>         clear_bit(HCI_UART_PROTO_READY, &hu->flags);
>>>>         serdev_device_close(hu->serdev);
>>>>     }
>>>> both case 1 and case 2 were introduced by Commit B in question which
>>> uses protocol state flag HCI_UART_PROTO_READY to track lower level
>>> transport type state, i don't think it is perfect.
>>>
>>> for common files hci_serdev.c and hci_ldisc.c, as you saw, the purpose
>>> of checking HCI_UART_PROTO_READY is to call protocol relevant
>>> interfaces, moreover, these protocol relevant interfaces do not deal
>>> with lower transport state. you maybe even notice below present function
>>> within which lower level serdev is flushed before HCI_UART_PROTO_READY
>>> is checked:
>>>
>>> static int hci_uart_flush(struct hci_dev *hdev)
>>> {
>>> ......
>>>         /* Flush any pending characters in the driver and discipline. */
>>>         serdev_device_write_flush(hu->serdev);
>>>
>>>         if (test_bit(HCI_UART_PROTO_READY, &hu->flags))
>>>                 hu->proto->flush(hu);
>>>
>>>         return 0;
>>> }
>>>
>>> in my opinion, that is why qca_serdev_shutdown() does not check
>>> HCI_UART_PROTO_READY for later lower level serdev operations.
>>>> So only in case 1 checking the quirk is equivalent to
>>>> HCI_UART_PROTO_READY on case 2 it does actually check the quirk and
>>>> will proceed to call serdev_device_close, now perhaps the code is
>>>> assuming that shutdown won't be called after that, but it looks it
>>>> does since:
>>>>
>>> qca_serdev_shutdown() will never be called after case 2 as explained
>>> in the end.
>>>> static void serdev_drv_remove(struct device *dev)
>>>> {
>>>>     const struct serdev_device_driver *sdrv =
>>>> to_serdev_device_driver(dev->driver);
>>>>     if (sdrv->remove)
>>>>         sdrv->remove(to_serdev_device(dev));
>>>>
>>>>     dev_pm_domain_detach(dev, true);
>>>> }
>>>>
>>>> dev_pm_domain_detach says it will power off so I assume that means
>>>> that shutdown will be called _after_ remove, so not I'm not really
>>>> convinced that we can avoid using HCI_UART_PROTO_READY, in fact the
>>>> following sequence might always be triggering:
>>>>
>>> dev_pm_domain_detach() should be irrelevant with qca_serdev_shutdown(),
>>> should not trigger call of qca_serdev_shutdown() as explained in the end
>>>> serdev_drv_remove -> qca_serdev_remove -> hci_uart_unregister_device
>>>> -> serdev_device_close -> qca_close -> kfree(qca)
>>>> dev_pm_domain_detach -> ??? -> qca_serdev_shutdown
>>>>
>>>> If this sequence is correct then qca_serdev_shutdown accessing
>>>> qca_data will always result in a UAF problem.
>>>>
>>> above sequence should not correct as explained below.
>>>
>>> serdev and its driver should also follow below generic device and driver
>>> design.
>>>
>>> 1)
>>> driver->shutdown() will be called during shut-down time at this time
>>> driver->remove() should not have been called.
>>>
>>> 2)
>>> driver->shutdown() is impossible to be called once driver->remove()
>>> was called.
>>>
>>> 3) for serdev, driver->remove() does not trigger call of
>>> driver->shutdown() since PM relevant poweroff is irrelevant with
>>> driver->shutdown() and i also don't find any PM relevant interfaces will
>>> call driver->shutdown().
>>>
>>> i would like to explain issue B based on comments Zijun posted by public
>>> as below:
>>>
>>> issue B actually happens during reboot and let me look at these steps
>>> boot -> enable BT -> disable BT -> reboot.
>>>
>>> 1) step boot will call driver->probe() to register hdev and the serdev
>>> is opened after boot.
>>>
>>> 2) step enable will call hdev->open() and the serdev will still in open
>>> state
>>>
>>> 3) step disable will call hdev->close() and the serdev will be closed
>>> after hdev->close() for machine with config which results in
>>> HCI_QUIRK_NON_PERSISTENT_SETUP is set.
>>>
>>> 4) step reboot will call qca_serdev_shutdown() which will flush and
>>> write the serdev which are closed by above step disable, so cause the
>>> UAF issue, namely, kernel crash issue.
>>>
>>> so this issue is caused by commit B which close the serdev during
>>> hdev->close().
>>>
>>> driver->remove() even is not triggered during above steps.
>>>>> Commit C: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on
>>>>> closed serdev")
>>>>> this commit is to fix issue B which is actually caused by Commit B, but
>>>>> it has Fixes tag for Commit A. and it also introduces the regression
>>>>> issue A.
>>>>>
>>>>
>>>>
>>
>> Reading again the commit message for the UAF fix it sounds like a
>> different problem:
>>
>>     The driver shutdown callback (which sends EDL_SOC_RESET to the device
>>     over serdev) should not be invoked when HCI device is not open (e.g. if
>>     hci_dev_open_sync() failed), because the serdev and its TTY are not open
>>     either.  Also skip this step if device is powered off
>>     (qca_power_shutdown()).
>>
>> So if hci_dev_open_sync has failed it says serdev and its TTY will not
>> be open either, so I guess that's why HCI_SETUP was added as a
>> condition to bail out? So it seems correct to do that although I'd
>> change the comments.
>>
> yes, agree with you on these points, Zijun's change is able to fix this
> different problem as well.
@Luiz:
do you have any other concerns?
how to move forward ?
>> @Krzysztof Kozlowski do you still have a test setup for 272970be3dab
>> ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev"), can you
>> try with these changes?
>>
Luiz Augusto von Dentz June 3, 2024, 7:27 p.m. UTC | #11
Hi,

On Thu, May 30, 2024 at 9:34 AM Lk Sii <lk_sii@163.com> wrote:
> @Luiz:
> do you have any other concerns?
> how to move forward ?

Well I was expecting some answer to Krzysztof comments:

No, test it on the mainline and answer finally, after *five* tries, which
kernel and which hardware did you use for testing this.
Lk Sii June 4, 2024, 1:23 p.m. UTC | #12
On 2024/6/4 03:27, Luiz Augusto von Dentz wrote:
> Hi,
> 
> On Thu, May 30, 2024 at 9:34 AM Lk Sii <lk_sii@163.com> wrote:
>> @Luiz:
>> do you have any other concerns?
>> how to move forward ?
> 
> Well I was expecting some answer to Krzysztof comments:
> 
> No, test it on the mainline and answer finally, after *five* tries, which
> kernel and which hardware did you use for testing this.
> 
That is wonderful, let us solve Krzysztof's concerns containing above
one mentioned.
Lk Sii June 4, 2024, 2:25 p.m. UTC | #13
On 2024/5/22 00:02, Krzysztof Kozlowski wrote:
> On 16/05/2024 15:31, Zijun Hu wrote:
>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>> serdev") will cause below regression issue:
>>
>> BT can't be enabled after below steps:
>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>
>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>> by adding condition to avoid the serdev is flushed or wrote after closed
>> but also introduces this regression issue regarding above steps since the
>> VSC is not sent to reset controller during warm reboot.
>>
>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>> once BT was ever enabled, and the use-after-free issue is also fixed by
>> this change since the serdev is still opened before it is flushed or wrote.
>>
>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>> kernel commits:
> 
> I don't understand how does it solve my question. I asked you: on which
> hardware did you, not the reporter, test?
>It seems Zijun did NOT perform any tests obviously.
All these tests were performed by reporter Wren with her machine
"Dell XPS 13 9310 laptop".
Krzysztof Kozlowski June 4, 2024, 3:18 p.m. UTC | #14
On 04/06/2024 16:25, Lk Sii wrote:
> 
> 
> On 2024/5/22 00:02, Krzysztof Kozlowski wrote:
>> On 16/05/2024 15:31, Zijun Hu wrote:
>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>> serdev") will cause below regression issue:
>>>
>>> BT can't be enabled after below steps:
>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>
>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>> but also introduces this regression issue regarding above steps since the
>>> VSC is not sent to reset controller during warm reboot.
>>>
>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>> this change since the serdev is still opened before it is flushed or wrote.
>>>
>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>> kernel commits:
>>
>> I don't understand how does it solve my question. I asked you: on which
>> hardware did you, not the reporter, test?
>> It seems Zijun did NOT perform any tests obviously.
> All these tests were performed by reporter Wren with her machine
> "Dell XPS 13 9310 laptop".

Wren != Zijun.

> 
> From previous discussion, it seems she have tested this change
> several times with positive results over different trees with her
> machine. i noticed she given you reply for your questions within
> below v1 discussion link as following:
> 
> Here are v1 discussion link.
> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#m7371df555fd58ba215d0da63055134126a43c460
> 
> Here are Krzysztof's questions.
> "I asked already *two times*:
> 1. On which kernel did you test it?
> 2. On which hardware did you test it?"
> 
> Here are Wren's reply for Krzysztof's questions
> "I thought I had already chimed in with this information. I am using a
> Dell XPS 13 9310. It's the only hardware I have access to. I can say
> that the fix seems to work as advertised in that it fixes the warm boot
> issue I have been experiencing."

I asked Zijun, not Wren. I believe all this is tested or done by
Qualcomm on some other kernel, so that's my question.

That's important because Wren did not test particular scenarios, like
PREEMPT_RT or RB5 hardware, but Zijun is claiming problems are solved.
Maybe indeed solved, but if takes one month and still not answer which
kernel you are using, then I am sure: this was nowhere tested by Zijun
on the hardware and on the kernel the Qualcomm wants it to be.

> 
>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>> implementation for QCA") of bluetooth-next tree.
>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>> implementation for QCA") of linus mainline tree.
>>
>> ? Same commit with different hashes? No, it looks like you are working
>> on some downstream tree with cherry picks.
>>
> From Zijun's commit message, for the same commit, it seems
> bluetooth-next tree has different hashes as linus tree.
> not sure if this scenario is normal during some time window.
>> No, test it on mainline and answer finally, after *five* tries, which
>> kernel and which hardware did you use for testing this.
>>
>>
> it seems there are two issues mentioned with Zijun's commit message.
> regression issue A:  BT enable failure after warm reboot.
> issue B:  use-after-free issue, namely, kernel crash.
> 
> @Krzysztof
> which issue to test based on your concerns with mainline tree?

No one tested this on non-laptop platform. Wren did not, which is fine.
Qualcomm should, but since they avoid any talks about it for so long
(plus pushy comments during review, re-spinning v1 suggesting entire
discussion is gone), I do not trust their statements at all.

So really, did anything test it on any Qualcomm embedded platform?
Anyone tested the actual race visible with PREEMPT_RT?

Why Zijun cannot provide answer on which kernel was it tested? Why the
hardware cannot be mentioned?

Best regards,
Krzysztof
Lk Sii June 5, 2024, 1:49 a.m. UTC | #15
On 2024/6/4 23:18, Krzysztof Kozlowski wrote:
> On 04/06/2024 16:25, Lk Sii wrote:
>>
>>
>> On 2024/5/22 00:02, Krzysztof Kozlowski wrote:
>>> On 16/05/2024 15:31, Zijun Hu wrote:
>>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>>> serdev") will cause below regression issue:
>>>>
>>>> BT can't be enabled after below steps:
>>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>>
>>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>>> but also introduces this regression issue regarding above steps since the
>>>> VSC is not sent to reset controller during warm reboot.
>>>>
>>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>>> this change since the serdev is still opened before it is flushed or wrote.
>>>>
>>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>>> kernel commits:
>>>
>>> I don't understand how does it solve my question. I asked you: on which
>>> hardware did you, not the reporter, test?
>>> It seems Zijun did NOT perform any tests obviously.
>> All these tests were performed by reporter Wren with her machine
>> "Dell XPS 13 9310 laptop".
> 
> Wren != Zijun.
> 
>>
>> From previous discussion, it seems she have tested this change
>> several times with positive results over different trees with her
>> machine. i noticed she given you reply for your questions within
>> below v1 discussion link as following:
>>
>> Here are v1 discussion link.
>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#m7371df555fd58ba215d0da63055134126a43c460
>>
>> Here are Krzysztof's questions.
>> "I asked already *two times*:
>> 1. On which kernel did you test it?
>> 2. On which hardware did you test it?"
>>
>> Here are Wren's reply for Krzysztof's questions
>> "I thought I had already chimed in with this information. I am using a
>> Dell XPS 13 9310. It's the only hardware I have access to. I can say
>> that the fix seems to work as advertised in that it fixes the warm boot
>> issue I have been experiencing."
> 
> I asked Zijun, not Wren. I believe all this is tested or done by
> Qualcomm on some other kernel, so that's my question.
>
Zijun is the only guy from Qualcomm who ever joined our discussion,
he ever said he belongs to Bluetooth team, so let us suppose the term
"Qualcomm" you mentioned above is Zijun.

from discussion history. in fact, ALL these tests were performed by
reporter Wren instead of Zijun, and there are also NOT Zijun's Tested-by
tag, so what you believe above is wrong in my opinion.

Only Zijun and reporter were involved during those early debugging days,
Zijun shared changes for reporter to verify with reporter's machine,
then Zijun posted his fixes after debugging and verification were done.

> That's important because Wren did not test particular scenarios, like
> PREEMPT_RT or RB5 hardware, but Zijun is claiming problems are solved.
> Maybe indeed solved, but if takes one month and still not answer which
> kernel you are using, then I am sure: this was nowhere tested by Zijun
> on the hardware and on the kernel the Qualcomm wants it to be.
> 
>>
>>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>>> implementation for QCA") of bluetooth-next tree.
>>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>>> implementation for QCA") of linus mainline tree.
>>>
>>> ? Same commit with different hashes? No, it looks like you are working
>>> on some downstream tree with cherry picks.
>>>
>> From Zijun's commit message, for the same commit, it seems
>> bluetooth-next tree has different hashes as linus tree.
>> not sure if this scenario is normal during some time window.
>>> No, test it on mainline and answer finally, after *five* tries, which
>>> kernel and which hardware did you use for testing this.
>>>
>>>
>> it seems there are two issues mentioned with Zijun's commit message.
>> regression issue A:  BT enable failure after warm reboot.
>> issue B:  use-after-free issue, namely, kernel crash.
>>
>> @Krzysztof
>> which issue to test based on your concerns with mainline tree?
> 
> No one tested this on non-laptop platform. Wren did not, which is fine.
> Qualcomm should, but since they avoid any talks about it for so long
> (plus pushy comments during review, re-spinning v1 suggesting entire
> discussion is gone), I do not trust their statements at all.
>

For issue A:
reporter's tests are enough in my opinion.
Zijun ever said that "he known the root cause and this fix logic was
introduced from the very beginning when he saw reporter's issue
description" by below link:
https://lore.kernel.org/lkml/1d0878e0-d138-4de2-86b8-326ab9ebde3f@quicinc.com/

> So really, did anything test it on any Qualcomm embedded platform?
> Anyone tested the actual race visible with PREEMPT_RT?
> For issue B, it was originally fixed and verified by you,
it is obvious for the root cause and current fix solution after
our discussion.

luzi also ever tried to ask you if you have a chance to verify issue B
with your machine for this change.

> Why Zijun cannot provide answer on which kernel was it tested? Why the
> hardware cannot be mentioned?
> 
i believe zijun never perform any tests for these two issues as
explained above.
> Best regards,
> Krzysztof
Krzysztof Kozlowski June 5, 2024, 7:14 a.m. UTC | #16
On 05/06/2024 03:49, Lk Sii wrote:
> 
> 
> On 2024/6/4 23:18, Krzysztof Kozlowski wrote:
>> On 04/06/2024 16:25, Lk Sii wrote:
>>>
>>>
>>> On 2024/5/22 00:02, Krzysztof Kozlowski wrote:
>>>> On 16/05/2024 15:31, Zijun Hu wrote:
>>>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>>>> serdev") will cause below regression issue:
>>>>>
>>>>> BT can't be enabled after below steps:
>>>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>>>
>>>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>>>> but also introduces this regression issue regarding above steps since the
>>>>> VSC is not sent to reset controller during warm reboot.
>>>>>
>>>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>>>> this change since the serdev is still opened before it is flushed or wrote.
>>>>>
>>>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>>>> kernel commits:
>>>>
>>>> I don't understand how does it solve my question. I asked you: on which
>>>> hardware did you, not the reporter, test?
>>>> It seems Zijun did NOT perform any tests obviously.
>>> All these tests were performed by reporter Wren with her machine
>>> "Dell XPS 13 9310 laptop".
>>
>> Wren != Zijun.
>>
>>>
>>> From previous discussion, it seems she have tested this change
>>> several times with positive results over different trees with her
>>> machine. i noticed she given you reply for your questions within
>>> below v1 discussion link as following:
>>>
>>> Here are v1 discussion link.
>>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#m7371df555fd58ba215d0da63055134126a43c460
>>>
>>> Here are Krzysztof's questions.
>>> "I asked already *two times*:
>>> 1. On which kernel did you test it?
>>> 2. On which hardware did you test it?"
>>>
>>> Here are Wren's reply for Krzysztof's questions
>>> "I thought I had already chimed in with this information. I am using a
>>> Dell XPS 13 9310. It's the only hardware I have access to. I can say
>>> that the fix seems to work as advertised in that it fixes the warm boot
>>> issue I have been experiencing."
>>
>> I asked Zijun, not Wren. I believe all this is tested or done by
>> Qualcomm on some other kernel, so that's my question.
>>
> Zijun is the only guy from Qualcomm who ever joined our discussion,
> he ever said he belongs to Bluetooth team, so let us suppose the term
> "Qualcomm" you mentioned above is Zijun.
> 
> from discussion history. in fact, ALL these tests were performed by
> reporter Wren instead of Zijun, and there are also NOT Zijun's Tested-by
> tag, so what you believe above is wrong in my opinion.

Patch author is supposed to test the code. Are you implying that
Qualcomm Bluetooth team cannot test the patch on any of Qualcomm
Bluetooth devices?

> 
> Only Zijun and reporter were involved during those early debugging days,
> Zijun shared changes for reporter to verify with reporter's machine,
> then Zijun posted his fixes after debugging and verification were done.
> 
>> That's important because Wren did not test particular scenarios, like
>> PREEMPT_RT or RB5 hardware, but Zijun is claiming problems are solved.
>> Maybe indeed solved, but if takes one month and still not answer which
>> kernel you are using, then I am sure: this was nowhere tested by Zijun
>> on the hardware and on the kernel the Qualcomm wants it to be.
>>
>>>
>>>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>>>> implementation for QCA") of bluetooth-next tree.
>>>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>>>> implementation for QCA") of linus mainline tree.
>>>>
>>>> ? Same commit with different hashes? No, it looks like you are working
>>>> on some downstream tree with cherry picks.
>>>>
>>> From Zijun's commit message, for the same commit, it seems
>>> bluetooth-next tree has different hashes as linus tree.
>>> not sure if this scenario is normal during some time window.
>>>> No, test it on mainline and answer finally, after *five* tries, which
>>>> kernel and which hardware did you use for testing this.
>>>>
>>>>
>>> it seems there are two issues mentioned with Zijun's commit message.
>>> regression issue A:  BT enable failure after warm reboot.
>>> issue B:  use-after-free issue, namely, kernel crash.
>>>
>>> @Krzysztof
>>> which issue to test based on your concerns with mainline tree?
>>
>> No one tested this on non-laptop platform. Wren did not, which is fine.
>> Qualcomm should, but since they avoid any talks about it for so long
>> (plus pushy comments during review, re-spinning v1 suggesting entire
>> discussion is gone), I do not trust their statements at all.
>>
> 
> For issue A:
> reporter's tests are enough in my opinion.
> Zijun ever said that "he known the root cause and this fix logic was
> introduced from the very beginning when he saw reporter's issue
> description" by below link:
> https://lore.kernel.org/lkml/1d0878e0-d138-4de2-86b8-326ab9ebde3f@quicinc.com/
> 
>> So really, did anything test it on any Qualcomm embedded platform?
>> Anyone tested the actual race visible with PREEMPT_RT?
>> For issue B, it was originally fixed and verified by you,
> it is obvious for the root cause and current fix solution after
> our discussion.
> 
> luzi also ever tried to ask you if you have a chance to verify issue B
> with your machine for this change.

I tried, but my setup is incomplete since ~half a year and will remain
probably for another short time, depending on ongoing work on power
sequencing. Therefore I cannot test whether anything improves or
deteriorates regarding this patch.

> 
>> Why Zijun cannot provide answer on which kernel was it tested? Why the
>> hardware cannot be mentioned?
>>
> i believe zijun never perform any tests for these two issues as
> explained above.

yeah, and that was worrying me.

Best regards,
Krzysztof
Lk Sii June 6, 2024, 12:54 p.m. UTC | #17
On 2024/6/5 15:14, Krzysztof Kozlowski wrote:
> On 05/06/2024 03:49, Lk Sii wrote:
>>
>>
>> On 2024/6/4 23:18, Krzysztof Kozlowski wrote:
>>> On 04/06/2024 16:25, Lk Sii wrote:
>>>>
>>>>
>>>> On 2024/5/22 00:02, Krzysztof Kozlowski wrote:
>>>>> On 16/05/2024 15:31, Zijun Hu wrote:
>>>>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>>>>> serdev") will cause below regression issue:
>>>>>>
>>>>>> BT can't be enabled after below steps:
>>>>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>>>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>>>>
>>>>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>>>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>>>>> but also introduces this regression issue regarding above steps since the
>>>>>> VSC is not sent to reset controller during warm reboot.
>>>>>>
>>>>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>>>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>>>>> this change since the serdev is still opened before it is flushed or wrote.
>>>>>>
>>>>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>>>>> kernel commits:
>>>>>
>>>>> I don't understand how does it solve my question. I asked you: on which
>>>>> hardware did you, not the reporter, test?
>>>>> It seems Zijun did NOT perform any tests obviously.
>>>> All these tests were performed by reporter Wren with her machine
>>>> "Dell XPS 13 9310 laptop".
>>>
>>> Wren != Zijun.
>>>
>>>>
>>>> From previous discussion, it seems she have tested this change
>>>> several times with positive results over different trees with her
>>>> machine. i noticed she given you reply for your questions within
>>>> below v1 discussion link as following:
>>>>
>>>> Here are v1 discussion link.
>>>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#m7371df555fd58ba215d0da63055134126a43c460
>>>>
>>>> Here are Krzysztof's questions.
>>>> "I asked already *two times*:
>>>> 1. On which kernel did you test it?
>>>> 2. On which hardware did you test it?"
>>>>
>>>> Here are Wren's reply for Krzysztof's questions
>>>> "I thought I had already chimed in with this information. I am using a
>>>> Dell XPS 13 9310. It's the only hardware I have access to. I can say
>>>> that the fix seems to work as advertised in that it fixes the warm boot
>>>> issue I have been experiencing."
>>>
>>> I asked Zijun, not Wren. I believe all this is tested or done by
>>> Qualcomm on some other kernel, so that's my question.
>>>
>> Zijun is the only guy from Qualcomm who ever joined our discussion,
>> he ever said he belongs to Bluetooth team, so let us suppose the term
>> "Qualcomm" you mentioned above is Zijun.
>>
>> from discussion history. in fact, ALL these tests were performed by
>> reporter Wren instead of Zijun, and there are also NOT Zijun's Tested-by
>> tag, so what you believe above is wrong in my opinion.
> 
> Patch author is supposed to test the code. Are you implying that
> Qualcomm Bluetooth team cannot test the patch on any of Qualcomm
> Bluetooth devices?
> 
i guess Zijun did not test the patch on himself based on below reasons:
1) the patch has been tested by reporter with report's machine.
2) perhaps, Zijun is confident about his patch based on his experience.
3) perhaps, it is difficult for Zijun to find a suitable machine to
perform tests, and test machines must have QCA6390 *embedded* and use
Bluez solution.

>>
>> Only Zijun and reporter were involved during those early debugging days,
>> Zijun shared changes for reporter to verify with reporter's machine,
>> then Zijun posted his fixes after debugging and verification were done.
>>
>>> That's important because Wren did not test particular scenarios, like
>>> PREEMPT_RT or RB5 hardware, but Zijun is claiming problems are solved.
>>> Maybe indeed solved, but if takes one month and still not answer which
>>> kernel you are using, then I am sure: this was nowhere tested by Zijun
>>> on the hardware and on the kernel the Qualcomm wants it to be.
>>>
>>>>
>>>>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>>>>> implementation for QCA") of bluetooth-next tree.
>>>>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>>>>> implementation for QCA") of linus mainline tree.
>>>>>
>>>>> ? Same commit with different hashes? No, it looks like you are working
>>>>> on some downstream tree with cherry picks.
>>>>>
>>>> From Zijun's commit message, for the same commit, it seems
>>>> bluetooth-next tree has different hashes as linus tree.
>>>> not sure if this scenario is normal during some time window.
>>>>> No, test it on mainline and answer finally, after *five* tries, which
>>>>> kernel and which hardware did you use for testing this.
>>>>>
>>>>>
>>>> it seems there are two issues mentioned with Zijun's commit message.
>>>> regression issue A:  BT enable failure after warm reboot.
>>>> issue B:  use-after-free issue, namely, kernel crash.
>>>>
>>>> @Krzysztof
>>>> which issue to test based on your concerns with mainline tree?
>>>
>>> No one tested this on non-laptop platform. Wren did not, which is fine.
>>> Qualcomm should, but since they avoid any talks about it for so long
>>> (plus pushy comments during review, re-spinning v1 suggesting entire
>>> discussion is gone), I do not trust their statements at all.
>>>
>>
>> For issue A:
>> reporter's tests are enough in my opinion.
>> Zijun ever said that "he known the root cause and this fix logic was
>> introduced from the very beginning when he saw reporter's issue
>> description" by below link:
>> https://lore.kernel.org/lkml/1d0878e0-d138-4de2-86b8-326ab9ebde3f@quicinc.com/
>>
>>> So really, did anything test it on any Qualcomm embedded platform?
>>> Anyone tested the actual race visible with PREEMPT_RT?
>>> For issue B, it was originally fixed and verified by you,
>> it is obvious for the root cause and current fix solution after
>> our discussion.
>>
>> luzi also ever tried to ask you if you have a chance to verify issue B
>> with your machine for this change.
> 
> I tried, but my setup is incomplete since ~half a year and will remain
> probably for another short time, depending on ongoing work on power
> sequencing. Therefore I cannot test whether anything improves or
> deteriorates regarding this patch.
> 
>>
>>> Why Zijun cannot provide answer on which kernel was it tested? Why the
>>> hardware cannot be mentioned?
>>>
>> i believe zijun never perform any tests for these two issues as
>> explained above.
> 
> yeah, and that was worrying me.
>
Only RB5 has QCA6390 *embedded* among DTS of mainline kernel, but we
can't have a RB5 to test.

Don't worry about due to below points:
1) Reporter have tested it with her machine
2) issue B and relevant fix is obvious after discussion.

> Best regards,
> Krzysztof
Lk Sii June 10, 2024, 1:17 p.m. UTC | #18
On 2024/6/6 20:54, Lk Sii wrote:
> 
> 
> On 2024/6/5 15:14, Krzysztof Kozlowski wrote:
>> On 05/06/2024 03:49, Lk Sii wrote:
>>>
>>>
>>> On 2024/6/4 23:18, Krzysztof Kozlowski wrote:
>>>> On 04/06/2024 16:25, Lk Sii wrote:
>>>>>
>>>>>
>>>>> On 2024/5/22 00:02, Krzysztof Kozlowski wrote:
>>>>>> On 16/05/2024 15:31, Zijun Hu wrote:
>>>>>>> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
>>>>>>> serdev") will cause below regression issue:
>>>>>>>
>>>>>>> BT can't be enabled after below steps:
>>>>>>> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
>>>>>>> if property enable-gpios is not configured within DT|ACPI for QCA6390.
>>>>>>>
>>>>>>> The commit is to fix a use-after-free issue within qca_serdev_shutdown()
>>>>>>> by adding condition to avoid the serdev is flushed or wrote after closed
>>>>>>> but also introduces this regression issue regarding above steps since the
>>>>>>> VSC is not sent to reset controller during warm reboot.
>>>>>>>
>>>>>>> Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
>>>>>>> once BT was ever enabled, and the use-after-free issue is also fixed by
>>>>>>> this change since the serdev is still opened before it is flushed or wrote.
>>>>>>>
>>>>>>> Verified by the reported machine Dell XPS 13 9310 laptop over below two
>>>>>>> kernel commits:
>>>>>>
>>>>>> I don't understand how does it solve my question. I asked you: on which
>>>>>> hardware did you, not the reporter, test?
>>>>>> It seems Zijun did NOT perform any tests obviously.
>>>>> All these tests were performed by reporter Wren with her machine
>>>>> "Dell XPS 13 9310 laptop".
>>>>
>>>> Wren != Zijun.
>>>>
>>>>>
>>>>> From previous discussion, it seems she have tested this change
>>>>> several times with positive results over different trees with her
>>>>> machine. i noticed she given you reply for your questions within
>>>>> below v1 discussion link as following:
>>>>>
>>>>> Here are v1 discussion link.
>>>>> https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31ebe@163.com/T/#m7371df555fd58ba215d0da63055134126a43c460
>>>>>
>>>>> Here are Krzysztof's questions.
>>>>> "I asked already *two times*:
>>>>> 1. On which kernel did you test it?
>>>>> 2. On which hardware did you test it?"
>>>>>
>>>>> Here are Wren's reply for Krzysztof's questions
>>>>> "I thought I had already chimed in with this information. I am using a
>>>>> Dell XPS 13 9310. It's the only hardware I have access to. I can say
>>>>> that the fix seems to work as advertised in that it fixes the warm boot
>>>>> issue I have been experiencing."
>>>>
>>>> I asked Zijun, not Wren. I believe all this is tested or done by
>>>> Qualcomm on some other kernel, so that's my question.
>>>>
>>> Zijun is the only guy from Qualcomm who ever joined our discussion,
>>> he ever said he belongs to Bluetooth team, so let us suppose the term
>>> "Qualcomm" you mentioned above is Zijun.
>>>
>>> from discussion history. in fact, ALL these tests were performed by
>>> reporter Wren instead of Zijun, and there are also NOT Zijun's Tested-by
>>> tag, so what you believe above is wrong in my opinion.
>>
>> Patch author is supposed to test the code. Are you implying that
>> Qualcomm Bluetooth team cannot test the patch on any of Qualcomm
>> Bluetooth devices?
>>
> i guess Zijun did not test the patch on himself based on below reasons:
> 1) the patch has been tested by reporter with report's machine.
> 2) perhaps, Zijun is confident about his patch based on his experience.
> 3) perhaps, it is difficult for Zijun to find a suitable machine to
> perform tests, and test machines must have QCA6390 *embedded* and use
> Bluez solution.
> 
>>>
>>> Only Zijun and reporter were involved during those early debugging days,
>>> Zijun shared changes for reporter to verify with reporter's machine,
>>> then Zijun posted his fixes after debugging and verification were done.
>>>
>>>> That's important because Wren did not test particular scenarios, like
>>>> PREEMPT_RT or RB5 hardware, but Zijun is claiming problems are solved.
>>>> Maybe indeed solved, but if takes one month and still not answer which
>>>> kernel you are using, then I am sure: this was nowhere tested by Zijun
>>>> on the hardware and on the kernel the Qualcomm wants it to be.
>>>>
>>>>>
>>>>>>> commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
>>>>>>> implementation for QCA") of bluetooth-next tree.
>>>>>>> commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
>>>>>>> implementation for QCA") of linus mainline tree.
>>>>>>
>>>>>> ? Same commit with different hashes? No, it looks like you are working
>>>>>> on some downstream tree with cherry picks.
>>>>>>
>>>>> From Zijun's commit message, for the same commit, it seems
>>>>> bluetooth-next tree has different hashes as linus tree.
>>>>> not sure if this scenario is normal during some time window.
>>>>>> No, test it on mainline and answer finally, after *five* tries, which
>>>>>> kernel and which hardware did you use for testing this.
>>>>>>
>>>>>>
>>>>> it seems there are two issues mentioned with Zijun's commit message.
>>>>> regression issue A:  BT enable failure after warm reboot.
>>>>> issue B:  use-after-free issue, namely, kernel crash.
>>>>>
>>>>> @Krzysztof
>>>>> which issue to test based on your concerns with mainline tree?
>>>>
>>>> No one tested this on non-laptop platform. Wren did not, which is fine.
>>>> Qualcomm should, but since they avoid any talks about it for so long
>>>> (plus pushy comments during review, re-spinning v1 suggesting entire
>>>> discussion is gone), I do not trust their statements at all.
>>>>
>>>
>>> For issue A:
>>> reporter's tests are enough in my opinion.
>>> Zijun ever said that "he known the root cause and this fix logic was
>>> introduced from the very beginning when he saw reporter's issue
>>> description" by below link:
>>> https://lore.kernel.org/lkml/1d0878e0-d138-4de2-86b8-326ab9ebde3f@quicinc.com/
>>>
>>>> So really, did anything test it on any Qualcomm embedded platform?
>>>> Anyone tested the actual race visible with PREEMPT_RT?
>>>> For issue B, it was originally fixed and verified by you,
>>> it is obvious for the root cause and current fix solution after
>>> our discussion.
>>>
>>> luzi also ever tried to ask you if you have a chance to verify issue B
>>> with your machine for this change.
>>
>> I tried, but my setup is incomplete since ~half a year and will remain
>> probably for another short time, depending on ongoing work on power
>> sequencing. Therefore I cannot test whether anything improves or
>> deteriorates regarding this patch.
>>
>>>
>>>> Why Zijun cannot provide answer on which kernel was it tested? Why the
>>>> hardware cannot be mentioned?
>>>>
>>> i believe zijun never perform any tests for these two issues as
>>> explained above.
>>
>> yeah, and that was worrying me.
>>
> Only RB5 has QCA6390 *embedded* among DTS of mainline kernel, but we
> can't have a RB5 to test.
> 
> Don't worry about due to below points:
> 1) Reporter have tested it with her machine
> 2) issue B and relevant fix is obvious after discussion.
> 
I believe we have had too much discussion for this simple change.
@Krzysztof
do you have any other concerns?
thanks
>> Best regards,
>> Krzysztof
>
Krzysztof Kozlowski June 10, 2024, 1:24 p.m. UTC | #19
On 10/06/2024 15:17, Lk Sii wrote:
>>>>> Why Zijun cannot provide answer on which kernel was it tested? Why the
>>>>> hardware cannot be mentioned?
>>>>>
>>>> i believe zijun never perform any tests for these two issues as
>>>> explained above.
>>>
>>> yeah, and that was worrying me.
>>>
>> Only RB5 has QCA6390 *embedded* among DTS of mainline kernel, but we
>> can't have a RB5 to test.
>>
>> Don't worry about due to below points:
>> 1) Reporter have tested it with her machine
>> 2) issue B and relevant fix is obvious after discussion.
>>
> I believe we have had too much discussion for this simple change.
> @Krzysztof
> do you have any other concerns?

No, nothing from me.

Best regards,
Krzysztof
Luiz Augusto von Dentz June 10, 2024, 2:02 p.m. UTC | #20
Hi Krzysztof,

On Mon, Jun 10, 2024 at 9:24 AM Krzysztof Kozlowski
<krzysztof.kozlowski@linaro.org> wrote:
>
> On 10/06/2024 15:17, Lk Sii wrote:
> >>>>> Why Zijun cannot provide answer on which kernel was it tested? Why the
> >>>>> hardware cannot be mentioned?
> >>>>>
> >>>> i believe zijun never perform any tests for these two issues as
> >>>> explained above.
> >>>
> >>> yeah, and that was worrying me.
> >>>
> >> Only RB5 has QCA6390 *embedded* among DTS of mainline kernel, but we
> >> can't have a RB5 to test.
> >>
> >> Don't worry about due to below points:
> >> 1) Reporter have tested it with her machine
> >> 2) issue B and relevant fix is obvious after discussion.
> >>
> > I believe we have had too much discussion for this simple change.
> > @Krzysztof
> > do you have any other concerns?
>
> No, nothing from me.

Ok, but I guess since you didn't sign off that means you are still
unconvinced that this should be applied? I could try pushing it to
bluetooth-next to check if it blows up on the next merge window, but
it is not that nice to have things completely untested, as far
upstream goes, being pushed that way.
Krzysztof Kozlowski June 10, 2024, 2:19 p.m. UTC | #21
On 10/06/2024 16:02, Luiz Augusto von Dentz wrote:
> Hi Krzysztof,
> 
> On Mon, Jun 10, 2024 at 9:24 AM Krzysztof Kozlowski
> <krzysztof.kozlowski@linaro.org> wrote:
>>
>> On 10/06/2024 15:17, Lk Sii wrote:
>>>>>>> Why Zijun cannot provide answer on which kernel was it tested? Why the
>>>>>>> hardware cannot be mentioned?
>>>>>>>
>>>>>> i believe zijun never perform any tests for these two issues as
>>>>>> explained above.
>>>>>
>>>>> yeah, and that was worrying me.
>>>>>
>>>> Only RB5 has QCA6390 *embedded* among DTS of mainline kernel, but we
>>>> can't have a RB5 to test.
>>>>
>>>> Don't worry about due to below points:
>>>> 1) Reporter have tested it with her machine
>>>> 2) issue B and relevant fix is obvious after discussion.
>>>>
>>> I believe we have had too much discussion for this simple change.
>>> @Krzysztof
>>> do you have any other concerns?
>>
>> No, nothing from me.
> 
> Ok, but I guess since you didn't sign off that means you are still
> unconvinced that this should be applied? I could try pushing it to
> bluetooth-next to check if it blows up on the next merge window, but
> it is not that nice to have things completely untested, as far
> upstream goes, being pushed that way.

The patch is fixing at least one case, so at least one group of users
would be happy with it.

Best regards,
Krzysztof
patchwork-bot+bluetooth@kernel.org June 10, 2024, 3:50 p.m. UTC | #22
Hello:

This patch was applied to bluetooth/bluetooth-next.git (master)
by Luiz Augusto von Dentz <luiz.von.dentz@intel.com>:

On Thu, 16 May 2024 21:31:34 +0800 you wrote:
> Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
> serdev") will cause below regression issue:
> 
> BT can't be enabled after below steps:
> cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
> if property enable-gpios is not configured within DT|ACPI for QCA6390.
> 
> [...]

Here is the summary with links:
  - [v2] Bluetooth: qca: Fix BT enable failure again for QCA6390 after warm reboot
    https://git.kernel.org/bluetooth/bluetooth-next/c/d2118673f3ae

You are awesome, thank you!
diff mbox series

Patch

diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
index 0c9c9ee56592..9a0bc86f9aac 100644
--- a/drivers/bluetooth/hci_qca.c
+++ b/drivers/bluetooth/hci_qca.c
@@ -2450,15 +2450,27 @@  static void qca_serdev_shutdown(struct device *dev)
 	struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
 	struct hci_uart *hu = &qcadev->serdev_hu;
 	struct hci_dev *hdev = hu->hdev;
-	struct qca_data *qca = hu->priv;
 	const u8 ibs_wake_cmd[] = { 0xFD };
 	const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
 
 	if (qcadev->btsoc_type == QCA_QCA6390) {
-		if (test_bit(QCA_BT_OFF, &qca->flags) ||
-		    !test_bit(HCI_RUNNING, &hdev->flags))
+		/* The purpose of sending the VSC is to reset SOC into a initial
+		 * state and the state will ensure next hdev->setup() success.
+		 * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
+		 * hdev->setup() can do its job regardless of SoC state, so
+		 * don't need to send the VSC.
+		 * if HCI_SETUP is set, it means that hdev->setup() was never
+		 * invoked and the SOC is already in the initial state, so
+		 * don't also need to send the VSC.
+		 */
+		if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
+		    hci_dev_test_flag(hdev, HCI_SETUP))
 			return;
 
+		/* The serdev must be in open state when conrol logic arrives
+		 * here, so also fix the use-after-free issue caused by that
+		 * the serdev is flushed or wrote after it is closed.
+		 */
 		serdev_device_write_flush(serdev);
 		ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
 					      sizeof(ibs_wake_cmd));