Message ID | 20250430111617.1151390-1-quic_cxin@quicinc.com |
---|---|
State | New |
Headers | show |
Series | [v1] tty: serdev: serdev-ttyport: Fix use-after-free in ttyport_close() due to uninitialized serport->tty | expand |
On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: > On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: >> When ttyport_open() fails to initialize a tty device, serport->tty is not >> --- a/drivers/tty/serdev/serdev-ttyport.c >> +++ b/drivers/tty/serdev/serdev-ttyport.c >> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) >> { >> struct serport *serport = serdev_controller_get_drvdata(ctrl); >> struct tty_struct *tty = serport->tty; >> + if (!tty) { >> + dev_err(&ctrl->dev, "tty is null\n"); >> + return; >> + } > > What prevents tty from going NULL right after you just checked this? First sorry for reply so late for I have a long statutory holidays. Maybe I don't get your point. From my side, there is nothing to prevent it. Check here is to avoid code go on if tty is NULL. > > And why print out that message, what can userspace do with it? > I add the print just ref to code in other place. This can't be used by userspace, but it can be used in DMesg log when system crashes. >> >> tty_driver_flush_buffer(tty); >> } >> @@ -108,8 +112,10 @@ static int ttyport_open(struct serdev_controller *ctrl) >> int ret; >> >> tty = tty_init_dev(serport->tty_drv, serport->tty_idx); >> - if (IS_ERR(tty)) >> + if (IS_ERR(tty)) { >> + serport->tty = NULL; >> return PTR_ERR(tty); >> + } >> serport->tty = tty; >> >> if (!tty->ops->open || !tty->ops->close) { >> @@ -156,6 +162,11 @@ static void ttyport_close(struct serdev_controller *ctrl) >> >> clear_bit(SERPORT_ACTIVE, &serport->flags); >> >> + if (!tty) { >> + dev_err(&ctrl->dev, "tty is null\n"); >> + return; >> + } > > Again, what prevents it from changing right after you just checked it? Same with above, there is nothing prevent it. Check here is to avoid code go on if tty is NULL. The check is for changes in ttyport_open(). In my project, it's possible that ttyport_close() and ttyport_write_flush() get called after ttyport_open() failed, at which time tty is invalid. thanks, Xin
On Thu, May 08, 2025 at 05:29:18PM +0800, Xin Chen wrote: > > On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: > > On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: > >> When ttyport_open() fails to initialize a tty device, serport->tty is not > >> --- a/drivers/tty/serdev/serdev-ttyport.c > >> +++ b/drivers/tty/serdev/serdev-ttyport.c > >> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) > >> { > >> struct serport *serport = serdev_controller_get_drvdata(ctrl); > >> struct tty_struct *tty = serport->tty; > >> + if (!tty) { > >> + dev_err(&ctrl->dev, "tty is null\n"); > >> + return; > >> + } > > > > What prevents tty from going NULL right after you just checked this? > > First sorry for reply so late for I have a long statutory holidays. > Maybe I don't get your point. From my side, there is nothing to prevent it. > Check here is to avoid code go on if tty is NULL. Yes, but the problem is, serport->tty could change to be NULL right after you check it, so you have not removed the real race that can happen here. There is no lock, so by adding this check you are only reducing the risk of the problem happening, not actually fixing the issue so that it will never happen. Please fix it so that this can never happen. thanks, greg k-h
On 5/14/2025 5:14 PM, Xin Chen wrote: > > > On 5/8/2025 5:41 PM, Greg Kroah-Hartman wrote: >> On Thu, May 08, 2025 at 05:29:18PM +0800, Xin Chen wrote: >>> >>> On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: >>>> On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: >>>>> When ttyport_open() fails to initialize a tty device, serport->tty is not >>>>> --- a/drivers/tty/serdev/serdev-ttyport.c >>>>> +++ b/drivers/tty/serdev/serdev-ttyport.c >>>>> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) >>>>> { >>>>> struct serport *serport = serdev_controller_get_drvdata(ctrl); >>>>> struct tty_struct *tty = serport->tty; >>>>> + if (!tty) { >>>>> + dev_err(&ctrl->dev, "tty is null\n"); >>>>> + return; >>>>> + } >>>> >>>> What prevents tty from going NULL right after you just checked this? >>> >>> First sorry for reply so late for I have a long statutory holidays. >>> Maybe I don't get your point. From my side, there is nothing to prevent it. >>> Check here is to avoid code go on if tty is NULL. >> >> Yes, but the problem is, serport->tty could change to be NULL right >> after you check it, so you have not removed the real race that can >> happen here. There is no lock, so by adding this check you are only >> reducing the risk of the problem happening, not actually fixing the >> issue so that it will never happen. >> >> Please fix it so that this can never happen. >> > > Actually I have never thought the race condition issue since the crash I met is > not caused by race condition. It's caused due to Bluetooth driver call > ttyport_close() after ttyport_open() failed. This two action happen one after > another in one thread and it seems impossible to have race condition. And with > my fix the crash doesn't happen again in several test of same case. > > Let me introduce the complete process for you: > 1) hci_dev_open_sync()-> > hci_dev_init_sync()->hci_dev_setup_sync()->hdev->setup()(hci_uart_setup)->qca_setup(), > here in qca_setup(), qca_read_soc_version() fails and goto out, then calls > serdev_device_close() to close tty normally. And then call serdev_device_open() > to retry. > 2) serdev_device_open() fails due to tty_init_dev() fails, then tty gets > released, which means this time the tty has been freed succesfully. > 3) Return back to upper func hci_dev_open_sync(), > hdev->close()(hci_uart_close) is called. And hci_uart_close calls > hci_uart_flush() and serdev_device_close(). serdev_device_close() tries to close > tty again, it's calltrace is serdev_device_close()->ttyport_close()->tty_lock(), > tty_unlock(), tty_release_struct(). The four funcs hci_uart_flush(), tty_lock(), > tty_unlock(), tty_release_struct() read tty pointer's value, which is invalid > and causes crash. > Hi Greg, could you please take some time to review my reply? Thanks very much!
On Fri, May 23, 2025 at 10:52:27AM +0800, Xin Chen wrote: > > > On 5/14/2025 5:14 PM, Xin Chen wrote: > > > > > > On 5/8/2025 5:41 PM, Greg Kroah-Hartman wrote: > >> On Thu, May 08, 2025 at 05:29:18PM +0800, Xin Chen wrote: > >>> > >>> On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: > >>>> On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: > >>>>> When ttyport_open() fails to initialize a tty device, serport->tty is not > >>>>> --- a/drivers/tty/serdev/serdev-ttyport.c > >>>>> +++ b/drivers/tty/serdev/serdev-ttyport.c > >>>>> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) > >>>>> { > >>>>> struct serport *serport = serdev_controller_get_drvdata(ctrl); > >>>>> struct tty_struct *tty = serport->tty; > >>>>> + if (!tty) { > >>>>> + dev_err(&ctrl->dev, "tty is null\n"); > >>>>> + return; > >>>>> + } > >>>> > >>>> What prevents tty from going NULL right after you just checked this? > >>> > >>> First sorry for reply so late for I have a long statutory holidays. > >>> Maybe I don't get your point. From my side, there is nothing to prevent it. > >>> Check here is to avoid code go on if tty is NULL. > >> > >> Yes, but the problem is, serport->tty could change to be NULL right > >> after you check it, so you have not removed the real race that can > >> happen here. There is no lock, so by adding this check you are only > >> reducing the risk of the problem happening, not actually fixing the > >> issue so that it will never happen. > >> > >> Please fix it so that this can never happen. > >> > > > > Actually I have never thought the race condition issue since the crash I met is > > not caused by race condition. It's caused due to Bluetooth driver call > > ttyport_close() after ttyport_open() failed. This two action happen one after > > another in one thread and it seems impossible to have race condition. And with > > my fix the crash doesn't happen again in several test of same case. > > > > Let me introduce the complete process for you: > > 1) hci_dev_open_sync()-> > > hci_dev_init_sync()->hci_dev_setup_sync()->hdev->setup()(hci_uart_setup)->qca_setup(), > > here in qca_setup(), qca_read_soc_version() fails and goto out, then calls > > serdev_device_close() to close tty normally. And then call serdev_device_open() > > to retry. > > 2) serdev_device_open() fails due to tty_init_dev() fails, then tty gets > > released, which means this time the tty has been freed succesfully. > > 3) Return back to upper func hci_dev_open_sync(), > > hdev->close()(hci_uart_close) is called. And hci_uart_close calls > > hci_uart_flush() and serdev_device_close(). serdev_device_close() tries to close > > tty again, it's calltrace is serdev_device_close()->ttyport_close()->tty_lock(), > > tty_unlock(), tty_release_struct(). The four funcs hci_uart_flush(), tty_lock(), > > tty_unlock(), tty_release_struct() read tty pointer's value, which is invalid > > and causes crash. > > > > Hi Greg, could you please take some time to review my reply? I am not disputing the fact that there is a bug here, I'm just saying that you can't test for a value and then act on it without a lock protecting that action because the value can be changed right after you test for it. You might not see this in your testing, as you have narrowed the window that the value can change, but you have not solved the issue properly, right? thanks, greg k-h
On Thu, May 29, 2025 at 11:07:25AM +0200, Greg Kroah-Hartman wrote: > On Fri, May 23, 2025 at 10:52:27AM +0800, Xin Chen wrote: > > > > > > On 5/14/2025 5:14 PM, Xin Chen wrote: > > > > > > > > > On 5/8/2025 5:41 PM, Greg Kroah-Hartman wrote: > > >> On Thu, May 08, 2025 at 05:29:18PM +0800, Xin Chen wrote: > > >>> > > >>> On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: > > >>>> On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: > > >>>>> When ttyport_open() fails to initialize a tty device, serport->tty is not > > >>>>> --- a/drivers/tty/serdev/serdev-ttyport.c > > >>>>> +++ b/drivers/tty/serdev/serdev-ttyport.c > > >>>>> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) > > >>>>> { > > >>>>> struct serport *serport = serdev_controller_get_drvdata(ctrl); > > >>>>> struct tty_struct *tty = serport->tty; > > >>>>> + if (!tty) { > > >>>>> + dev_err(&ctrl->dev, "tty is null\n"); > > >>>>> + return; > > >>>>> + } > > >>>> > > >>>> What prevents tty from going NULL right after you just checked this? > > >>> > > >>> First sorry for reply so late for I have a long statutory holidays. > > >>> Maybe I don't get your point. From my side, there is nothing to prevent it. > > >>> Check here is to avoid code go on if tty is NULL. > > >> > > >> Yes, but the problem is, serport->tty could change to be NULL right > > >> after you check it, so you have not removed the real race that can > > >> happen here. There is no lock, so by adding this check you are only > > >> reducing the risk of the problem happening, not actually fixing the > > >> issue so that it will never happen. > > >> > > >> Please fix it so that this can never happen. > > >> > > > > > > Actually I have never thought the race condition issue since the crash I met is > > > not caused by race condition. It's caused due to Bluetooth driver call > > > ttyport_close() after ttyport_open() failed. This two action happen one after > > > another in one thread and it seems impossible to have race condition. And with > > > my fix the crash doesn't happen again in several test of same case. > > > > > > Let me introduce the complete process for you: > > > 1) hci_dev_open_sync()-> > > > hci_dev_init_sync()->hci_dev_setup_sync()->hdev->setup()(hci_uart_setup)->qca_setup(), > > > here in qca_setup(), qca_read_soc_version() fails and goto out, then calls > > > serdev_device_close() to close tty normally. And then call serdev_device_open() > > > to retry. Wait, what? Why is qca_read_soc_version() failing? Why are you retrying multiple times until either you run out of attempts? Why are you closing the port and then opening it again right away? What close/open pair seems totally unnecessary, why do that at all? If I read that function qca_setup(), it can NEVER detect if a failure really happened (i.e. if it does run out of retries, you just plow on and keep going and keep on registering things and THEN return an error for some reason. In other words, the error handling in qca_setup() is very suspect, why not fix all of that up first? thanks, greg k-h
On 5/29/2025 5:07 PM, Greg Kroah-Hartman wrote: > On Fri, May 23, 2025 at 10:52:27AM +0800, Xin Chen wrote: >> >> >> On 5/14/2025 5:14 PM, Xin Chen wrote: >>> >>> >>> On 5/8/2025 5:41 PM, Greg Kroah-Hartman wrote: >>>> On Thu, May 08, 2025 at 05:29:18PM +0800, Xin Chen wrote: >>>>> >>>>> On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: >>>>>> On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: >>>>>>> When ttyport_open() fails to initialize a tty device, serport->tty is not >>>>>>> --- a/drivers/tty/serdev/serdev-ttyport.c >>>>>>> +++ b/drivers/tty/serdev/serdev-ttyport.c >>>>>>> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) >>>>>>> { >>>>>>> struct serport *serport = serdev_controller_get_drvdata(ctrl); >>>>>>> struct tty_struct *tty = serport->tty; >>>>>>> + if (!tty) { >>>>>>> + dev_err(&ctrl->dev, "tty is null\n"); >>>>>>> + return; >>>>>>> + } >>>>>> >>>>>> What prevents tty from going NULL right after you just checked this? >>>>> >>>>> First sorry for reply so late for I have a long statutory holidays. >>>>> Maybe I don't get your point. From my side, there is nothing to prevent it. >>>>> Check here is to avoid code go on if tty is NULL. >>>> >>>> Yes, but the problem is, serport->tty could change to be NULL right >>>> after you check it, so you have not removed the real race that can >>>> happen here. There is no lock, so by adding this check you are only >>>> reducing the risk of the problem happening, not actually fixing the >>>> issue so that it will never happen. >>>> >>>> Please fix it so that this can never happen. >>>> >>> >>> Actually I have never thought the race condition issue since the crash I met is >>> not caused by race condition. It's caused due to Bluetooth driver call >>> ttyport_close() after ttyport_open() failed. This two action happen one after >>> another in one thread and it seems impossible to have race condition. And with >>> my fix the crash doesn't happen again in several test of same case. >>> >>> Let me introduce the complete process for you: >>> 1) hci_dev_open_sync()-> >>> hci_dev_init_sync()->hci_dev_setup_sync()->hdev->setup()(hci_uart_setup)->qca_setup(), >>> here in qca_setup(), qca_read_soc_version() fails and goto out, then calls >>> serdev_device_close() to close tty normally. And then call serdev_device_open() >>> to retry. >>> 2) serdev_device_open() fails due to tty_init_dev() fails, then tty gets >>> released, which means this time the tty has been freed succesfully. >>> 3) Return back to upper func hci_dev_open_sync(), >>> hdev->close()(hci_uart_close) is called. And hci_uart_close calls >>> hci_uart_flush() and serdev_device_close(). serdev_device_close() tries to close >>> tty again, it's calltrace is serdev_device_close()->ttyport_close()->tty_lock(), >>> tty_unlock(), tty_release_struct(). The four funcs hci_uart_flush(), tty_lock(), >>> tty_unlock(), tty_release_struct() read tty pointer's value, which is invalid >>> and causes crash. >>> >> >> Hi Greg, could you please take some time to review my reply? > > I am not disputing the fact that there is a bug here, I'm just saying > that you can't test for a value and then act on it without a lock > protecting that action because the value can be changed right after you > test for it. > > You might not see this in your testing, as you have narrowed the window > that the value can change, but you have not solved the issue properly, > right? > > thanks, > > greg k-h
On 5/29/2025 5:41 PM, Greg Kroah-Hartman wrote: > On Thu, May 29, 2025 at 11:07:25AM +0200, Greg Kroah-Hartman wrote: >> On Fri, May 23, 2025 at 10:52:27AM +0800, Xin Chen wrote: >>> >>> >>> On 5/14/2025 5:14 PM, Xin Chen wrote: >>>> >>>> >>>> On 5/8/2025 5:41 PM, Greg Kroah-Hartman wrote: >>>>> On Thu, May 08, 2025 at 05:29:18PM +0800, Xin Chen wrote: >>>>>> >>>>>> On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: >>>>>>> On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: >>>>>>>> When ttyport_open() fails to initialize a tty device, serport->tty is not >>>>>>>> --- a/drivers/tty/serdev/serdev-ttyport.c >>>>>>>> +++ b/drivers/tty/serdev/serdev-ttyport.c >>>>>>>> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) >>>>>>>> { >>>>>>>> struct serport *serport = serdev_controller_get_drvdata(ctrl); >>>>>>>> struct tty_struct *tty = serport->tty; >>>>>>>> + if (!tty) { >>>>>>>> + dev_err(&ctrl->dev, "tty is null\n"); >>>>>>>> + return; >>>>>>>> + } >>>>>>> >>>>>>> What prevents tty from going NULL right after you just checked this? >>>>>> >>>>>> First sorry for reply so late for I have a long statutory holidays. >>>>>> Maybe I don't get your point. From my side, there is nothing to prevent it. >>>>>> Check here is to avoid code go on if tty is NULL. >>>>> >>>>> Yes, but the problem is, serport->tty could change to be NULL right >>>>> after you check it, so you have not removed the real race that can >>>>> happen here. There is no lock, so by adding this check you are only >>>>> reducing the risk of the problem happening, not actually fixing the >>>>> issue so that it will never happen. >>>>> >>>>> Please fix it so that this can never happen. >>>>> >>>> >>>> Actually I have never thought the race condition issue since the crash I met is >>>> not caused by race condition. It's caused due to Bluetooth driver call >>>> ttyport_close() after ttyport_open() failed. This two action happen one after >>>> another in one thread and it seems impossible to have race condition. And with >>>> my fix the crash doesn't happen again in several test of same case. >>>> >>>> Let me introduce the complete process for you: >>>> 1) hci_dev_open_sync()-> >>>> hci_dev_init_sync()->hci_dev_setup_sync()->hdev->setup()(hci_uart_setup)->qca_setup(), >>>> here in qca_setup(), qca_read_soc_version() fails and goto out, then calls >>>> serdev_device_close() to close tty normally. And then call serdev_device_open() >>>> to retry. > > Wait, what? Why is qca_read_soc_version() failing? Actually I have not root cause why qca_read_soc_version() fails of __hci_cmd_sync_ev(). It may be relative to FW issue. > Why are you retrying multiple times until either you run out of attempts? This is a retry mechanism. I find the reason in the change commit message "Currently driver only retries to download FW if FW downloading is failed. Sometimes observed command timeout for version request command, if this happen on some platforms during boot time, then a reboot is needed to turn ON BT. Instead to avoid a reboot, now extended retry logic for version request command too." > Why are you closing the port and then opening it again right away? This is a retry mechanism as above said. Do you mean there should be a gap between close and open? The change owner maybe don't think about this issue. > What close/open pair seems totally unnecessary, why do that at all? > > If I read that function qca_setup(), it can NEVER detect if a failure > really happened (i.e. if it does run out of retries, you just plow on > and keep going and keep on registering things and THEN return an error > for some reason. > > In other words, the error handling in qca_setup() is very suspect, why > not fix all of that up first? > qca_read_soc_version() in qca_setup() can detect whether the hci_dev is set up successfully. If if fails then a failure happens. You mean I should fix why qca_read_soc_version() fails? Thanks, Xin
On Fri, May 30, 2025 at 04:11:20PM +0800, Xin Chen wrote: > > > On 5/29/2025 5:07 PM, Greg Kroah-Hartman wrote: > > On Fri, May 23, 2025 at 10:52:27AM +0800, Xin Chen wrote: > >> > >> > >> On 5/14/2025 5:14 PM, Xin Chen wrote: > >>> > >>> > >>> On 5/8/2025 5:41 PM, Greg Kroah-Hartman wrote: > >>>> On Thu, May 08, 2025 at 05:29:18PM +0800, Xin Chen wrote: > >>>>> > >>>>> On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: > >>>>>> On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: > >>>>>>> When ttyport_open() fails to initialize a tty device, serport->tty is not > >>>>>>> --- a/drivers/tty/serdev/serdev-ttyport.c > >>>>>>> +++ b/drivers/tty/serdev/serdev-ttyport.c > >>>>>>> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) > >>>>>>> { > >>>>>>> struct serport *serport = serdev_controller_get_drvdata(ctrl); > >>>>>>> struct tty_struct *tty = serport->tty; > >>>>>>> + if (!tty) { > >>>>>>> + dev_err(&ctrl->dev, "tty is null\n"); > >>>>>>> + return; > >>>>>>> + } > >>>>>> > >>>>>> What prevents tty from going NULL right after you just checked this? > >>>>> > >>>>> First sorry for reply so late for I have a long statutory holidays. > >>>>> Maybe I don't get your point. From my side, there is nothing to prevent it. > >>>>> Check here is to avoid code go on if tty is NULL. > >>>> > >>>> Yes, but the problem is, serport->tty could change to be NULL right > >>>> after you check it, so you have not removed the real race that can > >>>> happen here. There is no lock, so by adding this check you are only > >>>> reducing the risk of the problem happening, not actually fixing the > >>>> issue so that it will never happen. > >>>> > >>>> Please fix it so that this can never happen. > >>>> > >>> > >>> Actually I have never thought the race condition issue since the crash I met is > >>> not caused by race condition. It's caused due to Bluetooth driver call > >>> ttyport_close() after ttyport_open() failed. This two action happen one after > >>> another in one thread and it seems impossible to have race condition. And with > >>> my fix the crash doesn't happen again in several test of same case. > >>> > >>> Let me introduce the complete process for you: > >>> 1) hci_dev_open_sync()-> > >>> hci_dev_init_sync()->hci_dev_setup_sync()->hdev->setup()(hci_uart_setup)->qca_setup(), > >>> here in qca_setup(), qca_read_soc_version() fails and goto out, then calls > >>> serdev_device_close() to close tty normally. And then call serdev_device_open() > >>> to retry. > >>> 2) serdev_device_open() fails due to tty_init_dev() fails, then tty gets > >>> released, which means this time the tty has been freed succesfully. > >>> 3) Return back to upper func hci_dev_open_sync(), > >>> hdev->close()(hci_uart_close) is called. And hci_uart_close calls > >>> hci_uart_flush() and serdev_device_close(). serdev_device_close() tries to close > >>> tty again, it's calltrace is serdev_device_close()->ttyport_close()->tty_lock(), > >>> tty_unlock(), tty_release_struct(). The four funcs hci_uart_flush(), tty_lock(), > >>> tty_unlock(), tty_release_struct() read tty pointer's value, which is invalid > >>> and causes crash. > >>> > >> > >> Hi Greg, could you please take some time to review my reply? > > > > I am not disputing the fact that there is a bug here, I'm just saying > > that you can't test for a value and then act on it without a lock > > protecting that action because the value can be changed right after you > > test for it. > > > > You might not see this in your testing, as you have narrowed the window > > that the value can change, but you have not solved the issue properly, > > right? > > > > thanks, > > > > greg k-h > > >From my analysis, I think there is only one thread operating the tty of > Bluetooth. So the case of tty changed after check will not happen. This might be the case for your specific system, but how can you guarantee this is the case for all serdev users? Hint, the way you _CAN_ guarantee it is to use a lock... {sigh} greg k-h
On Fri, May 30, 2025 at 04:34:49PM +0800, Xin Chen wrote: > > > On 5/29/2025 5:41 PM, Greg Kroah-Hartman wrote: > > On Thu, May 29, 2025 at 11:07:25AM +0200, Greg Kroah-Hartman wrote: > >> On Fri, May 23, 2025 at 10:52:27AM +0800, Xin Chen wrote: > >>> > >>> > >>> On 5/14/2025 5:14 PM, Xin Chen wrote: > >>>> > >>>> > >>>> On 5/8/2025 5:41 PM, Greg Kroah-Hartman wrote: > >>>>> On Thu, May 08, 2025 at 05:29:18PM +0800, Xin Chen wrote: > >>>>>> > >>>>>> On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: > >>>>>>> On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: > >>>>>>>> When ttyport_open() fails to initialize a tty device, serport->tty is not > >>>>>>>> --- a/drivers/tty/serdev/serdev-ttyport.c > >>>>>>>> +++ b/drivers/tty/serdev/serdev-ttyport.c > >>>>>>>> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) > >>>>>>>> { > >>>>>>>> struct serport *serport = serdev_controller_get_drvdata(ctrl); > >>>>>>>> struct tty_struct *tty = serport->tty; > >>>>>>>> + if (!tty) { > >>>>>>>> + dev_err(&ctrl->dev, "tty is null\n"); > >>>>>>>> + return; > >>>>>>>> + } > >>>>>>> > >>>>>>> What prevents tty from going NULL right after you just checked this? > >>>>>> > >>>>>> First sorry for reply so late for I have a long statutory holidays. > >>>>>> Maybe I don't get your point. From my side, there is nothing to prevent it. > >>>>>> Check here is to avoid code go on if tty is NULL. > >>>>> > >>>>> Yes, but the problem is, serport->tty could change to be NULL right > >>>>> after you check it, so you have not removed the real race that can > >>>>> happen here. There is no lock, so by adding this check you are only > >>>>> reducing the risk of the problem happening, not actually fixing the > >>>>> issue so that it will never happen. > >>>>> > >>>>> Please fix it so that this can never happen. > >>>>> > >>>> > >>>> Actually I have never thought the race condition issue since the crash I met is > >>>> not caused by race condition. It's caused due to Bluetooth driver call > >>>> ttyport_close() after ttyport_open() failed. This two action happen one after > >>>> another in one thread and it seems impossible to have race condition. And with > >>>> my fix the crash doesn't happen again in several test of same case. > >>>> > >>>> Let me introduce the complete process for you: > >>>> 1) hci_dev_open_sync()-> > >>>> hci_dev_init_sync()->hci_dev_setup_sync()->hdev->setup()(hci_uart_setup)->qca_setup(), > >>>> here in qca_setup(), qca_read_soc_version() fails and goto out, then calls > >>>> serdev_device_close() to close tty normally. And then call serdev_device_open() > >>>> to retry. > > > > Wait, what? Why is qca_read_soc_version() failing? > > Actually I have not root cause why qca_read_soc_version() fails of > __hci_cmd_sync_ev(). It may be relative to FW issue. Please start there, don't you want to know why things are failing? > > Why are you retrying multiple times until either you run out of attempts? > > This is a retry mechanism. I find the reason in the change commit message > "Currently driver only retries to download FW if FW downloading > is failed. Sometimes observed command timeout for version request > command, if this happen on some platforms during boot time, then > a reboot is needed to turn ON BT. Instead to avoid a reboot, now > extended retry logic for version request command too." > > > Why are you closing the port and then opening it again right away? > > This is a retry mechanism as above said. Do you mean there should be a gap > between close and open? The change owner maybe don't think about this issue. Why are you calling close/open at all? Why does that do anything? Doesn't that feel wrong? Again, please root-cause the failure, don't try to paper over it by loads of looping and odd open/close attempts that are not understood and seem to actually cause other types of crashes :) > > What close/open pair seems totally unnecessary, why do that at all? > > > > If I read that function qca_setup(), it can NEVER detect if a failure > > really happened (i.e. if it does run out of retries, you just plow on > > and keep going and keep on registering things and THEN return an error > > for some reason. > > > > In other words, the error handling in qca_setup() is very suspect, why > > not fix all of that up first? > > > > qca_read_soc_version() in qca_setup() can detect whether the hci_dev is set up > successfully. If if fails then a failure happens. > You mean I should fix why qca_read_soc_version() fails? Yes, why wouldn't you want to do that? thanks, greg k-h
On 5/31/2025 3:20 PM, Greg Kroah-Hartman wrote: > On Fri, May 30, 2025 at 04:34:49PM +0800, Xin Chen wrote: >> >> >> On 5/29/2025 5:41 PM, Greg Kroah-Hartman wrote: >>> On Thu, May 29, 2025 at 11:07:25AM +0200, Greg Kroah-Hartman wrote: >>>> On Fri, May 23, 2025 at 10:52:27AM +0800, Xin Chen wrote: >>>>> >>>>> >>>>> On 5/14/2025 5:14 PM, Xin Chen wrote: >>>>>> >>>>>> >>>>>> On 5/8/2025 5:41 PM, Greg Kroah-Hartman wrote: >>>>>>> On Thu, May 08, 2025 at 05:29:18PM +0800, Xin Chen wrote: >>>>>>>> >>>>>>>> On 4/30/2025 7:40 PM, Greg Kroah-Hartman wrote: >>>>>>>>> On Wed, Apr 30, 2025 at 07:16:17PM +0800, Xin Chen wrote: >>>>>>>>>> When ttyport_open() fails to initialize a tty device, serport->tty is not >>>>>>>>>> --- a/drivers/tty/serdev/serdev-ttyport.c >>>>>>>>>> +++ b/drivers/tty/serdev/serdev-ttyport.c >>>>>>>>>> @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) >>>>>>>>>> { >>>>>>>>>> struct serport *serport = serdev_controller_get_drvdata(ctrl); >>>>>>>>>> struct tty_struct *tty = serport->tty; >>>>>>>>>> + if (!tty) { >>>>>>>>>> + dev_err(&ctrl->dev, "tty is null\n"); >>>>>>>>>> + return; >>>>>>>>>> + } >>>>>>>>> >>>>>>>>> What prevents tty from going NULL right after you just checked this? >>>>>>>> >>>>>>>> First sorry for reply so late for I have a long statutory holidays. >>>>>>>> Maybe I don't get your point. From my side, there is nothing to prevent it. >>>>>>>> Check here is to avoid code go on if tty is NULL. >>>>>>> >>>>>>> Yes, but the problem is, serport->tty could change to be NULL right >>>>>>> after you check it, so you have not removed the real race that can >>>>>>> happen here. There is no lock, so by adding this check you are only >>>>>>> reducing the risk of the problem happening, not actually fixing the >>>>>>> issue so that it will never happen. >>>>>>> >>>>>>> Please fix it so that this can never happen. >>>>>>> >>>>>> >>>>>> Actually I have never thought the race condition issue since the crash I met is >>>>>> not caused by race condition. It's caused due to Bluetooth driver call >>>>>> ttyport_close() after ttyport_open() failed. This two action happen one after >>>>>> another in one thread and it seems impossible to have race condition. And with >>>>>> my fix the crash doesn't happen again in several test of same case. >>>>>> >>>>>> Let me introduce the complete process for you: >>>>>> 1) hci_dev_open_sync()-> >>>>>> hci_dev_init_sync()->hci_dev_setup_sync()->hdev->setup()(hci_uart_setup)->qca_setup(), >>>>>> here in qca_setup(), qca_read_soc_version() fails and goto out, then calls >>>>>> serdev_device_close() to close tty normally. And then call serdev_device_open() >>>>>> to retry. >>> >>> Wait, what? Why is qca_read_soc_version() failing? >> >> Actually I have not root cause why qca_read_soc_version() fails of >> __hci_cmd_sync_ev(). It may be relative to FW issue. > > Please start there, don't you want to know why things are failing? > >>> Why are you retrying multiple times until either you run out of attempts? >> >> This is a retry mechanism. I find the reason in the change commit message >> "Currently driver only retries to download FW if FW downloading >> is failed. Sometimes observed command timeout for version request >> command, if this happen on some platforms during boot time, then >> a reboot is needed to turn ON BT. Instead to avoid a reboot, now >> extended retry logic for version request command too." >> >>> Why are you closing the port and then opening it again right away? >> >> This is a retry mechanism as above said. Do you mean there should be a gap >> between close and open? The change owner maybe don't think about this issue. > > Why are you calling close/open at all? Why does that do anything? > Doesn't that feel wrong? > > Again, please root-cause the failure, don't try to paper over it by > loads of looping and odd open/close attempts that are not understood and > seem to actually cause other types of crashes :) > >>> What close/open pair seems totally unnecessary, why do that at all? >>> >>> If I read that function qca_setup(), it can NEVER detect if a failure >>> really happened (i.e. if it does run out of retries, you just plow on >>> and keep going and keep on registering things and THEN return an error >>> for some reason. >>> >>> In other words, the error handling in qca_setup() is very suspect, why >>> not fix all of that up first? >>> >> >> qca_read_soc_version() in qca_setup() can detect whether the hci_dev is set up >> successfully. If if fails then a failure happens. >> You mean I should fix why qca_read_soc_version() fails? > > Yes, why wouldn't you want to do that? > Yeah you are right. I will try to root cause the qca read version issue and fix it first. Thanks very much! Xin
diff --git a/drivers/tty/serdev/serdev-ttyport.c b/drivers/tty/serdev/serdev-ttyport.c index 3d7ae7fa5018..287908f2009b 100644 --- a/drivers/tty/serdev/serdev-ttyport.c +++ b/drivers/tty/serdev/serdev-ttyport.c @@ -88,6 +88,10 @@ static void ttyport_write_flush(struct serdev_controller *ctrl) { struct serport *serport = serdev_controller_get_drvdata(ctrl); struct tty_struct *tty = serport->tty; + if (!tty) { + dev_err(&ctrl->dev, "tty is null\n"); + return; + } tty_driver_flush_buffer(tty); } @@ -108,8 +112,10 @@ static int ttyport_open(struct serdev_controller *ctrl) int ret; tty = tty_init_dev(serport->tty_drv, serport->tty_idx); - if (IS_ERR(tty)) + if (IS_ERR(tty)) { + serport->tty = NULL; return PTR_ERR(tty); + } serport->tty = tty; if (!tty->ops->open || !tty->ops->close) { @@ -156,6 +162,11 @@ static void ttyport_close(struct serdev_controller *ctrl) clear_bit(SERPORT_ACTIVE, &serport->flags); + if (!tty) { + dev_err(&ctrl->dev, "tty is null\n"); + return; + } + tty_lock(tty); if (tty->ops->close) tty->ops->close(tty, NULL);