mbox series

[v2,0/2] Introduce XHCI_FULL_RESET_ON_REMOVE quirk for DWC3

Message ID 20250515185227.1507363-1-royluo@google.com
Headers show
Series Introduce XHCI_FULL_RESET_ON_REMOVE quirk for DWC3 | expand

Message

Roy Luo May 15, 2025, 6:52 p.m. UTC
Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
helper") introduced an optimization to xhci_reset() during xhci removal,
allowing it to bail out early without waiting for the reset to complete.

This behavior can cause issues on SNPS DWC3 USB controller with dual-role
capability. When the DWC3 controller exits host mode and removes xhci
while a reset is still in progress, and then tries to configure its
hardware for device mode, the ongoing reset leads to register access
issues; specifically, all register reads returns 0. These issues extend
beyond the xhci register space (which is expected during a reset) and
affect the entire DWC3 IP block, causing the DWC3 device mode to
malfunction.

To fix this, introduce XHCI_FULL_RESET_ON_REMOVE quirk andd enable it on
DWC3. This ensures xhci_reset() completes its full handshake before
proceeding.

---
Changes in v2:
- no code change
- add Fixes tag and cc stable kernel
Link to v1: https://lore.kernel.org/r/20250515040207.1253690-1-royluo@google.com/
---

Roy Luo (2):
  xhci: Add a quirk for full reset on removal
  usb: dwc3: Force full reset on xhci removal

 drivers/usb/dwc3/host.c      | 5 ++++-
 drivers/usb/host/xhci-plat.c | 3 +++
 drivers/usb/host/xhci.c      | 8 +++++++-
 drivers/usb/host/xhci.h      | 1 +
 4 files changed, 15 insertions(+), 2 deletions(-)


base-commit: c94d59a126cb9a8d1f71e3e044363d654dcd7af8

Comments

Thinh Nguyen May 16, 2025, 11:38 p.m. UTC | #1
Hi Roy, Michał,

On Fri, May 16, 2025, Roy Luo wrote:
> > There's no state 0. Checking against that is odd. Couldn't we just use
> > xhci_handshake() equivalent instead?
> 
> Ok, I will change it in the next version.
> 
> On Thu, May 15, 2025 at 11:33 PM Michał Pecio <michal.pecio@gmail.com> wrote:
> >
> > On Thu, 15 May 2025 23:42:50 +0000, Thinh Nguyen wrote:
> > > In any case, this is basically a revert of this change:
> > > 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
> > > helper")
> > >
> > > Can't we just revert or fix the above patch that causes a regression?
> >
> > Also note that 6ccb83d6c497 claimed to fix actual problems, so
> > disabling it on selected hardware could bring the old bug back:
> >
> > > In some situations where xhci removal happens parallel to
> > > xhci_handshake, we encounter a scenario where the xhci_handshake
> > > can't succeed, and it polls until timeout.
> > >
> > > If xhci_handshake runs until timeout it can on some platforms result
> > > in a long wait which might lead to a watchdog timeout.
> 
> On top of this, xhci_handshake_check_state(XHCI_STATE_REMOVING)
> is also used elsewhere like xhci_abort_cmd_ring(), so a simple revert is
> off the table. Commit 6ccb83d6c497 did not specify which platform and
> in what circumstance would xhci handshake timeout, adding a quirk for
> DWC3 seems to be the better option here.
> 

Regarding the commit 6ccb83d6c497, I'm assuming Udipto made the change
for Qcom platforms. Hi @Udipto, if you're reading this, please confirm.

Many of the Qcom platforms are using dwc3 controller. The change you
made here are affecting all the dwc3 DRD controllers, which has a good
chance to also impact the Qcom platforms.

> >
> > But on the other hand, xhci_handshake() has long timeouts because
> > the handshakes themselves can take a surprisingly long time (and
> > sometimes still succeed), so any reliance on handshake completing
> > before timeout is frankly a bug in itself.
> 
> This patch simply honors the contract between the software and
> hardware, allowing the handshake to complete. It doesn't assume the
> handshake will finish on time. If it times out, then it times out and
> returns a failure.
> 

As Michał pointed out, disregarding the xhci handshake timeout is not
proper. The change 6ccb83d6c497 seems to workaround some different
watchdog warning timeout instead of resolving the actual issue. The
watchdog timeout should not be less than the handshake timeout here.

BR,
Thinh
Roy Luo May 17, 2025, 12:50 a.m. UTC | #2
On Fri, May 16, 2025 at 4:38 PM Thinh Nguyen <Thinh.Nguyen@synopsys.com> wrote:
>
> Hi Roy, Michał,
>
> On Fri, May 16, 2025, Roy Luo wrote:
> > > There's no state 0. Checking against that is odd. Couldn't we just use
> > > xhci_handshake() equivalent instead?
> >
> > Ok, I will change it in the next version.
> >
> > On Thu, May 15, 2025 at 11:33 PM Michał Pecio <michal.pecio@gmail.com> wrote:
> > >
> > > On Thu, 15 May 2025 23:42:50 +0000, Thinh Nguyen wrote:
> > > > In any case, this is basically a revert of this change:
> > > > 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
> > > > helper")
> > > >
> > > > Can't we just revert or fix the above patch that causes a regression?
> > >
> > > Also note that 6ccb83d6c497 claimed to fix actual problems, so
> > > disabling it on selected hardware could bring the old bug back:
> > >
> > > > In some situations where xhci removal happens parallel to
> > > > xhci_handshake, we encounter a scenario where the xhci_handshake
> > > > can't succeed, and it polls until timeout.
> > > >
> > > > If xhci_handshake runs until timeout it can on some platforms result
> > > > in a long wait which might lead to a watchdog timeout.
> >
> > On top of this, xhci_handshake_check_state(XHCI_STATE_REMOVING)
> > is also used elsewhere like xhci_abort_cmd_ring(), so a simple revert is
> > off the table. Commit 6ccb83d6c497 did not specify which platform and
> > in what circumstance would xhci handshake timeout, adding a quirk for
> > DWC3 seems to be the better option here.
> >
>
> Regarding the commit 6ccb83d6c497, I'm assuming Udipto made the change
> for Qcom platforms. Hi @Udipto, if you're reading this, please confirm.
>
> Many of the Qcom platforms are using dwc3 controller. The change you
> made here are affecting all the dwc3 DRD controllers, which has a good
> chance to also impact the Qcom platforms.
>
> > >
> > > But on the other hand, xhci_handshake() has long timeouts because
> > > the handshakes themselves can take a surprisingly long time (and
> > > sometimes still succeed), so any reliance on handshake completing
> > > before timeout is frankly a bug in itself.
> >
> > This patch simply honors the contract between the software and
> > hardware, allowing the handshake to complete. It doesn't assume the
> > handshake will finish on time. If it times out, then it times out and
> > returns a failure.
> >
>
> As Michał pointed out, disregarding the xhci handshake timeout is not
> proper. The change 6ccb83d6c497 seems to workaround some different
> watchdog warning timeout instead of resolving the actual issue. The
> watchdog timeout should not be less than the handshake timeout here.
>
This makes sense, I will send out a revert of 6ccb83d6c497 then.

Thanks,
Roy
Michał Pecio May 17, 2025, 4:39 a.m. UTC | #3
On Fri, 16 May 2025 23:38:33 +0000, Thinh Nguyen wrote:
> > > But on the other hand, xhci_handshake() has long timeouts because
> > > the handshakes themselves can take a surprisingly long time (and
> > > sometimes still succeed), so any reliance on handshake completing
> > > before timeout is frankly a bug in itself.  
> > 
> > This patch simply honors the contract between the software and
> > hardware, allowing the handshake to complete. It doesn't assume the
> > handshake will finish on time. If it times out, then it times out
> > and returns a failure.
> >   
> 
> As Michał pointed out, disregarding the xhci handshake timeout is not
> proper. The change 6ccb83d6c497 seems to workaround some different
> watchdog warning timeout instead of resolving the actual issue. The
> watchdog timeout should not be less than the handshake timeout here.

There is certainly one real problem, which has likely existed since
forever: some of those handshakes cause system-wide freezes. I haven't
investigated it thoroughly, but I suspect the main culprit is the one
in xhci_abort_cmd_ring(), which holds the spinlock for a few seconds
if the xHC is particularly slow to complete the abort. This probably
causes xhci_irq() to spin and disrupt other IRQs.

I encounter it sometimes with ASMedia controllers, but I guess anyone
can simulate it by inserting artificial delays near xhci_handshake().

Regards,
Michal