Message ID | 20250515185227.1507363-1-royluo@google.com |
---|---|
Headers | show |
Series | Introduce XHCI_FULL_RESET_ON_REMOVE quirk for DWC3 | expand |
Hi Roy, Michał, On Fri, May 16, 2025, Roy Luo wrote: > > There's no state 0. Checking against that is odd. Couldn't we just use > > xhci_handshake() equivalent instead? > > Ok, I will change it in the next version. > > On Thu, May 15, 2025 at 11:33 PM Michał Pecio <michal.pecio@gmail.com> wrote: > > > > On Thu, 15 May 2025 23:42:50 +0000, Thinh Nguyen wrote: > > > In any case, this is basically a revert of this change: > > > 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() > > > helper") > > > > > > Can't we just revert or fix the above patch that causes a regression? > > > > Also note that 6ccb83d6c497 claimed to fix actual problems, so > > disabling it on selected hardware could bring the old bug back: > > > > > In some situations where xhci removal happens parallel to > > > xhci_handshake, we encounter a scenario where the xhci_handshake > > > can't succeed, and it polls until timeout. > > > > > > If xhci_handshake runs until timeout it can on some platforms result > > > in a long wait which might lead to a watchdog timeout. > > On top of this, xhci_handshake_check_state(XHCI_STATE_REMOVING) > is also used elsewhere like xhci_abort_cmd_ring(), so a simple revert is > off the table. Commit 6ccb83d6c497 did not specify which platform and > in what circumstance would xhci handshake timeout, adding a quirk for > DWC3 seems to be the better option here. > Regarding the commit 6ccb83d6c497, I'm assuming Udipto made the change for Qcom platforms. Hi @Udipto, if you're reading this, please confirm. Many of the Qcom platforms are using dwc3 controller. The change you made here are affecting all the dwc3 DRD controllers, which has a good chance to also impact the Qcom platforms. > > > > But on the other hand, xhci_handshake() has long timeouts because > > the handshakes themselves can take a surprisingly long time (and > > sometimes still succeed), so any reliance on handshake completing > > before timeout is frankly a bug in itself. > > This patch simply honors the contract between the software and > hardware, allowing the handshake to complete. It doesn't assume the > handshake will finish on time. If it times out, then it times out and > returns a failure. > As Michał pointed out, disregarding the xhci handshake timeout is not proper. The change 6ccb83d6c497 seems to workaround some different watchdog warning timeout instead of resolving the actual issue. The watchdog timeout should not be less than the handshake timeout here. BR, Thinh
On Fri, May 16, 2025 at 4:38 PM Thinh Nguyen <Thinh.Nguyen@synopsys.com> wrote: > > Hi Roy, Michał, > > On Fri, May 16, 2025, Roy Luo wrote: > > > There's no state 0. Checking against that is odd. Couldn't we just use > > > xhci_handshake() equivalent instead? > > > > Ok, I will change it in the next version. > > > > On Thu, May 15, 2025 at 11:33 PM Michał Pecio <michal.pecio@gmail.com> wrote: > > > > > > On Thu, 15 May 2025 23:42:50 +0000, Thinh Nguyen wrote: > > > > In any case, this is basically a revert of this change: > > > > 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() > > > > helper") > > > > > > > > Can't we just revert or fix the above patch that causes a regression? > > > > > > Also note that 6ccb83d6c497 claimed to fix actual problems, so > > > disabling it on selected hardware could bring the old bug back: > > > > > > > In some situations where xhci removal happens parallel to > > > > xhci_handshake, we encounter a scenario where the xhci_handshake > > > > can't succeed, and it polls until timeout. > > > > > > > > If xhci_handshake runs until timeout it can on some platforms result > > > > in a long wait which might lead to a watchdog timeout. > > > > On top of this, xhci_handshake_check_state(XHCI_STATE_REMOVING) > > is also used elsewhere like xhci_abort_cmd_ring(), so a simple revert is > > off the table. Commit 6ccb83d6c497 did not specify which platform and > > in what circumstance would xhci handshake timeout, adding a quirk for > > DWC3 seems to be the better option here. > > > > Regarding the commit 6ccb83d6c497, I'm assuming Udipto made the change > for Qcom platforms. Hi @Udipto, if you're reading this, please confirm. > > Many of the Qcom platforms are using dwc3 controller. The change you > made here are affecting all the dwc3 DRD controllers, which has a good > chance to also impact the Qcom platforms. > > > > > > > But on the other hand, xhci_handshake() has long timeouts because > > > the handshakes themselves can take a surprisingly long time (and > > > sometimes still succeed), so any reliance on handshake completing > > > before timeout is frankly a bug in itself. > > > > This patch simply honors the contract between the software and > > hardware, allowing the handshake to complete. It doesn't assume the > > handshake will finish on time. If it times out, then it times out and > > returns a failure. > > > > As Michał pointed out, disregarding the xhci handshake timeout is not > proper. The change 6ccb83d6c497 seems to workaround some different > watchdog warning timeout instead of resolving the actual issue. The > watchdog timeout should not be less than the handshake timeout here. > This makes sense, I will send out a revert of 6ccb83d6c497 then. Thanks, Roy
On Fri, 16 May 2025 23:38:33 +0000, Thinh Nguyen wrote: > > > But on the other hand, xhci_handshake() has long timeouts because > > > the handshakes themselves can take a surprisingly long time (and > > > sometimes still succeed), so any reliance on handshake completing > > > before timeout is frankly a bug in itself. > > > > This patch simply honors the contract between the software and > > hardware, allowing the handshake to complete. It doesn't assume the > > handshake will finish on time. If it times out, then it times out > > and returns a failure. > > > > As Michał pointed out, disregarding the xhci handshake timeout is not > proper. The change 6ccb83d6c497 seems to workaround some different > watchdog warning timeout instead of resolving the actual issue. The > watchdog timeout should not be less than the handshake timeout here. There is certainly one real problem, which has likely existed since forever: some of those handshakes cause system-wide freezes. I haven't investigated it thoroughly, but I suspect the main culprit is the one in xhci_abort_cmd_ring(), which holds the spinlock for a few seconds if the xHC is particularly slow to complete the abort. This probably causes xhci_irq() to spin and disrupt other IRQs. I encounter it sometimes with ASMedia controllers, but I guess anyone can simulate it by inserting artificial delays near xhci_handshake(). Regards, Michal