diff mbox series

[RFT] xhci: make xhci_handshake timeout for xhci_reset() adjustable

Message ID 20220217135643.1321889-1-mathias.nyman@linux.intel.com
State Superseded
Headers show
Series [RFT] xhci: make xhci_handshake timeout for xhci_reset() adjustable | expand

Commit Message

Mathias Nyman Feb. 17, 2022, 1:56 p.m. UTC
xhci_reset() timeout was increased from 250ms to 10 seconds in order to
give Renesas 720201 xHC enough time to get ready in probe.

xhci_reset() is called with interrupts disabled in other places, and
waiting for 10 seconds there is not acceptable.

Add a timeout parameter to xhci_reset(), and adjust it back to 250ms
when called from xhci_stop() or xhci_shutdown() where interrupts are
disabled, and successful reset isn't that critical.

Additionally change the signed integer timeout parameter in
xhci_handshake() to a u64 to match the timeout value we pass to
readl_poll_timeout_atomic()

Reported-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reported-by: Pavan Kondeti <quic_pkondeti@quicinc.com>
Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-hub.c |  2 +-
 drivers/usb/host/xhci-mem.c |  2 +-
 drivers/usb/host/xhci.c     | 20 +++++++++-----------
 drivers/usb/host/xhci.h     |  7 +++++--
 4 files changed, 16 insertions(+), 15 deletions(-)

Comments

Udipto Goswami Feb. 24, 2022, 8:44 a.m. UTC | #1
Hi Pavan, Mathias,

we have tested the patch in the testing environment where initially we 
were hitting the issue. We don't see any issue after including this.

On 18-02-2022 03:11 pm, Pavan Kondeti wrote:
> On Thu, Feb 17, 2022 at 03:56:43PM +0200, Mathias Nyman wrote:
>> xhci_reset() timeout was increased from 250ms to 10 seconds in order to
>> give Renesas 720201 xHC enough time to get ready in probe.
>>
>> xhci_reset() is called with interrupts disabled in other places, and
>> waiting for 10 seconds there is not acceptable.
>>
>> Add a timeout parameter to xhci_reset(), and adjust it back to 250ms
>> when called from xhci_stop() or xhci_shutdown() where interrupts are
>> disabled, and successful reset isn't that critical.
>>
>> Additionally change the signed integer timeout parameter in
>> xhci_handshake() to a u64 to match the timeout value we pass to
>> readl_poll_timeout_atomic()
>>
>> Reported-by: Sergey Shtylyov <s.shtylyov@omp.ru>
>> Reported-by: Pavan Kondeti <quic_pkondeti@quicinc.com>
>> Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
>> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
> We have tested the patch and found no issues. Since the issue happens
> very rarely, we have included in our builds for the wider testing.
>
> Thanks,
> Pavan
Jack Pham March 1, 2022, 1:49 a.m. UTC | #2
Hi Mathias,

On Thu, Feb 17, 2022 at 03:56:43PM +0200, Mathias Nyman wrote:
> xhci_reset() timeout was increased from 250ms to 10 seconds in order to
> give Renesas 720201 xHC enough time to get ready in probe.

This suggests that the only place we really need the long timeout is
in xhci_gen_setup().

> @@ -1163,7 +1161,7 @@ int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
>  		xhci_dbg(xhci, "Stop HCD\n");
>  		xhci_halt(xhci);
>  		xhci_zero_64b_regs(xhci);
> -		retval = xhci_reset(xhci);
> +		retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);

Since preemption is disabled here (spin_lock_irq() is called near the
start of this function), shouldn't we also limit this to 250ms here in
xhci_resume() as well?

>  		spin_unlock_irq(&xhci->lock);
>  		if (retval)
>  			return retval;
> @@ -5297,7 +5295,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks)
>  
>  	xhci_dbg(xhci, "Resetting HCD\n");
>  	/* Reset the internal HC memory state and registers. */
> -	retval = xhci_reset(xhci);
> +	retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
>  	if (retval)
>  		return retval;
>  	xhci_dbg(xhci, "Reset complete\n");

Thanks,
Jack
Pavan Kondeti March 1, 2022, 4:03 a.m. UTC | #3
On Mon, Feb 28, 2022 at 05:49:49PM -0800, Jack Pham wrote:
> Hi Mathias,
> 
> On Thu, Feb 17, 2022 at 03:56:43PM +0200, Mathias Nyman wrote:
> > xhci_reset() timeout was increased from 250ms to 10 seconds in order to
> > give Renesas 720201 xHC enough time to get ready in probe.
> 
> This suggests that the only place we really need the long timeout is
> in xhci_gen_setup().
> 
> > @@ -1163,7 +1161,7 @@ int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
> >  		xhci_dbg(xhci, "Stop HCD\n");
> >  		xhci_halt(xhci);
> >  		xhci_zero_64b_regs(xhci);
> > -		retval = xhci_reset(xhci);
> > +		retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
> 
> Since preemption is disabled here (spin_lock_irq() is called near the
> start of this function), shouldn't we also limit this to 250ms here in
> xhci_resume() as well?
> 
The rationale of decreasing the timeout to 250 in certain places is based
on the criticality of the operation but not on the preemption/irq state.
Since xHC reset is critical in startup and resume, the 10 seconds timeout
is enforced so that we don't break Renesas 720201 xHC.

Since all of our internl test reports indicate that the timeout is happening
from stop hcd, this patch is helping.

Thanks,
Pavan
Mathias Nyman March 1, 2022, 8:47 a.m. UTC | #4
On 1.3.2022 6.03, Pavan Kondeti wrote:
> On Mon, Feb 28, 2022 at 05:49:49PM -0800, Jack Pham wrote:
>> Hi Mathias,
>>
>> On Thu, Feb 17, 2022 at 03:56:43PM +0200, Mathias Nyman wrote:
>>> xhci_reset() timeout was increased from 250ms to 10 seconds in order to
>>> give Renesas 720201 xHC enough time to get ready in probe.
>>
>> This suggests that the only place we really need the long timeout is
>> in xhci_gen_setup().
>>
>>> @@ -1163,7 +1161,7 @@ int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
>>>  		xhci_dbg(xhci, "Stop HCD\n");
>>>  		xhci_halt(xhci);
>>>  		xhci_zero_64b_regs(xhci);
>>> -		retval = xhci_reset(xhci);
>>> +		retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
>>
>> Since preemption is disabled here (spin_lock_irq() is called near the
>> start of this function), shouldn't we also limit this to 250ms here in
>> xhci_resume() as well?
>>> The rationale of decreasing the timeout to 250 in certain places is based
> on the criticality of the operation but not on the preemption/irq state.
> Since xHC reset is critical in startup and resume, the 10 seconds timeout
> is enforced so that we don't break Renesas 720201 xHC.
> 
> Since all of our internl test reports indicate that the timeout is happening
> from stop hcd, this patch is helping.
> 

This was pretty much my reasoning as well.
I could add a comment about this to the commit message

In addition we want a targeted fix for a real world issue that we can send to
stable without changing too much, risking regressions.

I also think we should focus more on fixing the locking (preemption/irq state)
around xhci_reset() in xhci_resume() than tuning the timeout, but this needs more
thought and should be a separate patch for later.

Additionally I guess xhci_reset() is more likely to fail in xhci_stop() and
xhci_shutdown() as power or clocks for xHC may be disabled, or entire xHC removed from the
bus by then.   

Thanks
-Mathias
Pavan Kondeti March 2, 2022, 3:23 a.m. UTC | #5
On Tue, Mar 01, 2022 at 10:47:36AM +0200, Mathias Nyman wrote:
> On 1.3.2022 6.03, Pavan Kondeti wrote:
> > On Mon, Feb 28, 2022 at 05:49:49PM -0800, Jack Pham wrote:
> >> Hi Mathias,
> >>
> >> On Thu, Feb 17, 2022 at 03:56:43PM +0200, Mathias Nyman wrote:
> >>> xhci_reset() timeout was increased from 250ms to 10 seconds in order to
> >>> give Renesas 720201 xHC enough time to get ready in probe.
> >>
> >> This suggests that the only place we really need the long timeout is
> >> in xhci_gen_setup().
> >>
> >>> @@ -1163,7 +1161,7 @@ int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
> >>>  		xhci_dbg(xhci, "Stop HCD\n");
> >>>  		xhci_halt(xhci);
> >>>  		xhci_zero_64b_regs(xhci);
> >>> -		retval = xhci_reset(xhci);
> >>> +		retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
> >>
> >> Since preemption is disabled here (spin_lock_irq() is called near the
> >> start of this function), shouldn't we also limit this to 250ms here in
> >> xhci_resume() as well?
> >>> The rationale of decreasing the timeout to 250 in certain places is based
> > on the criticality of the operation but not on the preemption/irq state.
> > Since xHC reset is critical in startup and resume, the 10 seconds timeout
> > is enforced so that we don't break Renesas 720201 xHC.
> > 
> > Since all of our internl test reports indicate that the timeout is happening
> > from stop hcd, this patch is helping.
> > 
> 
> This was pretty much my reasoning as well.
> I could add a comment about this to the commit message
> 
> In addition we want a targeted fix for a real world issue that we can send to
> stable without changing too much, risking regressions.
> 
Makes complete sense. 

Greg,

Do you plan to include this patch?

Thanks,
Pavan
Greg KH March 2, 2022, 7:49 a.m. UTC | #6
On Wed, Mar 02, 2022 at 08:53:06AM +0530, Pavan Kondeti wrote:
> On Tue, Mar 01, 2022 at 10:47:36AM +0200, Mathias Nyman wrote:
> > On 1.3.2022 6.03, Pavan Kondeti wrote:
> > > On Mon, Feb 28, 2022 at 05:49:49PM -0800, Jack Pham wrote:
> > >> Hi Mathias,
> > >>
> > >> On Thu, Feb 17, 2022 at 03:56:43PM +0200, Mathias Nyman wrote:
> > >>> xhci_reset() timeout was increased from 250ms to 10 seconds in order to
> > >>> give Renesas 720201 xHC enough time to get ready in probe.
> > >>
> > >> This suggests that the only place we really need the long timeout is
> > >> in xhci_gen_setup().
> > >>
> > >>> @@ -1163,7 +1161,7 @@ int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
> > >>>  		xhci_dbg(xhci, "Stop HCD\n");
> > >>>  		xhci_halt(xhci);
> > >>>  		xhci_zero_64b_regs(xhci);
> > >>> -		retval = xhci_reset(xhci);
> > >>> +		retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
> > >>
> > >> Since preemption is disabled here (spin_lock_irq() is called near the
> > >> start of this function), shouldn't we also limit this to 250ms here in
> > >> xhci_resume() as well?
> > >>> The rationale of decreasing the timeout to 250 in certain places is based
> > > on the criticality of the operation but not on the preemption/irq state.
> > > Since xHC reset is critical in startup and resume, the 10 seconds timeout
> > > is enforced so that we don't break Renesas 720201 xHC.
> > > 
> > > Since all of our internl test reports indicate that the timeout is happening
> > > from stop hcd, this patch is helping.
> > > 
> > 
> > This was pretty much my reasoning as well.
> > I could add a comment about this to the commit message
> > 
> > In addition we want a targeted fix for a real world issue that we can send to
> > stable without changing too much, risking regressions.
> > 
> Makes complete sense. 
> 
> Greg,
> 
> Do you plan to include this patch?

I need it to be forwarded to me by the maintainer with their
signed-off-by, like any other xhci core patch :)

What is the rush here?  Your hardware is obviously broken somehow if you
are hitting this problem just now.  Why not fix the root cause here,
this patch should only affect rare situations, right?  Are you having
this problem a lot?

thanks,

greg k-h
Pavan Kondeti March 2, 2022, 7:57 a.m. UTC | #7
Hi Greg,

On Wed, Mar 02, 2022 at 08:49:08AM +0100, Greg Kroah-Hartman wrote:
> On Wed, Mar 02, 2022 at 08:53:06AM +0530, Pavan Kondeti wrote:
> > On Tue, Mar 01, 2022 at 10:47:36AM +0200, Mathias Nyman wrote:
> > > On 1.3.2022 6.03, Pavan Kondeti wrote:
> > > > On Mon, Feb 28, 2022 at 05:49:49PM -0800, Jack Pham wrote:
> > > >> Hi Mathias,
> > > >>
> > > >> On Thu, Feb 17, 2022 at 03:56:43PM +0200, Mathias Nyman wrote:
> > > >>> xhci_reset() timeout was increased from 250ms to 10 seconds in order to
> > > >>> give Renesas 720201 xHC enough time to get ready in probe.
> > > >>
> > > >> This suggests that the only place we really need the long timeout is
> > > >> in xhci_gen_setup().
> > > >>
> > > >>> @@ -1163,7 +1161,7 @@ int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
> > > >>>  		xhci_dbg(xhci, "Stop HCD\n");
> > > >>>  		xhci_halt(xhci);
> > > >>>  		xhci_zero_64b_regs(xhci);
> > > >>> -		retval = xhci_reset(xhci);
> > > >>> +		retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
> > > >>
> > > >> Since preemption is disabled here (spin_lock_irq() is called near the
> > > >> start of this function), shouldn't we also limit this to 250ms here in
> > > >> xhci_resume() as well?
> > > >>> The rationale of decreasing the timeout to 250 in certain places is based
> > > > on the criticality of the operation but not on the preemption/irq state.
> > > > Since xHC reset is critical in startup and resume, the 10 seconds timeout
> > > > is enforced so that we don't break Renesas 720201 xHC.
> > > > 
> > > > Since all of our internl test reports indicate that the timeout is happening
> > > > from stop hcd, this patch is helping.
> > > > 
> > > 
> > > This was pretty much my reasoning as well.
> > > I could add a comment about this to the commit message
> > > 
> > > In addition we want a targeted fix for a real world issue that we can send to
> > > stable without changing too much, risking regressions.
> > > 
> > Makes complete sense. 
> > 
> > Greg,
> > 
> > Do you plan to include this patch?
> 
> I need it to be forwarded to me by the maintainer with their
> signed-off-by, like any other xhci core patch :)

Got it. I will wait for Mathia's final patch.

> 
> What is the rush here?  Your hardware is obviously broken somehow if you
> are hitting this problem just now.  Why not fix the root cause here,
> this patch should only affect rare situations, right?  Are you having
> this problem a lot?
> 
Sure, we will work with hardware designers to understand the root cause but
it is very unlikely that this can be fixed with a firmware/control setting
change. so this workaround would atleast solve the stability issues. It
happens rarely but we test with lots of devices in our farm and it shows up
eventually on one or the other device in a day or so. Why not include a
harmless patch and improve user experience is what driving us to request
Mathias and you to help.

Thanks,
Pavan
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index df3522dab31b..bb4f01ce90e3 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -762,7 +762,7 @@  static int xhci_exit_test_mode(struct xhci_hcd *xhci)
 	}
 	pm_runtime_allow(xhci_to_hcd(xhci)->self.controller);
 	xhci->test_mode = 0;
-	return xhci_reset(xhci);
+	return xhci_reset(xhci, XHCI_RESET_SHORT_USEC);
 }
 
 void xhci_set_link_state(struct xhci_hcd *xhci, struct xhci_port *port,
diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index f8c2b6c79543..803c89700ed3 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -2575,7 +2575,7 @@  int xhci_mem_init(struct xhci_hcd *xhci, gfp_t flags)
 
 fail:
 	xhci_halt(xhci);
-	xhci_reset(xhci);
+	xhci_reset(xhci, XHCI_RESET_SHORT_USEC);
 	xhci_mem_cleanup(xhci);
 	return -ENOMEM;
 }
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 33bae434aa94..2567176b9bdb 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -65,7 +65,7 @@  static bool td_on_ring(struct xhci_td *td, struct xhci_ring *ring)
  * handshake done).  There are two failure modes:  "usec" have passed (major
  * hardware flakeout), or the register reads as all-ones (hardware removed).
  */
-int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, int usec)
+int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us)
 {
 	u32	result;
 	int	ret;
@@ -73,7 +73,7 @@  int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, int usec)
 	ret = readl_poll_timeout_atomic(ptr, result,
 					(result & mask) == done ||
 					result == U32_MAX,
-					1, usec);
+					1, timeout_us);
 	if (result == U32_MAX)		/* card removed */
 		return -ENODEV;
 
@@ -162,7 +162,7 @@  int xhci_start(struct xhci_hcd *xhci)
  * Transactions will be terminated immediately, and operational registers
  * will be set to their defaults.
  */
-int xhci_reset(struct xhci_hcd *xhci)
+int xhci_reset(struct xhci_hcd *xhci, u64 timeout_us)
 {
 	u32 command;
 	u32 state;
@@ -195,8 +195,7 @@  int xhci_reset(struct xhci_hcd *xhci)
 	if (xhci->quirks & XHCI_INTEL_HOST)
 		udelay(1000);
 
-	ret = xhci_handshake(&xhci->op_regs->command,
-			CMD_RESET, 0, 10 * 1000 * 1000);
+	ret = xhci_handshake(&xhci->op_regs->command, CMD_RESET, 0, timeout_us);
 	if (ret)
 		return ret;
 
@@ -209,8 +208,7 @@  int xhci_reset(struct xhci_hcd *xhci)
 	 * xHCI cannot write to any doorbells or operational registers other
 	 * than status until the "Controller Not Ready" flag is cleared.
 	 */
-	ret = xhci_handshake(&xhci->op_regs->status,
-			STS_CNR, 0, 10 * 1000 * 1000);
+	ret = xhci_handshake(&xhci->op_regs->status, STS_CNR, 0, timeout_us);
 
 	xhci->usb2_rhub.bus_state.port_c_suspend = 0;
 	xhci->usb2_rhub.bus_state.suspended_ports = 0;
@@ -731,7 +729,7 @@  static void xhci_stop(struct usb_hcd *hcd)
 	xhci->xhc_state |= XHCI_STATE_HALTED;
 	xhci->cmd_ring_state = CMD_RING_STATE_STOPPED;
 	xhci_halt(xhci);
-	xhci_reset(xhci);
+	xhci_reset(xhci, XHCI_RESET_SHORT_USEC);
 	spin_unlock_irq(&xhci->lock);
 
 	xhci_cleanup_msix(xhci);
@@ -784,7 +782,7 @@  void xhci_shutdown(struct usb_hcd *hcd)
 	xhci_halt(xhci);
 	/* Workaround for spurious wakeups at shutdown with HSW */
 	if (xhci->quirks & XHCI_SPURIOUS_WAKEUP)
-		xhci_reset(xhci);
+		xhci_reset(xhci, XHCI_RESET_SHORT_USEC);
 	spin_unlock_irq(&xhci->lock);
 
 	xhci_cleanup_msix(xhci);
@@ -1163,7 +1161,7 @@  int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
 		xhci_dbg(xhci, "Stop HCD\n");
 		xhci_halt(xhci);
 		xhci_zero_64b_regs(xhci);
-		retval = xhci_reset(xhci);
+		retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
 		spin_unlock_irq(&xhci->lock);
 		if (retval)
 			return retval;
@@ -5297,7 +5295,7 @@  int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks)
 
 	xhci_dbg(xhci, "Resetting HCD\n");
 	/* Reset the internal HC memory state and registers. */
-	retval = xhci_reset(xhci);
+	retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
 	if (retval)
 		return retval;
 	xhci_dbg(xhci, "Reset complete\n");
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 8a0026ee9524..fce32f8ea9d0 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -229,6 +229,9 @@  struct xhci_op_regs {
 #define CMD_ETE		(1 << 14)
 /* bits 15:31 are reserved (and should be preserved on writes). */
 
+#define XHCI_RESET_LONG_USEC		(10 * 1000 * 1000)
+#define XHCI_RESET_SHORT_USEC		(250 * 1000)
+
 /* IMAN - Interrupt Management Register */
 #define IMAN_IE		(1 << 1)
 #define IMAN_IP		(1 << 0)
@@ -2081,11 +2084,11 @@  void xhci_free_container_ctx(struct xhci_hcd *xhci,
 
 /* xHCI host controller glue */
 typedef void (*xhci_get_quirks_t)(struct device *, struct xhci_hcd *);
-int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, int usec);
+int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us);
 void xhci_quiesce(struct xhci_hcd *xhci);
 int xhci_halt(struct xhci_hcd *xhci);
 int xhci_start(struct xhci_hcd *xhci);
-int xhci_reset(struct xhci_hcd *xhci);
+int xhci_reset(struct xhci_hcd *xhci, u64 timeout_us);
 int xhci_run(struct usb_hcd *hcd);
 int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks);
 void xhci_shutdown(struct usb_hcd *hcd);