diff mbox series

[1/2] usb: xhci: Fix the NEC stop bug workaround

Message ID 20241014211005.07562933@foxbook
State New
Headers show
Series Fix the NEC stop bug workaround | expand

Commit Message

MichaƂ Pecio Oct. 14, 2024, 7:10 p.m. UTC
The NEC uPD720200 has a bug, which prevents reliably stopping
an endpoint shortly after it has been restarted. This usually
happens when a driver kills many URBs in quick succession and
it results in concurrent execution and cancellation of TDs.

This is handled by stopping the endpoint again if in doubt.

This "doubt" turns out to be a problem, because Stop Endpoint
may be queued when the EP is already Stopped (for Set TR Deq
execution, for example) or becomes Stopped concurrently (by
Reset Endpoint, for example). If the EP is truly Stopped, the
command fails and further retries just keep failing forever.

This is easily triggered by modifying uvcvideo to unlink its
isochronous URBs in 100us intervals instead of poisoning them.
Any driver that unlinks URBs asynchronously may trigger this,
and any URB unlink during ongoing halt recovery also can.

Fix the problem by tracking redundant Stop Endpoint commands
which are sure to fail, and by not retrying them. It's easy,
because xhci_urb_dequeue() is the only user ever queuing the
command with the default handler and without ensuring that
the endpoint is Running and will not Halt before it Stops.
For this case, we assume that an endpoint with pending URBs
is always Running, unless certain operations are pending on
it which indicate known exceptions.

Note that we need to catch those exceptions when they occur,
because their flags may be cleared before our handler runs.

It's possible that other HCs have similar bugs (see also the
related "Running" case below), but the workaround is limited
to NEC because no such chips are currently known and tested.

Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++---
 drivers/usb/host/xhci.h      |  2 ++
 2 files changed, 43 insertions(+), 3 deletions(-)

Comments

Mathias Nyman Oct. 15, 2024, 11:05 a.m. UTC | #1
On 14.10.2024 22.10, Michal Pecio wrote:
> The NEC uPD720200 has a bug, which prevents reliably stopping
> an endpoint shortly after it has been restarted. This usually
> happens when a driver kills many URBs in quick succession and
> it results in concurrent execution and cancellation of TDs.
> 
> This is handled by stopping the endpoint again if in doubt.
> 
> This "doubt" turns out to be a problem, because Stop Endpoint
> may be queued when the EP is already Stopped (for Set TR Deq
> execution, for example) or becomes Stopped concurrently (by
> Reset Endpoint, for example). If the EP is truly Stopped, the
> command fails and further retries just keep failing forever.
> 
> This is easily triggered by modifying uvcvideo to unlink its
> isochronous URBs in 100us intervals instead of poisoning them.
> Any driver that unlinks URBs asynchronously may trigger this,
> and any URB unlink during ongoing halt recovery also can.
> 
> Fix the problem by tracking redundant Stop Endpoint commands
> which are sure to fail, and by not retrying them. It's easy,
> because xhci_urb_dequeue() is the only user ever queuing the
> command with the default handler and without ensuring that
> the endpoint is Running and will not Halt before it Stops.
> For this case, we assume that an endpoint with pending URBs
> is always Running, unless certain operations are pending on
> it which indicate known exceptions.
> 
> Note that we need to catch those exceptions when they occur,
> because their flags may be cleared before our handler runs.
> 
> It's possible that other HCs have similar bugs (see also the
> related "Running" case below), but the workaround is limited
> to NEC because no such chips are currently known and tested.
> 
> Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
> ---
>   drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++---
>   drivers/usb/host/xhci.h      |  2 ++
>   2 files changed, 43 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index 4d664ba53fe9..c0efb4d34ab9 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -911,6 +911,21 @@ static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id,
>   	return ret;
>   }
>   
> +/*
> + * A Stop Endpoint command is redundant if the EP is not in the Running state.
> + * It will fail with Context State Error. We sometimes queue redundant Stop EP
> + * commands when the EP is held Stopped for Set TR Deq execution, or Halted.
> + * A pending Stop Endpoint command *becomes* redundant if the EP halts before
> + * its completion, and this flag needs to be updated in those cases too.
> + */
> +static void xhci_update_stop_cmd_redundant(struct xhci_virt_ep *ep)
> +{
> +	if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT))
> +		ep->ep_state |= EP_STOP_CMD_REDUNDANT;
> +	else
> +		ep->ep_state &= ~EP_STOP_CMD_REDUNDANT;
> +}
> +
>   static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
>   				struct xhci_virt_ep *ep,
>   				struct xhci_td *td,
> @@ -946,6 +961,7 @@ static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
>   		return err;
>   
>   	ep->ep_state |= EP_HALTED;
> +	xhci_update_stop_cmd_redundant(ep);
>   
>   	xhci_ring_cmd_db(xhci);
>   
> @@ -1149,15 +1165,31 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
>   				break;
>   			ep->ep_state &= ~EP_STOP_CMD_PENDING;
>   			return;
> +
>   		case EP_STATE_STOPPED:
>   			/*
> -			 * NEC uPD720200 sometimes sets this state and fails with
> -			 * Context Error while continuing to process TRBs.
> -			 * Be conservative and trust EP_CTX_STATE on other chips.
> +			 * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
> +			 * EP is a Context State Error, and EP stays Stopped.
> +			 * The EP could be stopped by some concurrent job, so
> +			 * ignore this error when that's the case.
> +			 */
> +			if (ep->ep_state & EP_STOP_CMD_REDUNDANT)
> +				break;

Can we skip the new flag and just check for the correct flags here directly?

if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT)
	break;

Thanks
Mathias
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 4d664ba53fe9..c0efb4d34ab9 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -911,6 +911,21 @@  static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id,
 	return ret;
 }
 
+/*
+ * A Stop Endpoint command is redundant if the EP is not in the Running state.
+ * It will fail with Context State Error. We sometimes queue redundant Stop EP
+ * commands when the EP is held Stopped for Set TR Deq execution, or Halted.
+ * A pending Stop Endpoint command *becomes* redundant if the EP halts before
+ * its completion, and this flag needs to be updated in those cases too.
+ */
+static void xhci_update_stop_cmd_redundant(struct xhci_virt_ep *ep)
+{
+	if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT))
+		ep->ep_state |= EP_STOP_CMD_REDUNDANT;
+	else
+		ep->ep_state &= ~EP_STOP_CMD_REDUNDANT;
+}
+
 static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
 				struct xhci_virt_ep *ep,
 				struct xhci_td *td,
@@ -946,6 +961,7 @@  static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
 		return err;
 
 	ep->ep_state |= EP_HALTED;
+	xhci_update_stop_cmd_redundant(ep);
 
 	xhci_ring_cmd_db(xhci);
 
@@ -1149,15 +1165,31 @@  static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 				break;
 			ep->ep_state &= ~EP_STOP_CMD_PENDING;
 			return;
+
 		case EP_STATE_STOPPED:
 			/*
-			 * NEC uPD720200 sometimes sets this state and fails with
-			 * Context Error while continuing to process TRBs.
-			 * Be conservative and trust EP_CTX_STATE on other chips.
+			 * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
+			 * EP is a Context State Error, and EP stays Stopped.
+			 * The EP could be stopped by some concurrent job, so
+			 * ignore this error when that's the case.
+			 */
+			if (ep->ep_state & EP_STOP_CMD_REDUNDANT)
+				break;
+			/*
+			 * NEC uPD720200 controllers have this bug: it takes one
+			 * service interval to restart a stopped EP, and in this
+			 * time its state remains Stopped. Any new Stop Endpoint
+			 * command fails and this handler may run quickly enough
+			 * to still observe the Stopped state, but it's bound to
+			 * soon change to Running, and all hell breaks loose.
+			 *
+			 * So keep retrying until the command clearly succeeds.
+			 * Not clear what to do if other HCs have similar bugs.
 			 */
 			if (!(xhci->quirks & XHCI_NEC_HOST))
 				break;
 			fallthrough;
+
 		case EP_STATE_RUNNING:
 			/* Race, HW handled stop ep cmd before ep was running */
 			xhci_dbg(xhci, "Stop ep completion ctx error, ep is running\n");
@@ -4383,11 +4415,17 @@  int xhci_queue_evaluate_context(struct xhci_hcd *xhci, struct xhci_command *cmd,
 int xhci_queue_stop_endpoint(struct xhci_hcd *xhci, struct xhci_command *cmd,
 			     int slot_id, unsigned int ep_index, int suspend)
 {
+	struct xhci_virt_ep *ep;
 	u32 trb_slot_id = SLOT_ID_FOR_TRB(slot_id);
 	u32 trb_ep_index = EP_INDEX_FOR_TRB(ep_index);
 	u32 type = TRB_TYPE(TRB_STOP_RING);
 	u32 trb_suspend = SUSPEND_PORT_FOR_TRB(suspend);
 
+	ep = xhci_get_virt_ep(xhci, slot_id, ep_index);
+	if (ep)
+		xhci_update_stop_cmd_redundant(ep);
+	/* else: don't care, the handler will do nothing anyway */
+
 	return queue_command(xhci, cmd, 0, 0, 0,
 			trb_slot_id | trb_ep_index | type | trb_suspend, false);
 }
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 620502de971a..dd803b016c48 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -670,6 +670,8 @@  struct xhci_virt_ep {
 #define EP_SOFT_CLEAR_TOGGLE	(1 << 7)
 /* usb_hub_clear_tt_buffer is in progress */
 #define EP_CLEARING_TT		(1 << 8)
+/* the EP is kept Stopped or recovering from Halt/Error */
+#define EP_STOP_CMD_REDUNDANT	(1 << 9)
 	/* ----  Related to URB cancellation ---- */
 	struct list_head	cancelled_td_list;
 	struct xhci_hcd		*xhci;