mbox series

[0/2] xhci: Fix the NEC stop bug workaround

Message ID 20241025121806.628e32c0@foxbook
Headers show
Series xhci: Fix the NEC stop bug workaround | expand

Message

Michał Pecio Oct. 25, 2024, 10:18 a.m. UTC
Hi,


This is the promised v2 of this bugfix. It took longer than expected
because I got sidetracked by two (related) issues:

1. looking for similar bugs in other chips
2. simplifying this to avoid adding the STOP_CMD_REDUNDANT flag

Changes in v2:

1. Dropped the warning patch, because dealing with other chips is a
   separate issue from fixing the bug in existing code.
2. Added CC:stable to make the patch bot happy.
3. Some comment updates/clarifications to address questions asked by
   reviewers. Comments made vendor-agnostic, no longer mention NEC in
   preparation for other buggy vendors.
4. Added an RFC patch to simplify things and avoid queuing redundant
   commands instead of trying to handle them. Still a little dodgy in
   one place, problem described in a C99 comment. But it works for me.

The simplification is a separate patch because that's how the code
evolved and because it could enable the more straightforward and lower
risk patch 1/2 to be used in stable without patch 2/2, if desired.

Or otherwise, I could squash the patches together, of course.


Regarding other chips, the following was found:
1. NEC discovered this bug and fixed it in a silicon or FW revision.
   Some chips have the bug, but I have one which doesn't.
2. I couldn't reproduce this bug on VIA VL805 and Etron EJ168A.
3. Two ASMedia chips tested, both have the bug. ASM3142 aka ASM2142
   is quite subtle, because it only seems to happen when multiple EPs
   are used at the same time. I suspect it's a matter of the command
   ring fetching commands asynchronously before we ring the command
   doorbell, or simply increased xHC load triggers some internal bug.

ASMedia presents an additional challange because it sometimes gets
stuck: Stop Endpoint fails in Stopped state even though our ep_state
says it should be running, and it never starts. I need to investigate
what exactly goes wrong and if our ep_state is bad or the chip.

This is dangerous, because the naive workaround would simply retry
the command forever. I suppose it may be a very good idea to add some
timeout. Say, if 100ms passes and the commands are still failing, just
assume that it is stopped for good and go ahead.


Regards,
Michal

Comments

Michał Pecio Oct. 28, 2024, 7:33 a.m. UTC | #1
Hi Mathias,

I would be grateful if you could take a look at patch 2/2 and tell if
there is anything obviously wrong with this approach. As far as I see,
it should be OK to just call those invalidation and giveback functions
directly from xhci_urb_dequeue(), and it works for me in practice.

Regarding the probem with xhci_invalidate_cancelled_tds() being called
while Set TR Dequeue is already pending, I found that it is much easier
to handle it by looking at SET_DEQ_PENDING and simply setting all TDs
to TD_CLEARING_CACHE_DEFERRED if that's the case. So this is solved.


However, these patches still don't solve the issue of infinite retries
completely. There is one more annoying case caused by halts. It is very
poorly defined what happens when a halted EP is hard-reset. Usually Set
TR Dequeue executes afterwards and it restarts the EP when done. But if
it doesn't, the EP stays stopped until a new URB is submitted, if ever.

There is the EP_HARD_CLEAR_TOGGLE flag which is set until the class
driver calls usb_clear_halt(), but it's neither the case that the EP is
guaranteed to be stopped until usb_clear_halt() is called nor that it
is guaranteed to restart afterwards.

Actually, I think it might be a bug? What if Set TR Dequeue restarts an
EP before the class driver clears the device side of the halt?


I'm starting to think that it may not be realistic to quickly solve all
those (and possibly other not known yet) problems now. Maybe just slap
a 100ms timeout on those retries, add quirks for ASMedia/Intel and call
it a day for now?

Regards,
Michal
Mathias Nyman Oct. 28, 2024, 9:54 a.m. UTC | #2
On 28.10.2024 9.33, Michal Pecio wrote:
> Hi Mathias,
> 
> I would be grateful if you could take a look at patch 2/2 and tell if
> there is anything obviously wrong with this approach. As far as I see,
> it should be OK to just call those invalidation and giveback functions
> directly from xhci_urb_dequeue(), and it works for me in practice.

Adding EP_HALTED case to xhci_urb_deqeue() should work fine, we
will both invalidate and give back the invalidated TDs in the completion
handler.

> 
> Regarding the probem with xhci_invalidate_cancelled_tds() being called
> while Set TR Dequeue is already pending, I found that it is much easier
> to handle it by looking at SET_DEQ_PENDING and simply setting all TDs
> to TD_CLEARING_CACHE_DEFERRED if that's the case. So this is solved.
>

The SET_DEQ_PENDING case is trickier. We would read the dequeue pointer
from hardware while we know hardware is processing a command to move the
dequeue pointer. Result may be old dequeue, or new dequeue,
possible unknown.

We are turning an already difficult issue even more complex

> 
> However, these patches still don't solve the issue of infinite retries
> completely. There is one more annoying case caused by halts. It is very
> poorly defined what happens when a halted EP is hard-reset. Usually Set
> TR Dequeue executes afterwards and it restarts the EP when done. But if
> it doesn't, the EP stays stopped until a new URB is submitted, if ever.
> 
> There is the EP_HARD_CLEAR_TOGGLE flag which is set until the class
> driver calls usb_clear_halt(), but it's neither the case that the EP is
> guaranteed to be stopped until usb_clear_halt() is called nor that it
> is guaranteed to restart afterwards.
> 
> Actually, I think it might be a bug? What if Set TR Dequeue restarts an
> EP before the class driver clears the device side of the halt?

Ok, I need to take some time to dig into this.

> 
> 
> I'm starting to think that it may not be realistic to quickly solve all
> those (and possibly other not known yet) problems now. Maybe just slap
> a 100ms timeout on those retries, add quirks for ASMedia/Intel and call
> it a day for now?

There are some mitigations that could be done.

As many of these issues are related to slow enpoint slow start causing
next stop endpoint command to complete with context error as endpoint is
still stopped.

We could ring the doorbell before giving back the invalidated tds in
xhci_handle_cmd_stop_ep(), and possibly xhci_handle_cmd_set_deq().
This gives hardware a bit more time to start the endpoint.

We could also track pending ring starts.
Set a "EP_START_PENDING flag when doorbell is rung on a stopped endpoint.
clear the flag when firt transfer event is received on that endpoint.

If a stop endpoint command fails with context error due to still being
stopped queue a new stop endpoint command, but only if flag was set:

xhci_handle_cmd_stop_ep()
	if (comp_code == COMP_CONTEXT_STATE_ERROR) {
		switch (GET_EP_CTX_STATE(ep_ctx))
		case EP_STATE_STOPPED:
			if (!(ep->ep_state & EP_START_PENDING)
				break;
			ep->ep_state &= ~EP_START_PENDING;
			xhci_queue_stop_endpoint();

Thanks
Mathias
Michał Pecio Oct. 29, 2024, 8:28 a.m. UTC | #3
On Mon, 28 Oct 2024 11:54:39 +0200, Mathias Nyman wrote:
> The SET_DEQ_PENDING case is trickier. We would read the dequeue
> pointer from hardware while we know hardware is processing a command
> to move the dequeue pointer. Result may be old dequeue, or new
> dequeue, possible unknown.

Damn, I looked at various things but I haven't thought about it. Yes,
it's dodgy and not really a great idea. Although I suspect it wouldn't
be *very* harmful, because the truly bad case (failing to queue a Set
TR Deq when it's necessary) is triggered by Set TR Deq already pending
on the same stream, so the stream's cache will be cleared anyway.

But it could easily queue a bunch of pointless commands, for example.

By the way, I think this race is already possible today, without my
patches. There is no testing for SET_DEQ_PENDING in xhci_urb_dequeue()
and no testing in handle_cmd_stop_ep(). If this happens in the middle
of a Set TR Deq chain on a streams endpoint, nothing seems to stop the
Stop EP handler from attempting invalidation under SET_DEQ_PENDING.

Maybe invalidate_cancelled_tds() should bail out if SET_DEQ_PENDING
and later Set Deq completion handler should unconditionally call the
invalidate/giveback combo before it exits.

> We could ring the doorbell before giving back the invalidated tds in
> xhci_handle_cmd_stop_ep(), and possibly xhci_handle_cmd_set_deq().
> This gives hardware a bit more time to start the endpoint.

This unfortunately doesn't buy much time, because giveback is a very
cheap operation - it just adds the URBs to a queue and wakes a worker
which runs all those callbacks. It finishes under 1us on my system.

> We could also track pending ring starts.
> Set a "EP_START_PENDING flag when doorbell is rung on a stopped
> endpoint. clear the flag when firt transfer event is received on that
> endpoint.

Yes, that was the second thing I tried. But I abandoned it:

Problem 1: URBs on "idle" devices are cancelled before generating
any event, so we need to clear the flag on Stop EP and Reset EP.
Not all of them use the default completion handler. Maybe it could
be handled reliably by tapping into handle_cmd_completion(). But...

Problem 2: hardware bugs. On ASMedia 3142, I can trigger (from
userspace) a condition when EP 0 doorbell is rung (even twice)
and its state is still Stopped a few seconds later, and remains
so after repeated Stop EP / doorbell rings.

It looks like a timeout is the only way to be really sure.


Regards,
Michal