Message ID | 20241025121806.628e32c0@foxbook |
---|---|
Headers | show |
Series | xhci: Fix the NEC stop bug workaround | expand |
Hi Mathias, I would be grateful if you could take a look at patch 2/2 and tell if there is anything obviously wrong with this approach. As far as I see, it should be OK to just call those invalidation and giveback functions directly from xhci_urb_dequeue(), and it works for me in practice. Regarding the probem with xhci_invalidate_cancelled_tds() being called while Set TR Dequeue is already pending, I found that it is much easier to handle it by looking at SET_DEQ_PENDING and simply setting all TDs to TD_CLEARING_CACHE_DEFERRED if that's the case. So this is solved. However, these patches still don't solve the issue of infinite retries completely. There is one more annoying case caused by halts. It is very poorly defined what happens when a halted EP is hard-reset. Usually Set TR Dequeue executes afterwards and it restarts the EP when done. But if it doesn't, the EP stays stopped until a new URB is submitted, if ever. There is the EP_HARD_CLEAR_TOGGLE flag which is set until the class driver calls usb_clear_halt(), but it's neither the case that the EP is guaranteed to be stopped until usb_clear_halt() is called nor that it is guaranteed to restart afterwards. Actually, I think it might be a bug? What if Set TR Dequeue restarts an EP before the class driver clears the device side of the halt? I'm starting to think that it may not be realistic to quickly solve all those (and possibly other not known yet) problems now. Maybe just slap a 100ms timeout on those retries, add quirks for ASMedia/Intel and call it a day for now? Regards, Michal
On 28.10.2024 9.33, Michal Pecio wrote: > Hi Mathias, > > I would be grateful if you could take a look at patch 2/2 and tell if > there is anything obviously wrong with this approach. As far as I see, > it should be OK to just call those invalidation and giveback functions > directly from xhci_urb_dequeue(), and it works for me in practice. Adding EP_HALTED case to xhci_urb_deqeue() should work fine, we will both invalidate and give back the invalidated TDs in the completion handler. > > Regarding the probem with xhci_invalidate_cancelled_tds() being called > while Set TR Dequeue is already pending, I found that it is much easier > to handle it by looking at SET_DEQ_PENDING and simply setting all TDs > to TD_CLEARING_CACHE_DEFERRED if that's the case. So this is solved. > The SET_DEQ_PENDING case is trickier. We would read the dequeue pointer from hardware while we know hardware is processing a command to move the dequeue pointer. Result may be old dequeue, or new dequeue, possible unknown. We are turning an already difficult issue even more complex > > However, these patches still don't solve the issue of infinite retries > completely. There is one more annoying case caused by halts. It is very > poorly defined what happens when a halted EP is hard-reset. Usually Set > TR Dequeue executes afterwards and it restarts the EP when done. But if > it doesn't, the EP stays stopped until a new URB is submitted, if ever. > > There is the EP_HARD_CLEAR_TOGGLE flag which is set until the class > driver calls usb_clear_halt(), but it's neither the case that the EP is > guaranteed to be stopped until usb_clear_halt() is called nor that it > is guaranteed to restart afterwards. > > Actually, I think it might be a bug? What if Set TR Dequeue restarts an > EP before the class driver clears the device side of the halt? Ok, I need to take some time to dig into this. > > > I'm starting to think that it may not be realistic to quickly solve all > those (and possibly other not known yet) problems now. Maybe just slap > a 100ms timeout on those retries, add quirks for ASMedia/Intel and call > it a day for now? There are some mitigations that could be done. As many of these issues are related to slow enpoint slow start causing next stop endpoint command to complete with context error as endpoint is still stopped. We could ring the doorbell before giving back the invalidated tds in xhci_handle_cmd_stop_ep(), and possibly xhci_handle_cmd_set_deq(). This gives hardware a bit more time to start the endpoint. We could also track pending ring starts. Set a "EP_START_PENDING flag when doorbell is rung on a stopped endpoint. clear the flag when firt transfer event is received on that endpoint. If a stop endpoint command fails with context error due to still being stopped queue a new stop endpoint command, but only if flag was set: xhci_handle_cmd_stop_ep() if (comp_code == COMP_CONTEXT_STATE_ERROR) { switch (GET_EP_CTX_STATE(ep_ctx)) case EP_STATE_STOPPED: if (!(ep->ep_state & EP_START_PENDING) break; ep->ep_state &= ~EP_START_PENDING; xhci_queue_stop_endpoint(); Thanks Mathias
On Mon, 28 Oct 2024 11:54:39 +0200, Mathias Nyman wrote: > The SET_DEQ_PENDING case is trickier. We would read the dequeue > pointer from hardware while we know hardware is processing a command > to move the dequeue pointer. Result may be old dequeue, or new > dequeue, possible unknown. Damn, I looked at various things but I haven't thought about it. Yes, it's dodgy and not really a great idea. Although I suspect it wouldn't be *very* harmful, because the truly bad case (failing to queue a Set TR Deq when it's necessary) is triggered by Set TR Deq already pending on the same stream, so the stream's cache will be cleared anyway. But it could easily queue a bunch of pointless commands, for example. By the way, I think this race is already possible today, without my patches. There is no testing for SET_DEQ_PENDING in xhci_urb_dequeue() and no testing in handle_cmd_stop_ep(). If this happens in the middle of a Set TR Deq chain on a streams endpoint, nothing seems to stop the Stop EP handler from attempting invalidation under SET_DEQ_PENDING. Maybe invalidate_cancelled_tds() should bail out if SET_DEQ_PENDING and later Set Deq completion handler should unconditionally call the invalidate/giveback combo before it exits. > We could ring the doorbell before giving back the invalidated tds in > xhci_handle_cmd_stop_ep(), and possibly xhci_handle_cmd_set_deq(). > This gives hardware a bit more time to start the endpoint. This unfortunately doesn't buy much time, because giveback is a very cheap operation - it just adds the URBs to a queue and wakes a worker which runs all those callbacks. It finishes under 1us on my system. > We could also track pending ring starts. > Set a "EP_START_PENDING flag when doorbell is rung on a stopped > endpoint. clear the flag when firt transfer event is received on that > endpoint. Yes, that was the second thing I tried. But I abandoned it: Problem 1: URBs on "idle" devices are cancelled before generating any event, so we need to clear the flag on Stop EP and Reset EP. Not all of them use the default completion handler. Maybe it could be handled reliably by tapping into handle_cmd_completion(). But... Problem 2: hardware bugs. On ASMedia 3142, I can trigger (from userspace) a condition when EP 0 doorbell is rung (even twice) and its state is still Stopped a few seconds later, and remains so after repeated Stop EP / doorbell rings. It looks like a timeout is the only way to be really sure. Regards, Michal
On 29.10.2024 10.28, Michał Pecio wrote: > > By the way, I think this race is already possible today, without my > patches. There is no testing for SET_DEQ_PENDING in xhci_urb_dequeue() > and no testing in handle_cmd_stop_ep(). If this happens in the middle > of a Set TR Deq chain on a streams endpoint, nothing seems to stop the > Stop EP handler from attempting invalidation under SET_DEQ_PENDING. > > Maybe invalidate_cancelled_tds() should bail out if SET_DEQ_PENDING > and later Set Deq completion handler should unconditionally call the > invalidate/giveback combo before it exits. > I think you are on to something. If we add invalidate/givaback to Set TR deq completion handler, allowing it to possible queue new Set TR Deq commands, then we can bail out in xhci_urb_dequeue() if SET_DEQ_PENDING is set. xhci_urb_dequeue() would not queue a extra stop endpoint command, only set td->cancel_status to TD_DIRTY dirty, and Set TR Deq handler would not ring the doorbell unnecessary. Sounds like a plan to ne. >> We could ring the doorbell before giving back the invalidated tds in >> xhci_handle_cmd_stop_ep(), and possibly xhci_handle_cmd_set_deq(). >> This gives hardware a bit more time to start the endpoint. > > This unfortunately doesn't buy much time, because giveback is a very > cheap operation - it just adds the URBs to a queue and wakes a worker > which runs all those callbacks. It finishes under 1us on my system. Probably true, but change is simple and free so might be worth it. Thanks Mathias
On 31.10.2024 13.17, Michał Pecio wrote: > Update: > >> Your patch prints one dev_dbg() each time, mine spams many of them for >> 100ms each time. I will remove this one retry limit from your patch to >> see if starts spinning infinitely, but I strongly suspect it will. > > Yes, that's exactly what happens. > > This time I have killed the ifconfig loop, unplugged the NIC and > started 'rmmod xhci_pci', which is still hanging 10 minutes later. > > So business as usual when these things go wrong. > >> One retry is not enough. This is what I got on the first try with a >> random UVC webcam: >> [...] Ok, good to know, then using flag is not enough. Using a retry timeout for failed stop endpoint commands also sounds good to me. It has a slight downside of a possible 100ms aggressive 'Stop Endpoint' retry loop in cases where endpoint was stopped earlier for some other reason. Not sure if that is a problem, if it is then we can add the flag and only retry for 100ms if flag is set (only clear flag in handle_tx_event()) Another reason for the flag is the additional note in xhci 4.8.3 [1], we might need to track the state better in software. [1] xhci 4.8.3 Endpoint Context state "There are several cases where the EP State field in the Output Endpoint Context may not reflect the current state of an endpoint. The xHC should attempt to keep EP State as current as possible, however it may defer these updates to perform higher priority references to memory, e.g. Isoch data transfers, etc. Software should maintain an internal variable that tracks the state of an endpoint and not depend on EP State to represent the instantaneous state of an endpoint. For example, when a Command that affects EP State is issued, the value of EP State may be updated anytime between when software rings the Command Ring doorbell for a command and when the associated Command Completion Event is placed on the Event Ring by the xHC. The update of EP State may also be delayed relative to a Doorbell ring or error condition (e.g. TRB Error, STALL, or USB Transaction Error) that causes an EP State change not generated by a command. Software should maintain an accurate value for EP State, by tracking it with an internal variable that is driven by Events and Doorbell accesses associated with an endpoint using the following method: • When a command is issued to an endpoint that affects its state, software should use the Command Completion Event to update its image of EP State to the appropriate state. • When a Transfer Event reports a TRB Error, software should update its image of EP State to Error. • When a Transfer Event reports a Stall Error or USB Transaction Error, software should update its image of EP State to Halted. • When software rings the Doorbell of an endpoint to transition it from the Stopped to Running state, it should update its image of EP State to Running." Thanks -Mathias