[RFC] xhci: Add Link TRB sync quirk for ASM3142

Message ID	20210416093729.41865-1-ole@salscheider.org
State	New
Headers	show Return-Path: <linux-usb-owner@kernel.org> From: Ole Salscheider <ole@salscheider.org> To: linux-usb@vger.kernel.org Cc: Mathias Nyman <mathias.nyman@intel.com>, Ole Salscheider <ole@salscheider.org> Subject: [PATCH] [RFC] xhci: Add Link TRB sync quirk for ASM3142 Date: Fri, 16 Apr 2021 11:37:29 +0200 Message-Id: <20210416093729.41865-1-ole@salscheider.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[RFC] xhci: Add Link TRB sync quirk for ASM3142 \| expand [RFC] xhci: Add Link TRB sync quirk for ASM3142

Ole Salscheider April 16, 2021, 9:37 a.m. UTC

This patch adds a quirk to the xhci driver so that link TRBs are only
given to the host controller once it has processed all previous TRBs on
this segment.

This quirk is necessary for me on an ASMedia ASM3142 host controller.
Without it, I get the following errors when accessing a SuperSpeed UVC
camera:

Transfer event TRB DMA ptr not part of current TD ep_index XX comp_code XX

You can find more details in my previous mail about the problem:
https://lkml.org/lkml/2021/3/31/355

This patch fixes my problem, but it is probably terribly wrong. I am not
even sure if I can rely on handle_tx_event being called before each link
TRB in the segment. Some feedback would be very welcome.
---
 drivers/usb/host/xhci-pci.c  |  4 +++-
 drivers/usb/host/xhci-ring.c | 34 ++++++++++++++++++++++++++++++----
 drivers/usb/host/xhci.h      |  1 +
 3 files changed, 34 insertions(+), 5 deletions(-)

Ole Salscheider April 19, 2021, 10:52 a.m. UTC | #1

Hi Mathias,

On 16.04.21 17:18, Ole Salscheider wrote:
> Hi Mathias.

> 

> On 16.04.21 14:09, Mathias Nyman wrote:

>> Hi Ole

>>

>> On 16.4.2021 12.37, Ole Salscheider wrote:

>>> This patch adds a quirk to the xhci driver so that link TRBs are only

>>> given to the host controller once it has processed all previous TRBs on

>>> this segment.

>>>

>>> This quirk is necessary for me on an ASMedia ASM3142 host controller.

>>> Without it, I get the following errors when accessing a SuperSpeed UVC

>>> camera:

>>>

>>> Transfer event TRB DMA ptr not part of current TD ep_index XX 

> comp_code XX

>>>

>>> You can find more details in my previous mail about the problem:

>>> https://lkml.org/lkml/2021/3/31/355

>>>

>>> This patch fixes my problem, but it is probably terribly wrong. I am not

>>> even sure if I can rely on handle_tx_event being called before each link

>>> TRB in the segment. Some feedback would be very welcome.

>>

>> I think we need to look at the cause more closely.

>>

>> We normally only get events for the last TRB of a TD, or for short 

> transfers like in your case.

>> So not every transfer TRB generates events.

>>

>> There are several things going on here that combined could cause this.

>>

>> Last transfer TRB of a segment has some alignment requirements which 

> might not be handled in the isoc case.

>> The amount of untransferred data is large, (16388 bytes) so the TRB 

> causing the short packet

>> could be far from the last TRB we expect the event on.

>> Due to new segment and link trb maybe the stored last_trb for this TD 

> is just set wrong

>>

>> Anyway, more detailed traces together with dynamic debug show us more:

>>

>> mount -t debugfs none /sys/kernel/debug

>> echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control

>> echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control

>> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb

>> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable

>> < trigger the issue >

>> Send output of dmesg

>> Send content of /sys/kernel/debug/tracing/trace

>>

>> trace accumulates pretty fast so try to copy it as soon as the issue 

> is seen.

> 

> I have uploaded the dmesg output to

> https://stuff.salscheider.org/dmesg_out

> and the trace to

> https://stuff.salscheider.org/trace_out

> 

> With the trace enabled, I did not get the DMA errors. Maybe it slowed 

> down the computer enough. But still the camera stream froze after ~3 

> frames.

> 

> The log contains a recording with ffmpeg that gave a few frames (at 

> second 83), then some where it hung completely. Then I re-plugged the 

> camera and got a few more frames (at second 179) before it hung again.

> 

> I hope this is helpful. If you need a different log please tell me.


I tried it a second time today and now I got the DMA error also when 
tracing was active. You can find the dmesg and trace outputs here:

https://stuff.salscheider.org/dmesg_out2
https://stuff.salscheider.org/trace_out2

Best regards,

Ole

Mathias Nyman April 19, 2021, 2:43 p.m. UTC | #2

Hi

On 19.4.2021 13.52, Ole Salscheider wrote:
> 

> 

> I tried it a second time today and now I got the DMA error also when tracing was active. You can find the dmesg and trace outputs here:

> 

> https://stuff.salscheider.org/dmesg_out2

> https://stuff.salscheider.org/trace_out2

> 


Thanks. 
It really looks like we just don't get an event for the last transfers TRB on the segment.
In this case bulk transfers are used, one TD per URB, with one TRB in only per TD.

From the trace:
Queue the second last transfer TRB on this segment:
  365.159991: xhci_urb_enqueue: ep3in-bulk: urb 0000000029404272 pipe 3225518720 slot 1 length 0/32768 sgs 0/0 stream 0 flags 00000204
  365.159991: xhci_queue_trb: BULK: Buffer 00000000fe388000 length 32768 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:c
  365.159991: xhci_inc_enq: BULK 00000000ff812306: enq 0x00000000ffe77fe0(0x00000000ffe77000) deq 0x00000000ffe77f90(0x00000000ffe77000) segs 2 stream 0 free_trbs 504 bounce 102

Queue the last transfer TRB on this segment at ..77fe0. (note xhci_inc_enq show enq value _after_ it increased it)
  365.160043: xhci_urb_enqueue: ep3in-bulk: urb 00000000518e7c2f pipe 3225518720 slot 1 length 0/32768 sgs 0/0 stream 0 flags 00000204
  365.160044: xhci_queue_trb: BULK: Buffer 00000000fe380000 length 32768 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:c
  365.160044: xhci_inc_enq: BULK 00000000ff812306: enq 0x00000000ffe77ff0(0x00000000ffe77000) deq 0x00000000ffe77fa0(0x00000000ffe77000) segs 2 stream 0 free_trbs 504 bounce 1024 cycle 1

(omitted get events for TRBs at ..77fa0, ..77fb0, ..77fc0)
(omitted queue TRBs at next segment at ..76000, 76010, ..76020)

Event for completed TRB at 77fd0 (second last Transer on segment)
  365.160815: xhci_handle_event: EVENT: TRB 00000000ffe77fd0 status 'Short Packet' len 16388 slot 1 ep 7 type 'Transfer Event' flags e:c
  365.160815: xhci_handle_transfer: BULK: Buffer 00000000fe388000 length 32768 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C
  365.160817: xhci_inc_deq: BULK 00000000ff812306: enq 0x00000000ffe76030(0x00000000ffe76000) deq 0x00000000ffe77fe0(0x00000000ffe77000) segs 2 stream 0 free_trbs 505 bounce 1024 cycle 1

Queue Transfer TRB at 76030
  365.160831: xhci_urb_enqueue: ep3in-bulk: urb 0000000029404272 pipe 3225518720 slot 1 length 0/32768 sgs 0/0 stream 0 flags 00000204
  365.160831: xhci_queue_trb: BULK: Buffer 00000000fe388000 length 32768 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:c
  365.160831: xhci_inc_enq: BULK 00000000ff812306: enq 0x00000000ffe76040(0x00000000ffe76000) deq 0x00000000ffe77fe0(0x00000000ffe77000) segs 2 stream 0 free_trbs 504 bounce 1024 cycle 1

Event for TRB at ..76000, the first TRB on the next segment.
Note that we didn't get an event for last transfer TRB at ..77fe0,
  365.161182: xhci_handle_event: EVENT: TRB 00000000ffe76000 status 'Success' len 0 slot 1 ep 7 type 'Transfer Event' flags e:c
  365.161183: xhci_inc_deq: EVENT 00000000837a99ac: enq 0x00000000ffeec000(0x00000000ffeec000) deq 0x00000000ffeec6e0(0x00000000ffeec000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0
  365.161228: xhci_handle_event: EVENT: TRB 00000000ffe76010 status 'Short Packet' len 16388 slot 1 ep 7 type 'Transfer Event' flags e:c
  365.161236: xhci_inc_deq: EVENT 00000000837a99ac: enq 0x00000000ffeec000(0x00000000ffeec000) deq 0x00000000ffeec6f0(0x00000000ffeec000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0
  365.161567: xhci_handle_event: EVENT: TRB 00000000ffe76020 status 'Success' len 0 slot 1 ep 7 type 'Transfer Event' flags e:c
  365.161573: xhci_inc_deq: EVENT 00000000837a99ac: enq 0x00000000ffeec000(0x00000000ffeec000) deq 0x00000000ffeec700(0x00000000ffeec000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0
  365.161618: xhci_handle_event: EVENT: TRB 00000000ffe76030 status 'Short Packet' len 16388 slot 1 ep 7 type 'Transfer Event' flags e:c
  365.161623: xhci_inc_deq: EVENT 00000000837a99ac: enq 0x00000000ffeec000(0x00000000ffeec000) deq 0x00000000ffeec710(0x00000000ffeec000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0

Five seconds later driver start canceling TRBs due to timeout

In some cases the Link TRB might be given to HW (cycle bit written) before the last transfer TRB as
inc_enq() is called before giveback_first_trb().
This shouldn't be the case here as we only have one TRB per URB (no chain or more_trbs_coming set).
Maybe still worth double checking this.
We have memory barriers after writing the TRB, but not after the later cycle bit write.
maybe still try if a wmb() after writing the cycle bit in giveback_first_trb() helps.

One interesting thing is that every second transfer completes short, and every other as success (full).
When we miss one TRB I'd expect the order to get messed up, but it doesn't.
It's as if hw doesn't consider that last transfer TRB valid, and just skips it.

tracing shows the content of the TRB while it's written, before writing the cycle bit.
Could be worth dumping transfer ring after issue is seen and check TRB still looks valid:
cat /sys/kernel/debug/usb/xhci/<your controller>/devices/<xx>/ep<yy>/trbs
you need to figure out which controller, device and endpoint number for this.

How about your delayed link TRB cycle write patch? did it help trigger an event for the last transfer 
TRB, or did it just help as controller couldn't go past the link TRB and was less out of sync with
the driver?

Also trying latest upstream kernel instead of 5.8 could help.

Thanks
-Mathias

Ole Salscheider April 19, 2021, 7:05 p.m. UTC | #3

Hi Mathias,

I'm sorry for messing the reply-to-all up... Here it comes again:

On 19.04.21 16:43, Mathias Nyman wrote:
> Hi

> 

> On 19.4.2021 13.52, Ole Salscheider wrote:

>>

>>

>> I tried it a second time today and now I got the DMA error also when tracing was active. You can find the dmesg and trace outputs here:

>>

>> https://stuff.salscheider.org/dmesg_out2

>> https://stuff.salscheider.org/trace_out2

>>

> 

> Thanks.

> It really looks like we just don't get an event for the last transfers TRB on the segment.

> In this case bulk transfers are used, one TD per URB, with one TRB in only per TD.

> 

>  From the trace:

> Queue the second last transfer TRB on this segment:

>    365.159991: xhci_urb_enqueue: ep3in-bulk: urb 0000000029404272 pipe 3225518720 slot 1 length 0/32768 sgs 0/0 stream 0 flags 00000204

>    365.159991: xhci_queue_trb: BULK: Buffer 00000000fe388000 length 32768 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:c

>    365.159991: xhci_inc_enq: BULK 00000000ff812306: enq 0x00000000ffe77fe0(0x00000000ffe77000) deq 0x00000000ffe77f90(0x00000000ffe77000) segs 2 stream 0 free_trbs 504 bounce 102

> 

> Queue the last transfer TRB on this segment at ..77fe0. (note xhci_inc_enq show enq value _after_ it increased it)

>    365.160043: xhci_urb_enqueue: ep3in-bulk: urb 00000000518e7c2f pipe 3225518720 slot 1 length 0/32768 sgs 0/0 stream 0 flags 00000204

>    365.160044: xhci_queue_trb: BULK: Buffer 00000000fe380000 length 32768 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:c

>    365.160044: xhci_inc_enq: BULK 00000000ff812306: enq 0x00000000ffe77ff0(0x00000000ffe77000) deq 0x00000000ffe77fa0(0x00000000ffe77000) segs 2 stream 0 free_trbs 504 bounce 1024 cycle 1

> 

> (omitted get events for TRBs at ..77fa0, ..77fb0, ..77fc0)

> (omitted queue TRBs at next segment at ..76000, 76010, ..76020)

> 

> Event for completed TRB at 77fd0 (second last Transer on segment)

>    365.160815: xhci_handle_event: EVENT: TRB 00000000ffe77fd0 status 'Short Packet' len 16388 slot 1 ep 7 type 'Transfer Event' flags e:c

>    365.160815: xhci_handle_transfer: BULK: Buffer 00000000fe388000 length 32768 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:C

>    365.160817: xhci_inc_deq: BULK 00000000ff812306: enq 0x00000000ffe76030(0x00000000ffe76000) deq 0x00000000ffe77fe0(0x00000000ffe77000) segs 2 stream 0 free_trbs 505 bounce 1024 cycle 1

> 

> Queue Transfer TRB at 76030

>    365.160831: xhci_urb_enqueue: ep3in-bulk: urb 0000000029404272 pipe 3225518720 slot 1 length 0/32768 sgs 0/0 stream 0 flags 00000204

>    365.160831: xhci_queue_trb: BULK: Buffer 00000000fe388000 length 32768 TD size 0 intr 0 type 'Normal' flags b:i:I:c:s:I:e:c

>    365.160831: xhci_inc_enq: BULK 00000000ff812306: enq 0x00000000ffe76040(0x00000000ffe76000) deq 0x00000000ffe77fe0(0x00000000ffe77000) segs 2 stream 0 free_trbs 504 bounce 1024 cycle 1

> 

> Event for TRB at ..76000, the first TRB on the next segment.

> Note that we didn't get an event for last transfer TRB at ..77fe0,

>    365.161182: xhci_handle_event: EVENT: TRB 00000000ffe76000 status 'Success' len 0 slot 1 ep 7 type 'Transfer Event' flags e:c

>    365.161183: xhci_inc_deq: EVENT 00000000837a99ac: enq 0x00000000ffeec000(0x00000000ffeec000) deq 0x00000000ffeec6e0(0x00000000ffeec000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0

>    365.161228: xhci_handle_event: EVENT: TRB 00000000ffe76010 status 'Short Packet' len 16388 slot 1 ep 7 type 'Transfer Event' flags e:c

>    365.161236: xhci_inc_deq: EVENT 00000000837a99ac: enq 0x00000000ffeec000(0x00000000ffeec000) deq 0x00000000ffeec6f0(0x00000000ffeec000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0

>    365.161567: xhci_handle_event: EVENT: TRB 00000000ffe76020 status 'Success' len 0 slot 1 ep 7 type 'Transfer Event' flags e:c

>    365.161573: xhci_inc_deq: EVENT 00000000837a99ac: enq 0x00000000ffeec000(0x00000000ffeec000) deq 0x00000000ffeec700(0x00000000ffeec000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0

>    365.161618: xhci_handle_event: EVENT: TRB 00000000ffe76030 status 'Short Packet' len 16388 slot 1 ep 7 type 'Transfer Event' flags e:c

>    365.161623: xhci_inc_deq: EVENT 00000000837a99ac: enq 0x00000000ffeec000(0x00000000ffeec000) deq 0x00000000ffeec710(0x00000000ffeec000) segs 1 stream 0 free_trbs 254 bounce 0 cycle 0

> 

> Five seconds later driver start canceling TRBs due to timeout

> 

> In some cases the Link TRB might be given to HW (cycle bit written) before the last transfer TRB as

> inc_enq() is called before giveback_first_trb().

> This shouldn't be the case here as we only have one TRB per URB (no chain or more_trbs_coming set).

> Maybe still worth double checking this.

> We have memory barriers after writing the TRB, but not after the later cycle bit write.

> maybe still try if a wmb() after writing the cycle bit in giveback_first_trb() helps.


I added the wmb() but it did not change anything.
I added debug output to inc_enq(), but the Link TRB is never given to 
the HW there, only in giveback_first_trb().

> One interesting thing is that every second transfer completes short, and every other as success (full).

> When we miss one TRB I'd expect the order to get messed up, but it doesn't.

> It's as if hw doesn't consider that last transfer TRB valid, and just skips it.

> 

> tracing shows the content of the TRB while it's written, before writing the cycle bit.

> Could be worth dumping transfer ring after issue is seen and check TRB still looks valid:

> cat /sys/kernel/debug/usb/xhci/<your controller>/devices/<xx>/ep<yy>/trbs

> you need to figure out which controller, device and endpoint number for this.


I think this is the output for the correct device and endpoint:

https://stuff.salscheider.org/trbs

> How about your delayed link TRB cycle write patch? did it help trigger an event for the last transfer

> TRB, or did it just help as controller couldn't go past the link TRB and was less out of sync with

> the driver?



You can find a trace of recording a few seconds of video with my patch here:

https://stuff.salscheider.org/trace_with_patch

As far as I can tell from it, the patch helps to trigger an event for 
all transfer TRBs. I also tried to record around 30 min of video with it 
and it did not encounter a problem while recording.

I however cannot start a new recording after stopping the first. And 
sometimes I observe hangs of the computer some time after stopping the 
recording. That might be some unwanted side-effect of my patch...

> Also trying latest upstream kernel instead of 5.8 could help.


My patch is against the master branch and I see the exact same problem 
there.

Best regards,
Ole

Mathias Nyman April 21, 2021, 7:28 a.m. UTC | #4

On 19.4.2021 22.05, Ole Salscheider wrote:
> Hi Mathias,

...

>> tracing shows the content of the TRB while it's written, before writing the cycle bit.

>> Could be worth dumping transfer ring after issue is seen and check TRB still looks valid:

>> cat /sys/kernel/debug/usb/xhci/<your controller>/devices/<xx>/ep<yy>/trbs

>> you need to figure out which controller, device and endpoint number for this.

> 

> I think this is the output for the correct device and endpoint:

> 

> https://stuff.salscheider.org/trbs


Thanks, the TRB was already turned No-op by the cancel code by then, but it doesn't matter,
your trace below indicates the TRB is fine.

> 

>> How about your delayed link TRB cycle write patch? did it help trigger an event for the last transfer

>> TRB, or did it just help as controller couldn't go past the link TRB and was less out of sync with

>> the driver?

> 

> 

> You can find a trace of recording a few seconds of video with my patch here:

> 

> https://stuff.salscheider.org/trace_with_patch

> 

> As far as I can tell from it, the patch helps to trigger an event for all transfer TRBs. I also tried to record around 30 min of video with it and it did not encounter a problem while recording.

> 

> I however cannot start a new recording after stopping the first. And sometimes I observe hangs of the computer some time after stopping the recording. That might be some unwanted side-effect of my patch...

> 


Yes, It shows transfer events for the TRBs before the link TRB.
This workaround is however a can of worms as many TD's consist of several TRBs, and they only trigger events for the last TRB.

How about a different approach?
If the issue is only with transfers starting on the last TRB before the link TRB, we could turn that TRB to a no-op.
Does something like the code below help?

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 6cdea0d00d19..0ffda8127640 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -3181,6 +3181,12 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
 		}
 	}
 
+	if (ep_ring != xhci->cmd_ring &&
+	    !trb_is_link(ep_ring->enqueue) &&
+	    trb_is_link(ep_ring->enqueue + 1))
+		queue_trb(xhci, ep_ring, 0, 0, 0, 0,
+			  TRB_TYPE(TRB_TR_NOOP) | ep_ring->cycle_state);
+
 	while (trb_is_link(ep_ring->enqueue)) {
 		/* If we're not dealing with 0.95 hardware or isoc rings
 		 * on AMD 0.96 host, clear the chain bit.
 
-Mathias

Ole Salscheider April 21, 2021, 8:45 a.m. UTC | #5

Hi

On 21.04.21 09:28, Mathias Nyman wrote:
> On 19.4.2021 22.05, Ole Salscheider wrote:

>> Hi Mathias,

> ...

> 

>>> tracing shows the content of the TRB while it's written, before writing the cycle bit.

>>> Could be worth dumping transfer ring after issue is seen and check TRB still looks valid:

>>> cat /sys/kernel/debug/usb/xhci/<your controller>/devices/<xx>/ep<yy>/trbs

>>> you need to figure out which controller, device and endpoint number for this.

>>

>> I think this is the output for the correct device and endpoint:

>>

>> https://stuff.salscheider.org/trbs

> 

> Thanks, the TRB was already turned No-op by the cancel code by then, but it doesn't matter,

> your trace below indicates the TRB is fine.

> 

>>

>>> How about your delayed link TRB cycle write patch? did it help trigger an event for the last transfer

>>> TRB, or did it just help as controller couldn't go past the link TRB and was less out of sync with

>>> the driver?

>>

>>

>> You can find a trace of recording a few seconds of video with my patch here:

>>

>> https://stuff.salscheider.org/trace_with_patch

>>

>> As far as I can tell from it, the patch helps to trigger an event for all transfer TRBs. I also tried to record around 30 min of video with it and it did not encounter a problem while recording.

>>

>> I however cannot start a new recording after stopping the first. And sometimes I observe hangs of the computer some time after stopping the recording. That might be some unwanted side-effect of my patch...

>>

> 

> Yes, It shows transfer events for the TRBs before the link TRB.

> This workaround is however a can of worms as many TD's consist of several TRBs, and they only trigger events for the last TRB.

> 

> How about a different approach?

> If the issue is only with transfers starting on the last TRB before the link TRB, we could turn that TRB to a no-op.

> Does something like the code below help?

> 

> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c

> index 6cdea0d00d19..0ffda8127640 100644

> --- a/drivers/usb/host/xhci-ring.c

> +++ b/drivers/usb/host/xhci-ring.c

> @@ -3181,6 +3181,12 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,

>   		}

>   	}

>   

> +	if (ep_ring != xhci->cmd_ring &&

> +	    !trb_is_link(ep_ring->enqueue) &&

> +	    trb_is_link(ep_ring->enqueue + 1))

> +		queue_trb(xhci, ep_ring, 0, 0, 0, 0,

> +			  TRB_TYPE(TRB_TR_NOOP) | ep_ring->cycle_state);

> +

>   	while (trb_is_link(ep_ring->enqueue)) {

>   		/* If we're not dealing with 0.95 hardware or isoc rings

>   		 * on AMD 0.96 host, clear the chain bit.

>   


Your patch seems to work. I can record video with this and it seems 
stable so far.

But there is still something off (as with my patch): If I stop the video 
recording and try to record again, the camera does not give me any 
frames. Maybe this is an unrelated issue but it works fine on the two 
other host controllers that I tested.

If you are interested you can find a trace here:
https://stuff.salscheider.org/dmesg_second
https://stuff.salscheider.org/trace_second

In this trace I recorded a few seconds of video with ffmpeg, killed it 
(at second 108) and restarted it (at second 116). Can you see anything 
suspicious in the trace?

Thanks a lot
- Ole

Ole Salscheider May 5, 2021, 7:56 a.m. UTC | #6

Hi Mathias,

...

>> How about a different approach?

>> If the issue is only with transfers starting on the last TRB before 

>> the link TRB, we could turn that TRB to a no-op.

>> Does something like the code below help?

>>

>> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c

>> index 6cdea0d00d19..0ffda8127640 100644

>> --- a/drivers/usb/host/xhci-ring.c

>> +++ b/drivers/usb/host/xhci-ring.c

>> @@ -3181,6 +3181,12 @@ static int prepare_ring(struct xhci_hcd *xhci, 

>> struct xhci_ring *ep_ring,

>>           }

>>       }

>> +    if (ep_ring != xhci->cmd_ring &&

>> +        !trb_is_link(ep_ring->enqueue) &&

>> +        trb_is_link(ep_ring->enqueue + 1))

>> +        queue_trb(xhci, ep_ring, 0, 0, 0, 0,

>> +              TRB_TYPE(TRB_TR_NOOP) | ep_ring->cycle_state);

>> +

>>       while (trb_is_link(ep_ring->enqueue)) {

>>           /* If we're not dealing with 0.95 hardware or isoc rings

>>            * on AMD 0.96 host, clear the chain bit.

> 

> Your patch seems to work. I can record video with this and it seems 

> stable so far.

> 

> But there is still something off (as with my patch): If I stop the video 

> recording and try to record again, the camera does not give me any 

> frames. Maybe this is an unrelated issue but it works fine on the two 

> other host controllers that I tested.

> 

> If you are interested you can find a trace here:

> https://stuff.salscheider.org/dmesg_second

> https://stuff.salscheider.org/trace_second

> 

> In this trace I recorded a few seconds of video with ffmpeg, killed it 

> (at second 108) and restarted it (at second 116). Can you see anything 

> suspicious in the trace?


I guess this second issue is unrelated. The cameras have worked stable 
so far with your patch. It might be good to include this workaround in 
mainline. Will you take care of it or should I send something to the list?

Thanks and best regards,
Ole

Mathias Nyman May 6, 2021, 9:08 a.m. UTC | #7

On 5.5.2021 10.56, Ole Salscheider wrote:
> Hi Mathias,

> 

> ...

> 

>>> How about a different approach?

>>> If the issue is only with transfers starting on the last TRB before the link TRB, we could turn that TRB to a no-op.

>>> Does something like the code below help?

>>>

>>> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c

>>> index 6cdea0d00d19..0ffda8127640 100644

>>> --- a/drivers/usb/host/xhci-ring.c

>>> +++ b/drivers/usb/host/xhci-ring.c

>>> @@ -3181,6 +3181,12 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,

>>>           }

>>>       }

>>> +    if (ep_ring != xhci->cmd_ring &&

>>> +        !trb_is_link(ep_ring->enqueue) &&

>>> +        trb_is_link(ep_ring->enqueue + 1))

>>> +        queue_trb(xhci, ep_ring, 0, 0, 0, 0,

>>> +              TRB_TYPE(TRB_TR_NOOP) | ep_ring->cycle_state);

>>> +

>>>       while (trb_is_link(ep_ring->enqueue)) {

>>>           /* If we're not dealing with 0.95 hardware or isoc rings

>>>            * on AMD 0.96 host, clear the chain bit.

>>

>> Your patch seems to work. I can record video with this and it seems stable so far.

>>

>> But there is still something off (as with my patch): If I stop the video recording and try to record again, the camera does not give me any frames. Maybe this is an unrelated issue but it works fine on the two other host controllers that I tested.

>>

>> If you are interested you can find a trace here:

>> https://stuff.salscheider.org/dmesg_second

>> https://stuff.salscheider.org/trace_second

>>

>> In this trace I recorded a few seconds of video with ffmpeg, killed it (at second 108) and restarted it (at second 116). Can you see anything suspicious in the trace?

> 

> I guess this second issue is unrelated. The cameras have worked stable so far with your patch. It might be good to include this workaround in mainline. Will you take care of it or should I send something to the list?

>


This is still not a very nice solution. We have no clue about the actual rootcause.

I remember now there was a similar issue with an earlier ASMedia host some years ago.
This was fixed by modifying some internal flowcontol parameters of the host in:

9da5a1092b13 xhci: Bad Ethernet performance plugged in ASM1042A host

Not sure if  Jiahau Chang (cc) works on this anymore, but maybe he knows who to contact.
Also adding Forest Crossman who has committed ASMediad fixes lately

Any clue about the rootcause?
thread:
https://lore.kernel.org/linux-usb/20210416093729.41865-1-ole@salscheider.org

Thanks
-Mathias

Forest Crossman May 9, 2021, 12:01 a.m. UTC | #8

On Thu, May 6, 2021 at 4:06 AM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>

> On 5.5.2021 10.56, Ole Salscheider wrote:

> > Hi Mathias,

> >

> > ...

> >

> >>> How about a different approach?

> >>> If the issue is only with transfers starting on the last TRB before the link TRB, we could turn that TRB to a no-op.

> >>> Does something like the code below help?

> >>>

> >>> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c

> >>> index 6cdea0d00d19..0ffda8127640 100644

> >>> --- a/drivers/usb/host/xhci-ring.c

> >>> +++ b/drivers/usb/host/xhci-ring.c

> >>> @@ -3181,6 +3181,12 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,

> >>>           }

> >>>       }

> >>> +    if (ep_ring != xhci->cmd_ring &&

> >>> +        !trb_is_link(ep_ring->enqueue) &&

> >>> +        trb_is_link(ep_ring->enqueue + 1))

> >>> +        queue_trb(xhci, ep_ring, 0, 0, 0, 0,

> >>> +              TRB_TYPE(TRB_TR_NOOP) | ep_ring->cycle_state);

> >>> +

> >>>       while (trb_is_link(ep_ring->enqueue)) {

> >>>           /* If we're not dealing with 0.95 hardware or isoc rings

> >>>            * on AMD 0.96 host, clear the chain bit.

> >>

> >> Your patch seems to work. I can record video with this and it seems stable so far.

> >>

> >> But there is still something off (as with my patch): If I stop the video recording and try to record again, the camera does not give me any frames. Maybe this is an unrelated issue but it works fine on the two other host controllers that I tested.

> >>

> >> If you are interested you can find a trace here:

> >> https://stuff.salscheider.org/dmesg_second

> >> https://stuff.salscheider.org/trace_second

> >>

> >> In this trace I recorded a few seconds of video with ffmpeg, killed it (at second 108) and restarted it (at second 116). Can you see anything suspicious in the trace?

> >

> > I guess this second issue is unrelated. The cameras have worked stable so far with your patch. It might be good to include this workaround in mainline. Will you take care of it or should I send something to the list?

> >

>

> This is still not a very nice solution. We have no clue about the actual rootcause.

>

> I remember now there was a similar issue with an earlier ASMedia host some years ago.

> This was fixed by modifying some internal flowcontol parameters of the host in:

>

> 9da5a1092b13 xhci: Bad Ethernet performance plugged in ASM1042A host

>

> Not sure if  Jiahau Chang (cc) works on this anymore, but maybe he knows who to contact.

> Also adding Forest Crossman who has committed ASMediad fixes lately

>

> Any clue about the rootcause?

> thread:

> https://lore.kernel.org/linux-usb/20210416093729.41865-1-ole@salscheider.org


Unfortunately, I don't know what could be causing this. The only thing
I would suggest is to see if this problem happens (without the patch)
while the USB device is connected directly to a port on the ASMedia
host controller, with no other hubs or devices connected to that
controller. The only problem I've been seeing with my various ASMedia
cards is when I try to do a lot of bulk reads from multiple devices
simultaneously (e.g., when dd-ing from multiple hard drives to
/dev/null). In those cases, the controller eventually triggers an
IOMMU page access violation, which causes the kernel to reset the PCIe
endpoint. So if the camera works fine when it's the only device
connected to the host controller (without any patches applied), that
might indicate that this is the same issue. But that's mostly a wild
guess--I don't know enough of the USB or xHCI standards to really
understand what's going on.

Best of luck resolving this issue,

Forest

Ole Salscheider May 9, 2021, 7:31 a.m. UTC | #9

On 09.05.21 02:01, Forest Crossman wrote:
> On Thu, May 6, 2021 at 4:06 AM Mathias Nyman

> <mathias.nyman@linux.intel.com> wrote:

>>

>> On 5.5.2021 10.56, Ole Salscheider wrote:

>>> Hi Mathias,

>>>

>>> ...

>>>

>>>>> How about a different approach?

>>>>> If the issue is only with transfers starting on the last TRB before the link TRB, we could turn that TRB to a no-op.

>>>>> Does something like the code below help?

>>>>>

>>>>> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c

>>>>> index 6cdea0d00d19..0ffda8127640 100644

>>>>> --- a/drivers/usb/host/xhci-ring.c

>>>>> +++ b/drivers/usb/host/xhci-ring.c

>>>>> @@ -3181,6 +3181,12 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,

>>>>>            }

>>>>>        }

>>>>> +    if (ep_ring != xhci->cmd_ring &&

>>>>> +        !trb_is_link(ep_ring->enqueue) &&

>>>>> +        trb_is_link(ep_ring->enqueue + 1))

>>>>> +        queue_trb(xhci, ep_ring, 0, 0, 0, 0,

>>>>> +              TRB_TYPE(TRB_TR_NOOP) | ep_ring->cycle_state);

>>>>> +

>>>>>        while (trb_is_link(ep_ring->enqueue)) {

>>>>>            /* If we're not dealing with 0.95 hardware or isoc rings

>>>>>             * on AMD 0.96 host, clear the chain bit.

>>>>

>>>> Your patch seems to work. I can record video with this and it seems stable so far.

>>>>

>>>> But there is still something off (as with my patch): If I stop the video recording and try to record again, the camera does not give me any frames. Maybe this is an unrelated issue but it works fine on the two other host controllers that I tested.

>>>>

>>>> If you are interested you can find a trace here:

>>>> https://stuff.salscheider.org/dmesg_second

>>>> https://stuff.salscheider.org/trace_second

>>>>

>>>> In this trace I recorded a few seconds of video with ffmpeg, killed it (at second 108) and restarted it (at second 116). Can you see anything suspicious in the trace?

>>>

>>> I guess this second issue is unrelated. The cameras have worked stable so far with your patch. It might be good to include this workaround in mainline. Will you take care of it or should I send something to the list?

>>>

>>

>> This is still not a very nice solution. We have no clue about the actual rootcause.

>>

>> I remember now there was a similar issue with an earlier ASMedia host some years ago.

>> This was fixed by modifying some internal flowcontol parameters of the host in:

>>

>> 9da5a1092b13 xhci: Bad Ethernet performance plugged in ASM1042A host

>>

>> Not sure if  Jiahau Chang (cc) works on this anymore, but maybe he knows who to contact.

>> Also adding Forest Crossman who has committed ASMediad fixes lately

>>

>> Any clue about the rootcause?

>> thread:

>> https://lore.kernel.org/linux-usb/20210416093729.41865-1-ole@salscheider.org

> 

> Unfortunately, I don't know what could be causing this. The only thing

> I would suggest is to see if this problem happens (without the patch)

> while the USB device is connected directly to a port on the ASMedia

> host controller, with no other hubs or devices connected to that

> controller. The only problem I've been seeing with my various ASMedia

> cards is when I try to do a lot of bulk reads from multiple devices

> simultaneously (e.g., when dd-ing from multiple hard drives to

> /dev/null). In those cases, the controller eventually triggers an

> IOMMU page access violation, which causes the kernel to reset the PCIe

> endpoint. So if the camera works fine when it's the only device

> connected to the host controller (without any patches applied), that

> might indicate that this is the same issue. But that's mostly a wild

> guess--I don't know enough of the USB or xHCI standards to really

> understand what's going on.


The problem occurs here also if only one camera and no other device are 
connected to the ASMedia host controller.

> Best of luck resolving this issue,

> 

> Forest

>

[RFC] xhci: Add Link TRB sync quirk for ASM3142

Commit Message

Comments

Patch