diff mbox series

xhci: print warning when HCE was set

Message ID 20220915011134.58400-1-liulongfang@huawei.com
State New
Series xhci: print warning when HCE was set

Commit Message

liulongfang Sept. 15, 2022, 1:11 a.m. UTC
When HCE (Host Controller Error) is set, the xHCI hardware controller
has encountered an error, but the current xhci driver does not log
this event.

Add HCE detection to the xhci interrupt handler so that a warning is
logged, which makes it easier to track the device state.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/usb/host/xhci-ring.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Mathias Nyman Sept. 22, 2022, 1:01 p.m. UTC | #1
Hi

On 15.9.2022 4.11, Longfang Liu wrote:
> When HCE(Host Controller Error) is set, it means that the xhci hardware
> controller has an error at this time, but the current xhci driver
> software does not log this event.
> 
> By adding an HCE event detection in the xhci interrupt processing
> interface, a warning log is output to the system, which is convenient
> for system device status tracking.
> 

xHC should cease all activity when it sets HCE, and is probably not
generating interrupts anymore.

Would probably be more useful to check for HCE at timeouts than in the
interrupt handler.

If this is something seen on actual hardware then it makes sense to add it.

Thanks
-Mathias
liulongfang Oct. 14, 2022, 3:12 a.m. UTC | #2
On 2022/9/26 15:58, Mathias Nyman wrote:
> On 24.9.2022 5.35, liulongfang wrote:
>> On 2022/9/22 21:01, Mathias Nyman Wrote:
>>> Hi
>>>
>>> On 15.9.2022 4.11, Longfang Liu wrote:
>>>> When HCE(Host Controller Error) is set, it means that the xhci hardware
>>>> controller has an error at this time, but the current xhci driver
>>>> software does not log this event.
>>>>
>>>> By adding an HCE event detection in the xhci interrupt processing
>>>> interface, a warning log is output to the system, which is convenient
>>>> for system device status tracking.
>>>>
>>>
>>> xHC should cease all activity when it sets HCE, and is probably not
>>> generating interrupts anymore.
>>>
>>> Would probably be more useful to check for HCE at timeouts than in the
>>> interrupt handler.
>>>
>>
>> Which function of the driver code is this timeout in?
> 
> xhci_handle_command_timeout() will usually trigger at some point,
> 

Because our hardware reports the HCE error through an interrupt, it is simpler
to check for it in xhci_irq() than in xhci_handle_command_timeout().

>>
>>> If this is something seen on actual hardware then it makes sense to add it.
>>>
>>
>> This HCE error is sure to report an interrupt on the chip we are using.
> 
> Ok, then makes sense to add this patch.
> 
> Thanks
> -Mathias
>
Thanks,
Longfang.
Mathias Nyman Oct. 14, 2022, 7:56 a.m. UTC | #3
On 14.10.2022 6.12, liulongfang wrote:
> On 2022/9/26 15:58, Mathias Nyman wrote:
>> On 24.9.2022 5.35, liulongfang wrote:
>>> On 2022/9/22 21:01, Mathias Nyman Wrote:
>>>> Hi
>>>>
>>>> On 15.9.2022 4.11, Longfang Liu wrote:
>>>>> When HCE(Host Controller Error) is set, it means that the xhci hardware
>>>>> controller has an error at this time, but the current xhci driver
>>>>> software does not log this event.
>>>>>
>>>>> By adding an HCE event detection in the xhci interrupt processing
>>>>> interface, a warning log is output to the system, which is convenient
>>>>> for system device status tracking.
>>>>>
>>>>
>>>> xHC should cease all activity when it sets HCE, and is probably not
>>>> generating interrupts anymore.
>>>>
>>>> Would probably be more useful to check for HCE at timeouts than in the
>>>> interrupt handler.
>>>>
>>>
>>> Which function of the driver code is this timeout in?
>>
>> xhci_handle_command_timeout() will usually trigger at some point,
>>
> 
> Because this HCE error is reported in the form of an interrupt signal, it is more
> concise to put it in xhci_irq() than in xhci_handle_command_timeout().
> 

The patch was added to the queue after you reported that your xHC hardware triggers interrupts when HCE is set.
I'll send it forward after 6.1-rc1.

The xHCI specification still indicates that HCE might not trigger interrupts:

Section 4.24.1 - Internal Errors
...
"Software should implement an algorithm for checking the HCE flag if the xHC is
not responding."

Thanks
-Mathias
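
The two detection points discussed in this thread can be compared with a small user-space model. This is only a sketch, not driver code: the function names and the result enum are hypothetical, and the only values taken from the real driver are the USBSTS bit positions. It illustrates that a check placed after the STS_EINT test, as in the patch, only sees HCE when the controller also raises an event interrupt, while a check in a timeout path reads the status register directly.

```c
#include <stdint.h>

/* USBSTS bit positions as defined by the xHCI specification
 * (the same values the kernel's xhci.h uses). */
#define STS_EINT (1u << 3)   /* Event Interrupt */
#define STS_HCE  (1u << 12)  /* Host Controller Error */

/* Hypothetical result codes, used by this sketch only. */
enum hce_result {
	HC_OK,
	HC_ERROR_SEEN_IN_IRQ,
	HC_ERROR_SEEN_ON_TIMEOUT
};

/* Interrupt path: follows the order of checks in the patched xhci_irq().
 * HCE is noticed here only if EINT is set as well. */
static enum hce_result model_irq_check(uint32_t status)
{
	if (!(status & STS_EINT))
		return HC_OK;	/* handler bails out before looking at HCE */
	if (status & STS_HCE)
		return HC_ERROR_SEEN_IN_IRQ;
	return HC_OK;
}

/* Timeout path: software polls the status value itself,
 * so no interrupt from the controller is required. */
static enum hce_result model_timeout_check(uint32_t status)
{
	return (status & STS_HCE) ? HC_ERROR_SEEN_ON_TIMEOUT : HC_OK;
}
```

With a status value of STS_HCE alone, model_irq_check() reports nothing while model_timeout_check() reports the error, which is the case the specification text quoted above warns about.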
liulongfang Dec. 9, 2022, 6:13 a.m. UTC | #4
On 2022/10/14 15:56, Mathias Nyman wrote:
> On 14.10.2022 6.12, liulongfang wrote:
>> On 2022/9/26 15:58, Mathias Nyman wrote:
>>> On 24.9.2022 5.35, liulongfang wrote:
>>>> On 2022/9/22 21:01, Mathias Nyman Wrote:
>>>>> Hi
>>>>>
>>>>> On 15.9.2022 4.11, Longfang Liu wrote:
>>>>>> When HCE(Host Controller Error) is set, it means that the xhci hardware
>>>>>> controller has an error at this time, but the current xhci driver
>>>>>> software does not log this event.
>>>>>>
>>>>>> By adding an HCE event detection in the xhci interrupt processing
>>>>>> interface, a warning log is output to the system, which is convenient
>>>>>> for system device status tracking.
>>>>>>
>>>>>
>>>>> xHC should cease all activity when it sets HCE, and is probably not
>>>>> generating interrupts anymore.
>>>>>
>>>>> Would probably be more useful to check for HCE at timeouts than in the
>>>>> interrupt handler.
>>>>>
>>>>
>>>> Which function of the driver code is this timeout in?
>>>
>>> xhci_handle_command_timeout() will usually trigger at some point,
>>>
>>
>> Because this HCE error is reported in the form of an interrupt signal, it is more
>> concise to put it in xhci_irq() than in xhci_handle_command_timeout().
>>
> 
> Patch was added to queue after you reported your xHC hardware triggers interrupts when HCE is set.
> I'll send it forward after 6.1-rc1
> 

In our test build, a log message was added to xhci_irq(). In the test case that triggers HCE,
the HCE interrupt is raised and recorded in the log:

{53}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
{53}[Hardware Error]: event severity: recoverable
{53}[Hardware Error]:  Error 0, type: recoverable
{53}[Hardware Error]:   section type: unknown, c8b328a8-9917-4af6-9a13-2e08ab2e7586
{53}[Hardware Error]:   section length: 0x48
{53}[Hardware Error]:   00000000: 0000186b 00000201 001a0001 00000000  k...............
{53}[Hardware Error]:   00000010: 00000000 00000000 00000000 00000028  ............(...
{53}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
{53}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
{53}[Hardware Error]:   00000040: 00000001 00000000                    ........
 xhci_hcd 0000:30:01.0: xHCI host not responding to stop endpoint command.
 xhci_hcd 0000:30:01.0: USBSTS: PCD HCE
 xhci_hcd 0000:30:01.0: xHCI host controller not responding, assume dead
 xhci_hcd 0000:30:01.0: HC died; cleaning up
 usb usb1-port1: couldn't allocate usb_device
rmmod xhci-pci
 xhci_hcd 0000:30:01.0: remove, state 4
 usb usb2: USB disconnect, device number 1
 xhci_hcd 0000:30:01.0: USB bus 2 deregistered
 xhci_hcd 0000:30:01.0: remove, state 1
 usb usb1: USB disconnect, device number 1
 xhci_hcd 0000:30:01.0: USB bus 1 deregistered
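
For readers unfamiliar with the "USBSTS: PCD HCE" line in the log above, the following sketch shows how a raw USBSTS value might be turned into such a list of flag names. The helper decode_usbsts() and its table are hypothetical and written for illustration; only the bit positions and flag abbreviations follow the xHCI specification. It is not the driver's own decoder.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One USBSTS flag: its bit mask and the abbreviation used in logs. */
struct sts_name {
	uint32_t bit;
	const char *name;
};

/* USBSTS flags per the xHCI specification, lowest bit first. */
static const struct sts_name usbsts_names[] = {
	{ 1u << 0,  "HCHalted" },
	{ 1u << 2,  "HSE" },
	{ 1u << 3,  "EINT" },
	{ 1u << 4,  "PCD" },
	{ 1u << 8,  "SSS" },
	{ 1u << 9,  "RSS" },
	{ 1u << 10, "SRE" },
	{ 1u << 11, "CNR" },
	{ 1u << 12, "HCE" },
};

/* Hypothetical helper: write the space-separated names of all flags
 * set in sts into buf (at most len bytes, always NUL-terminated). */
static void decode_usbsts(uint32_t sts, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';

	for (i = 0; i < sizeof(usbsts_names) / sizeof(usbsts_names[0]); i++) {
		int n;

		if (!(sts & usbsts_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", usbsts_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated, stop appending */
		used += (size_t)n;
	}
}
```

A status value with bits 4 and 12 set (0x1010) decodes to "PCD HCE", matching the log line: a port change was detected and the controller reported an internal error.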

Thanks,
Longfang.

> xHCI specification still indicate HCE might not trigger interrupts:
>  
> Section 4.24.1 -Internal Errors
> ...
> "Software should implement an algorithm for checking the HCE flag if the xHC is
> not responding."
> 
> Thanks
> -Mathias

Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index ad81e9a508b1..f6af479188e8 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -3031,6 +3031,11 @@  irqreturn_t xhci_irq(struct usb_hcd *hcd)
 	if (!(status & STS_EINT))
 		goto out;
 
+	if (status & STS_HCE) {
+		xhci_warn(xhci, "WARNING: Host Controller Error\n");
+		goto out;
+	}
+
 	if (status & STS_FATAL) {
 		xhci_warn(xhci, "WARNING: Host System Error\n");
 		xhci_halt(xhci);