diff mbox series

usb: dwc3: Add dwc3 lock for blocking interrupt storming

Message ID 1646630679-121585-1-git-send-email-jh0801.jung@samsung.com
State New
Headers show
Series usb: dwc3: Add dwc3 lock for blocking interrupt storming | expand

Commit Message

정재훈 March 7, 2022, 5:24 a.m. UTC
Interrupt Storming occurred with a very low probability of occurrence.
The occurrence of the problem is estimated to be caused by a race condition
between the top half and bottom half of the interrupt service routine.
It was confirmed that variables have values that cannot be held
when ISR occurs through normal H / W irq.
=====================================================================
(struct dwc3_event_buffer *) ev_buf = 0xFFFFFF88DE6A0380 (
	(void *) buf = 0xFFFFFFC01594E000,
	(void *) cache = 0xFFFFFF88DDC14080,
	(unsigned int) length = 4096,
	(unsigned int) lpos = 0,
	(unsigned int) count = 0, <<
	(unsigned int) flags = 1, <<
=====================================================================
"evt->count=0" and "evt->flags=DWC3_EVENT_PENDING" cannot be set
at the same time.

We estimate that a race condition occurred
between dwc3_interrupt() and dwc3_process_event_buf()
called by dwc3_gadget_process_pending_events().
So I try to block the race condition through spin_lock.

Signed-off-by: JaeHun Jung <jh0801.jung@samsung.com>
---
 drivers/usb/dwc3/gadget.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Thinh Nguyen March 11, 2022, 1:56 a.m. UTC | #1
정재훈 wrote:
> Hi.
> 
>> -----Original Message-----
>> From: Thinh Nguyen [mailto:Thinh.Nguyen@synopsys.com]
>> Sent: Thursday, March 10, 2022 11:14 AM
>> To: JaeHun Jung; Felipe Balbi; Greg Kroah-Hartman
>> Cc: open list:USB XHCI DRIVER; open list; Seungchull Suh; Daehwan Jung
>> Subject: Re: [PATCH] usb: dwc3: Add dwc3 lock for blocking interrupt
>> storming
>>
>> Hi,
>>
>> JaeHun Jung wrote:
>>> Interrupt Storming occurred with a very low probability of occurrence.
>>> The occurrence of the problem is estimated to be caused by a race
>>> condition between the top half and bottom half of the interrupt service
>> routine.
>>> It was confirmed that variables have values that cannot be held when
>>> ISR occurs through normal H / W irq.
>>> =====================================================================
>>> (struct dwc3_event_buffer *) ev_buf = 0xFFFFFF88DE6A0380 (
>>> 	(void *) buf = 0xFFFFFFC01594E000,
>>> 	(void *) cache = 0xFFFFFF88DDC14080,
>>> 	(unsigned int) length = 4096,
>>> 	(unsigned int) lpos = 0,
>>> 	(unsigned int) count = 0, <<
>>> 	(unsigned int) flags = 1, <<
>>> =====================================================================
>>> "evt->count=0" and "evt->flags=DWC3_EVENT_PENDING" cannot be set at
>>> the same time.
>>>
>>> We estimate that a race condition occurred between dwc3_interrupt()
>>> and dwc3_process_event_buf() called by
>>> dwc3_gadget_process_pending_events().
>>> So I try to block the race condition through spin_lock.
>>
>> This looks like it needs a memory barrier. Would this work for you?
> Maybe it could be. But "evt->count = 0;" is updated on dwc3_process_event_buf().
> So, I think spin_lock is more clear routine for this issue.
> 

Not really. If problem is due to the evt->flags not updated in time,
then the solution should be using the memory barrier. The spin_lock
would obfuscate the issue. And we should avoid using spin_lock in the
top-half.

BR,
Thinh

>>
>> diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index
>> c02e239978e0..a96c344b9f17 100644
>> --- a/drivers/usb/dwc3/gadget.c
>> +++ b/drivers/usb/dwc3/gadget.c
>> @@ -5340,6 +5340,9 @@ static irqreturn_t dwc3_check_event_buf(struct
>> dwc3_event_buffer *evt)
>>                 return IRQ_HANDLED;
>>         }
>>
>> +       /* Make sure the event flags is updated */
>> +       wmb();
>> +
>>         /*
>>          * With PCIe legacy interrupt, test shows that top-half irq handler
>> can
>>          * be called again after HW interrupt deassertion. Check if bottom-
>> half
>>
>>
>> Thanks,
>> Thinh
>
Thinh Nguyen March 11, 2022, 3:55 a.m. UTC | #2
Thinh Nguyen wrote:
> 정재훈 wrote:
>>> -----Original Message-----
>>> From: Thinh Nguyen [mailto:Thinh.Nguyen@synopsys.com]
>>> Sent: Friday, March 11, 2022 10:57 AM
>>> To: 정재훈; Thinh Nguyen; 'Felipe Balbi'; 'Greg Kroah-Hartman'
>>> Cc: 'open list:USB XHCI DRIVER'; 'open list'; 'Seungchull Suh'; 'Daehwan
>>> Jung'; cpgs@samsung.com; cpgsproxy5@samsung.com
>>> Subject: Re: [PATCH] usb: dwc3: Add dwc3 lock for blocking interrupt
>>> storming
>>>
>>> 정재훈 wrote:
>>>> Hi.
>>>>
>>>>> -----Original Message-----
>>>>> From: Thinh Nguyen [mailto:Thinh.Nguyen@synopsys.com]
>>>>> Sent: Thursday, March 10, 2022 11:14 AM
>>>>> To: JaeHun Jung; Felipe Balbi; Greg Kroah-Hartman
>>>>> Cc: open list:USB XHCI DRIVER; open list; Seungchull Suh; Daehwan
>>>>> Jung
>>>>> Subject: Re: [PATCH] usb: dwc3: Add dwc3 lock for blocking interrupt
>>>>> storming
>>>>>
>>>>> Hi,
>>>>>
>>>>> JaeHun Jung wrote:
>>>>>> Interrupt Storming occurred with a very low probability of occurrence.
>>>>>> The occurrence of the problem is estimated to be caused by a race
>>>>>> condition between the top half and bottom half of the interrupt
>>>>>> service
>>>>> routine.
>>>>>> It was confirmed that variables have values that cannot be held when
>>>>>> ISR occurs through normal H / W irq.
>>>>>> ====================================================================
>>>>>> = (struct dwc3_event_buffer *) ev_buf = 0xFFFFFF88DE6A0380 (
>>>>>> 	(void *) buf = 0xFFFFFFC01594E000,
>>>>>> 	(void *) cache = 0xFFFFFF88DDC14080,
>>>>>> 	(unsigned int) length = 4096,
>>>>>> 	(unsigned int) lpos = 0,
>>>>>> 	(unsigned int) count = 0, <<
>>>>>> 	(unsigned int) flags = 1, <<
>>>>>> ====================================================================
>>>>>> = "evt->count=0" and "evt->flags=DWC3_EVENT_PENDING" cannot be set
>>>>>> at the same time.
>>>>>>
>>>>>> We estimate that a race condition occurred between dwc3_interrupt()
>>>>>> and dwc3_process_event_buf() called by
>>>>>> dwc3_gadget_process_pending_events().
>>>>>> So I try to block the race condition through spin_lock.
>>>>>
>>>>> This looks like it needs a memory barrier. Would this work for you?
>>>> Maybe it could be. But "evt->count = 0;" is updated on
>>> dwc3_process_event_buf().
>>>> So, I think spin_lock is more clear routine for this issue.
>>>>
>>>
>>> Not really. If problem is due to the evt->flags not updated in time, then
>>> the solution should be using the memory barrier. The spin_lock would
>>> obfuscate the issue. And we should avoid using spin_lock in the top-half.
>>
>> This issue was occurred by watchdog. The interrupt occurred in units of 4 to 5us and cannot be released until the bottom is executed.
>> If it is a problem with the memory barrier, the value should be updated after a few clocks and the TOP should run normally. Isn't it?
> 
> Can you guarantee that a value is stored after X amount of time, every time?
> 
>> And Could you explain me why we should avoid using spin_lock in the top-half.
>>
> 
> The top-half and bottom-half are serialized. While the bottom-half
> handler is running, the interrupt should be masked. If the top-half got
> called in the middle of the bottom-half handler, something else is
> wrong. There should not be a race that requires a spin_lock for this
> particular critical section.
> 
> The problem you're seeing is pointing toward a memory barrier issue.
> 
> Also you noted that there's an "interrupt storm", which doesn't indicate
> to me that it's due to PCIe legacy interrupt de-assertion delay response
> either.
> 
> Can you test it out and we can take a look further?
>
We want to avoid spin_lock because the top-half shouldn't stall for too
long, affecting performance. This can happen if some async call from the
upperlayer driver's holding the lock.

Thinh
Thinh Nguyen March 14, 2022, 11:18 p.m. UTC | #3
정재훈 wrote:
> 
> 
>> -----Original Message-----
>> From: Thinh Nguyen [mailto:Thinh.Nguyen@synopsys.com]
>> Sent: Friday, March 11, 2022 12:55 PM
>> To: 정재훈; 'Felipe Balbi'; 'Greg Kroah-Hartman'
>> Cc: 'open list:USB XHCI DRIVER'; 'open list'; 'Seungchull Suh'; 'Daehwan
>> Jung'; cpgs@samsung.com; cpgsproxy5@samsung.com
>> Subject: Re: [PATCH] usb: dwc3: Add dwc3 lock for blocking interrupt
>> storming
>>
>> Thinh Nguyen wrote:
>>> 정재훈 wrote:
>>>>> -----Original Message-----
>>>>> From: Thinh Nguyen [mailto:Thinh.Nguyen@synopsys.com]
>>>>> Sent: Friday, March 11, 2022 10:57 AM
>>>>> To: 정재훈; Thinh Nguyen; 'Felipe Balbi'; 'Greg Kroah-Hartman'
>>>>> Cc: 'open list:USB XHCI DRIVER'; 'open list'; 'Seungchull Suh';
>>>>> 'Daehwan Jung'; cpgs@samsung.com; cpgsproxy5@samsung.com
>>>>> Subject: Re: [PATCH] usb: dwc3: Add dwc3 lock for blocking interrupt
>>>>> storming
>>>>>
>>>>> 정재훈 wrote:
>>>>>> Hi.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Thinh Nguyen [mailto:Thinh.Nguyen@synopsys.com]
>>>>>>> Sent: Thursday, March 10, 2022 11:14 AM
>>>>>>> To: JaeHun Jung; Felipe Balbi; Greg Kroah-Hartman
>>>>>>> Cc: open list:USB XHCI DRIVER; open list; Seungchull Suh; Daehwan
>>>>>>> Jung
>>>>>>> Subject: Re: [PATCH] usb: dwc3: Add dwc3 lock for blocking
>>>>>>> interrupt storming
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> JaeHun Jung wrote:
>>>>>>>> Interrupt Storming occurred with a very low probability of
>> occurrence.
>>>>>>>> The occurrence of the problem is estimated to be caused by a race
>>>>>>>> condition between the top half and bottom half of the interrupt
>>>>>>>> service
>>>>>>> routine.
>>>>>>>> It was confirmed that variables have values that cannot be held
>>>>>>>> when ISR occurs through normal H / W irq.
>>>>>>>> =================================================================
>>>>>>>> === = (struct dwc3_event_buffer *) ev_buf = 0xFFFFFF88DE6A0380 (
>>>>>>>> 	(void *) buf = 0xFFFFFFC01594E000,
>>>>>>>> 	(void *) cache = 0xFFFFFF88DDC14080,
>>>>>>>> 	(unsigned int) length = 4096,
>>>>>>>> 	(unsigned int) lpos = 0,
>>>>>>>> 	(unsigned int) count = 0, <<
>>>>>>>> 	(unsigned int) flags = 1, <<
>>>>>>>> =================================================================
>>>>>>>> === = "evt->count=0" and "evt->flags=DWC3_EVENT_PENDING" cannot
>>>>>>>> be set at the same time.
>>>>>>>>
>>>>>>>> We estimate that a race condition occurred between
>>>>>>>> dwc3_interrupt() and dwc3_process_event_buf() called by
>>>>>>>> dwc3_gadget_process_pending_events().
>>>>>>>> So I try to block the race condition through spin_lock.
>>>>>>>
>>>>>>> This looks like it needs a memory barrier. Would this work for you?
>>>>>> Maybe it could be. But "evt->count = 0;" is updated on
>>>>> dwc3_process_event_buf().
>>>>>> So, I think spin_lock is more clear routine for this issue.
>>>>>>
>>>>>
>>>>> Not really. If problem is due to the evt->flags not updated in time,
>>>>> then the solution should be using the memory barrier. The spin_lock
>>>>> would obfuscate the issue. And we should avoid using spin_lock in the
>> top-half.
>>>>
>>>> This issue was occurred by watchdog. The interrupt occurred in units of
>> 4 to 5us and cannot be released until the bottom is executed.
>>>> If it is a problem with the memory barrier, the value should be updated
>> after a few clocks and the TOP should run normally. Isn't it?
>>>
>>> Can you guarantee that a value is stored after X amount of time, every
>> time?
> Yes, I think it's guaranteed. The system was working with other cores for 20 seconds.
> 
>>>
>>>> And Could you explain me why we should avoid using spin_lock in the
>> top-half.
>>>>
>>>
>>> The top-half and bottom-half are serialized. While the bottom-half
>>> handler is running, the interrupt should be masked. If the top-half
>>> got called in the middle of the bottom-half handler, something else is
>>> wrong. There should not be a race that requires a spin_lock for this
>>> particular critical section.
> 
>>>
>>> The problem you're seeing is pointing toward a memory barrier issue.
>>>
>>> Also you noted that there's an "interrupt storm", which doesn't
>>> indicate to me that it's due to PCIe legacy interrupt de-assertion
>>> delay response either.
>>>
>>> Can you test it out and we can take a look further?
>>>
>> We want to avoid spin_lock because the top-half shouldn't stall for too
>> long, affecting performance. This can happen if some async call from the
>> upperlayer driver's holding the lock.
> I also do not think that the serialization of the top and bottom of the ISR has been broken.
> 
> I think that dwc3_interrupt() called by dwc3_gadget_process_pending_events() influenced the serialization with the bottom. At the time of the problem, 20 seconds ago, dwc3_runtime_resume()->dwc3_gadget_process_pending_events() was called, and the problem began at that time.

Ok. This doesn't sound right.

On suspend, we can just mask the interrupt, and unmask after resume.
There's no need to even track dwc->pending_events flag either. While the
event count from GEVNTCOUNT is non-zero, the controller will generate
interrupts if unmasked.

dwc3 is currently not handling hibernation, and we're not restoring
anything on resume.

> 
> I think so that it can make deadlock when using spin_lock in top, too.
> Thank you for your feedback. I'll consider another way.
> 

Thanks,
Thinh
diff mbox series

Patch

diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index a0c883f19a41..88f9ecf8cbfe 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -4172,6 +4172,7 @@  static irqreturn_t dwc3_thread_interrupt(int irq, void *_evt)
 static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
 {
 	struct dwc3 *dwc = evt->dwc;
+	unsigned long flags;
 	u32 amount;
 	u32 count;
 
@@ -4188,13 +4189,18 @@  static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
 	 * irq event handler completes before caching new event to prevent
 	 * losing events.
 	 */
-	if (evt->flags & DWC3_EVENT_PENDING)
+	spin_lock_irqsave(&dwc->lock, flags);
+	if (evt->flags & DWC3_EVENT_PENDING) {
+		spin_unlock_irqrestore(&dwc->lock, flags);
 		return IRQ_HANDLED;
+	}
 
 	count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
 	count &= DWC3_GEVNTCOUNT_MASK;
-	if (!count)
+	if (!count) {
+		spin_unlock_irqrestore(&dwc->lock, flags);
 		return IRQ_NONE;
+	}
 
 	evt->count = count;
 	evt->flags |= DWC3_EVENT_PENDING;
@@ -4210,6 +4216,7 @@  static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt)
 		memcpy(evt->cache, evt->buf, count - amount);
 
 	dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count);
+	spin_unlock_irqrestore(&dwc->lock, flags);
 
 	return IRQ_WAKE_THREAD;
 }