Message ID | 0d79eb1f34e409749a136173b68f365b545c4789.1482738764.git.baolin.wang@linaro.org |
---|---|
State | New |
Headers | show |
Hi, On 27 December 2016 at 10:39, Lu Baolu <baolu.lu@linux.intel.com> wrote: > Hi, > > On 12/26/2016 04:01 PM, Baolin Wang wrote: >> On some platfroms(like x86 platform), when one core is running the USB gadget >> irq thread handler by dwc3_thread_interrupt(), meanwhile another core also can >> respond other interrupts from dwc3 controller and modify the event buffer by >> dwc3_interrupt() function, that will cause getting the wrong event count in >> irq thread handler to make the USB function abnormal. >> >> We should add spin_lock/unlock() in dwc3_check_event_buf() to avoid this race. > > Why not spin_lock_irq ones? This lock seems to be used in both > normal and interrupt threads. Or, I missed anything? I assumed there are no nested interrupts, when one core is running at interrupt context, then it can not respond any other interrupts, which means we don't need to disable local IRQ now, right? -- Baolin.wang Best Regards
Hi, On 12/27/2016 10:58 AM, Baolin Wang wrote: > Hi, > > On 27 December 2016 at 10:39, Lu Baolu <baolu.lu@linux.intel.com> wrote: >> Hi, >> >> On 12/26/2016 04:01 PM, Baolin Wang wrote: >>> On some platfroms(like x86 platform), when one core is running the USB gadget >>> irq thread handler by dwc3_thread_interrupt(), meanwhile another core also can >>> respond other interrupts from dwc3 controller and modify the event buffer by >>> dwc3_interrupt() function, that will cause getting the wrong event count in >>> irq thread handler to make the USB function abnormal. >>> >>> We should add spin_lock/unlock() in dwc3_check_event_buf() to avoid this race. >> Why not spin_lock_irq ones? This lock seems to be used in both >> normal and interrupt threads. Or, I missed anything? > I assumed there are no nested interrupts, when one core is running at > interrupt context, then it can not respond any other interrupts, which > means we don't need to disable local IRQ now, right? > Fair enough. Thanks. Best regards, Lu Baolu
2016-12-27 12:05 GMT+01:00 Felipe Balbi <balbi@kernel.org>: > Hi, > > Lu Baolu <baolu.lu@linux.intel.com> writes: >> On 12/26/2016 04:01 PM, Baolin Wang wrote: >>> On some platfroms(like x86 platform), when one core is running the USB gadget >>> irq thread handler by dwc3_thread_interrupt(), meanwhile another core also can >>> respond other interrupts from dwc3 controller and modify the event buffer by >>> dwc3_interrupt() function, that will cause getting the wrong event count in >>> irq thread handler to make the USB function abnormal. >>> >>> We should add spin_lock/unlock() in dwc3_check_event_buf() to avoid this race. >> >> Why not spin_lock_irq ones? This lock seems to be used in both >> normal and interrupt threads. Or, I missed anything? > > this is top half handler. Interrupts are already disabled. > BTW, We don't use spin_lock in top half handler. Maybe we should/can switch all spin_lock_irqsave() to simple spin_lock() in the thread/callbacks? Or there is a reason to use irqsave() version? BR Janusz > -- > balbi -- Janusz Dziedzic
Hi, Janusz Dziedzic <janusz.dziedzic@gmail.com> writes: >>>> On some platfroms(like x86 platform), when one core is running the USB gadget >>>> irq thread handler by dwc3_thread_interrupt(), meanwhile another core also can >>>> respond other interrupts from dwc3 controller and modify the event buffer by >>>> dwc3_interrupt() function, that will cause getting the wrong event count in >>>> irq thread handler to make the USB function abnormal. >>>> >>>> We should add spin_lock/unlock() in dwc3_check_event_buf() to avoid this race. >>> >>> Why not spin_lock_irq ones? This lock seems to be used in both >>> normal and interrupt threads. Or, I missed anything? >> >> this is top half handler. Interrupts are already disabled. >> > BTW, > We don't use spin_lock in top half handler. > Maybe we should/can switch all spin_lock_irqsave() to simple > spin_lock() in the thread/callbacks? in theory, yes we've masked all interrupts from this controller for the duration of the thread handler. However this breaks networking gadgets. I can only guess network stack has a hard requirement to run with IRQs disabled. > Or there is a reason to use irqsave() version? see above :-) -- balbi
Hi, Baolin Wang <baolin.wang@linaro.org> writes: >>>>>>> On 27 December 2016 at 18:52, Janusz Dziedzic <janusz.dziedzic@gmail.com> wrote: >>>>>>>> 2016-12-26 9:01 GMT+01:00 Baolin Wang <baolin.wang@linaro.org>: >>>>>>>>> On some platfroms(like x86 platform), when one core is running the USB gadget >>>>>>>>> irq thread handler by dwc3_thread_interrupt(), meanwhile another core also can >>>>>>>>> respond other interrupts from dwc3 controller and modify the event buffer by >>>>>>>>> dwc3_interrupt() function, that will cause getting the wrong event count in >>>>>>>>> irq thread handler to make the USB function abnormal. >>>>>>>>> >>>>>>>>> We should add spin_lock/unlock() in dwc3_check_event_buf() to avoid this race. >>>>>>>>> >>>>>>>> Interesting, I always think we mask interrupt in dwc3_interrupt() by setting >>>>>>>> DWC3_GEVNTSIZ_INTMASK >>>>>>>> And unmask interrupt when we end dwc3_thread_interrupt(). >>>>>>>> >>>>>>>> So, we shouldn't get any IRQ from HW during dwc3_thread_interrupt(), >>>>>>>> or I miss something? >>>>>>>> Do you have some traces that indicate this masking will not work correctly? >>>>>>> >>>>>>> Yes, but we just masked the interrupts described in DEVTEN register, >>>>>>> and we did not mask all the interrupts, like the endpoint command >>>>>>> complete event, transfer complete event and so on, so we can still get >>>>>>> interrupts. >>>>>> >>>>>> not true, we masked interrupts for the entire event buffer: >>>>> >>>>> Yes, you are right and I missed that. I should reproduce this problem >>>>> and analyse the real reason. >>>>> >>>>>> >>>>>>> static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt) >>>>>>> { >>>>>>> struct dwc3 *dwc = evt->dwc; >>>>>>> u32 count; >>>>>>> u32 reg; >>>>>>> >>>>>>> if (pm_runtime_suspended(dwc->dev)) { >>>>>>> pm_runtime_get(dwc->dev); >>>>>>> disable_irq_nosync(dwc->irq_gadget); >>>>>>> dwc->pending_events = true; >>>>>>> return IRQ_HANDLED; >>>>>>> } >>>>>>> >>>>>>> count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0)); >>>>>>> count &= DWC3_GEVNTCOUNT_MASK; >>>>>>> if (!count) >>>>>>> return IRQ_NONE; >>>>>>> >>>>>>> evt->count = count; >>>>>>> evt->flags |= DWC3_EVENT_PENDING; >>>>>>> >>>>>>> /* Mask interrupt */ >>>>>>> reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0)); >>>>>>> reg |= DWC3_GEVNTSIZ_INTMASK; >>>>>> >>>>>> See here ?!? >>>>>> >>>>>>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), reg); >>>>>>> >>>>>>> return IRQ_WAKE_THREAD; >>>>>>> } >>>>>> >>>>>>>> BTW, what value you get when problem occured, 0xFFFC? >>>>>>> >>>>>>> Yes, something like this, the event count become huge. >>>>>> >>>> Probably you have little bit different code than current community >>>> version (depends how your PM works). >>>> >>>> This is possible when we write: >>>> dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0); >>>> And after that >>>> dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 4); >>>> >>>> After that we will get 0xFFFC (-4). >>>> >>>> Possible races: >>>> 1) dwc3_event_buffers_setup/dwc3_event_buffers_cleanup - write 0 >>>> 2) dwc3_thread - write 4 >>>> >>>> While [1] could be called in PM work or UM context (init in Android >>>> case) spin_lock_irqsave() will only disable local irqs and still we >>>> could get IRQ on different core, next update evt->count and run >>>> thread... >>> >>> Yeah, that's the possible races. >> >> and you have triggered this with mailine? How? We don't write to GEVNT* >> registers from PM code and we only allow runtime_suspend with cable >> dettached. > > Sorry for late reply since I am busy on other things. I just agreed > with the possible races pointed by Janusz. I need to look at if these > are happened on my platform and also I found some out of tree code > which will clean the GEVNTCOUNT register when stop the gadget. I will > check the mainline kernel and resend new patch if I make this problem > clearly. Anyway thanks for your help and suggestion. IOW, you sent me a patch to be integrated in the tree which everybody in the whole world uses and you didn't even test anything in that very tree? How am I supposed to trust you're sending me tested patches from now on? Clearly you have no empathy for those working countless hours to keep this stable and working. If you're ready to send me a completely untested patch and claim that it's fixing a race condition you have never seen for yourself, it becomes difficult to trust any patches you're sending me. You should know better. Your employer has people on staff who are able to clarify all these "details". You should never, ever send me patches like this again. Mark, himself, has a long track record of contributing to upstream development; maybe you should have a discussion with him offline about this. -- balbi
Hi, On 5 January 2017 at 17:26, Felipe Balbi <balbi@kernel.org> wrote: > > Hi, > > Baolin Wang <baolin.wang@linaro.org> writes: >>>>>>>> On 27 December 2016 at 18:52, Janusz Dziedzic <janusz.dziedzic@gmail.com> wrote: >>>>>>>>> 2016-12-26 9:01 GMT+01:00 Baolin Wang <baolin.wang@linaro.org>: >>>>>>>>>> On some platfroms(like x86 platform), when one core is running the USB gadget >>>>>>>>>> irq thread handler by dwc3_thread_interrupt(), meanwhile another core also can >>>>>>>>>> respond other interrupts from dwc3 controller and modify the event buffer by >>>>>>>>>> dwc3_interrupt() function, that will cause getting the wrong event count in >>>>>>>>>> irq thread handler to make the USB function abnormal. >>>>>>>>>> >>>>>>>>>> We should add spin_lock/unlock() in dwc3_check_event_buf() to avoid this race. >>>>>>>>>> >>>>>>>>> Interesting, I always think we mask interrupt in dwc3_interrupt() by setting >>>>>>>>> DWC3_GEVNTSIZ_INTMASK >>>>>>>>> And unmask interrupt when we end dwc3_thread_interrupt(). >>>>>>>>> >>>>>>>>> So, we shouldn't get any IRQ from HW during dwc3_thread_interrupt(), >>>>>>>>> or I miss something? >>>>>>>>> Do you have some traces that indicate this masking will not work correctly? >>>>>>>> >>>>>>>> Yes, but we just masked the interrupts described in DEVTEN register, >>>>>>>> and we did not mask all the interrupts, like the endpoint command >>>>>>>> complete event, transfer complete event and so on, so we can still get >>>>>>>> interrupts. >>>>>>> >>>>>>> not true, we masked interrupts for the entire event buffer: >>>>>> >>>>>> Yes, you are right and I missed that. I should reproduce this problem >>>>>> and analyse the real reason. >>>>>> >>>>>>> >>>>>>>> static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt) >>>>>>>> { >>>>>>>> struct dwc3 *dwc = evt->dwc; >>>>>>>> u32 count; >>>>>>>> u32 reg; >>>>>>>> >>>>>>>> if (pm_runtime_suspended(dwc->dev)) { >>>>>>>> pm_runtime_get(dwc->dev); >>>>>>>> disable_irq_nosync(dwc->irq_gadget); >>>>>>>> dwc->pending_events = true; >>>>>>>> return IRQ_HANDLED; >>>>>>>> } >>>>>>>> >>>>>>>> count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0)); >>>>>>>> count &= DWC3_GEVNTCOUNT_MASK; >>>>>>>> if (!count) >>>>>>>> return IRQ_NONE; >>>>>>>> >>>>>>>> evt->count = count; >>>>>>>> evt->flags |= DWC3_EVENT_PENDING; >>>>>>>> >>>>>>>> /* Mask interrupt */ >>>>>>>> reg = dwc3_readl(dwc->regs, DWC3_GEVNTSIZ(0)); >>>>>>>> reg |= DWC3_GEVNTSIZ_INTMASK; >>>>>>> >>>>>>> See here ?!? >>>>>>> >>>>>>>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), reg); >>>>>>>> >>>>>>>> return IRQ_WAKE_THREAD; >>>>>>>> } >>>>>>> >>>>>>>>> BTW, what value you get when problem occured, 0xFFFC? >>>>>>>> >>>>>>>> Yes, something like this, the event count become huge. >>>>>>> >>>>> Probably you have little bit different code than current community >>>>> version (depends how your PM works). >>>>> >>>>> This is possible when we write: >>>>> dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0); >>>>> And after that >>>>> dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 4); >>>>> >>>>> After that we will get 0xFFFC (-4). >>>>> >>>>> Possible races: >>>>> 1) dwc3_event_buffers_setup/dwc3_event_buffers_cleanup - write 0 >>>>> 2) dwc3_thread - write 4 >>>>> >>>>> While [1] could be called in PM work or UM context (init in Android >>>>> case) spin_lock_irqsave() will only disable local irqs and still we >>>>> could get IRQ on different core, next update evt->count and run >>>>> thread... >>>> >>>> Yeah, that's the possible races. >>> >>> and you have triggered this with mailine? How? We don't write to GEVNT* >>> registers from PM code and we only allow runtime_suspend with cable >>> dettached. >> >> Sorry for late reply since I am busy on other things. I just agreed >> with the possible races pointed by Janusz. I need to look at if these >> are happened on my platform and also I found some out of tree code >> which will clean the GEVNTCOUNT register when stop the gadget. I will >> check the mainline kernel and resend new patch if I make this problem >> clearly. Anyway thanks for your help and suggestion. > > IOW, you sent me a patch to be integrated in the tree which everybody in > the whole world uses and you didn't even test anything in that very > tree? How am I supposed to trust you're sending me tested patches from > now on? > > Clearly you have no empathy for those working countless hours to keep > this stable and working. If you're ready to send me a completely > untested patch and claim that it's fixing a race condition you have > never seen for yourself, it becomes difficult to trust any patches > you're sending me. I am sure I send you every patch was tested on my vendor platform and I saw the problem on my platform. But like my said I missed something that we have masked all interrupts in the dwc3 interrupt handler, so the real reason maybe caused by some out of tree code on my vendor platform which will clean the GEVNTCOUNT register when stop the gadget. Moreover I did not only do this one thing, and some other problem I also need time to test and investigate. So I think I need some time to make things clear, then I can send you one better patch with the correct explanation, am I wrong here? > > You should know better. Your employer has people on staff who are able > to clarify all these "details". You should never, ever send me patches > like this again. Mark, himself, has a long track record of contributing > to upstream development; maybe you should have a discussion with him > offline about this. > > -- > balbi -- Baolin.wang Best Regards
On 12/28/2016 5:29 PM, John Youn wrote: > > >> Janusz Dziedzic <janusz.dziedzic@gmail.com> writes: >>>>>> On some platfroms(like x86 platform), when one core is running the >> USB gadget >>>>>> irq thread handler by dwc3_thread_interrupt(), meanwhile another >> core also can >>>>>> respond other interrupts from dwc3 controller and modify the event >> buffer by >>>>>> dwc3_interrupt() function, that will cause getting the wrong event >> count in >>>>>> irq thread handler to make the USB function abnormal. >>>>>> >>>>>> We should add spin_lock/unlock() in dwc3_check_event_buf() to avoid >> this race. >>>>> >>>>> Why not spin_lock_irq ones? This lock seems to be used in both >>>>> normal and interrupt threads. Or, I missed anything? >>>> >>>> this is top half handler. Interrupts are already disabled. >>>> >>> BTW, >>> We don't use spin_lock in top half handler. >>> Maybe we should/can switch all spin_lock_irqsave() to simple >>> spin_lock() in the thread/callbacks? >> >> in theory, yes we've masked all interrupts from this controller for the >> duration of the thread handler. However this breaks networking >> gadgets. I can only guess network stack has a hard requirement to run >> with IRQs disabled. >> > > Hi, > > Is this version 3.00a of the core? > > That version has a STAR where the interrupts cannot be masked. That > results in similar symptoms to what you're seeing here. > Regards, > John > Didn't see any response to this. Just want to make sure this possibility is addressed as there is a workaround for it on mainline. See the following commits for details: d9fa4c63f766 ("usb: dwc3: core: add a event buffer cache") ebbb2d59398f ("usb: dwc3: gadget: use evt->cache for processing events") 65aca3205046 ("usb: dwc3: gadget: clear events in top-half handler") cf40b86b6ef6 ("usb: dwc3: Implement interrupt moderation") 28632b44d129 ("usb: dwc3: Workaround for irq mask issue") Regards, John
Hi John, On 6 January 2017 at 03:08, John Youn <John.Youn@synopsys.com> wrote: > On 12/28/2016 5:29 PM, John Youn wrote: >> >> >>> Janusz Dziedzic <janusz.dziedzic@gmail.com> writes: >>>>>>> On some platfroms(like x86 platform), when one core is running the >>> USB gadget >>>>>>> irq thread handler by dwc3_thread_interrupt(), meanwhile another >>> core also can >>>>>>> respond other interrupts from dwc3 controller and modify the event >>> buffer by >>>>>>> dwc3_interrupt() function, that will cause getting the wrong event >>> count in >>>>>>> irq thread handler to make the USB function abnormal. >>>>>>> >>>>>>> We should add spin_lock/unlock() in dwc3_check_event_buf() to avoid >>> this race. >>>>>> >>>>>> Why not spin_lock_irq ones? This lock seems to be used in both >>>>>> normal and interrupt threads. Or, I missed anything? >>>>> >>>>> this is top half handler. Interrupts are already disabled. >>>>> >>>> BTW, >>>> We don't use spin_lock in top half handler. >>>> Maybe we should/can switch all spin_lock_irqsave() to simple >>>> spin_lock() in the thread/callbacks? >>> >>> in theory, yes we've masked all interrupts from this controller for the >>> duration of the thread handler. However this breaks networking >>> gadgets. I can only guess network stack has a hard requirement to run >>> with IRQs disabled. >>> >> >> Hi, >> >> Is this version 3.00a of the core? >> >> That version has a STAR where the interrupts cannot be masked. That >> results in similar symptoms to what you're seeing here. Sorry for late reply. The version is 2.80a. >> Regards, >> John >> > > Didn't see any response to this. Just want to make sure this > possibility is addressed as there is a workaround for it on mainline. Thanks for reminding. > > See the following commits for details: > > d9fa4c63f766 ("usb: dwc3: core: add a event buffer cache") > ebbb2d59398f ("usb: dwc3: gadget: use evt->cache for processing events") > 65aca3205046 ("usb: dwc3: gadget: clear events in top-half handler") > cf40b86b6ef6 ("usb: dwc3: Implement interrupt moderation") > 28632b44d129 ("usb: dwc3: Workaround for irq mask issue") > -- Baolin.wang Best Regards
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c index 6785595..1a1e1f4 100644 --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -2894,10 +2894,13 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt) return IRQ_HANDLED; } + spin_lock(&dwc->lock); count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0)); count &= DWC3_GEVNTCOUNT_MASK; - if (!count) + if (!count) { + spin_unlock(&dwc->lock); return IRQ_NONE; + } evt->count = count; evt->flags |= DWC3_EVENT_PENDING; @@ -2914,6 +2917,7 @@ static irqreturn_t dwc3_check_event_buf(struct dwc3_event_buffer *evt) memcpy(evt->cache, evt->buf, count - amount); dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count); + spin_unlock(&dwc->lock); return IRQ_WAKE_THREAD; }
On some platfroms(like x86 platform), when one core is running the USB gadget irq thread handler by dwc3_thread_interrupt(), meanwhile another core also can respond other interrupts from dwc3 controller and modify the event buffer by dwc3_interrupt() function, that will cause getting the wrong event count in irq thread handler to make the USB function abnormal. We should add spin_lock/unlock() in dwc3_check_event_buf() to avoid this race. Signed-off-by: Baolin Wang <baolin.wang@linaro.org> --- drivers/usb/dwc3/gadget.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) -- 1.7.9.5