g_mass_storage bug ?

Message ID 20140924191814.GM17997@saruman
State New
Headers show

Commit Message

Felipe Balbi Sept. 24, 2014, 7:18 p.m.
Hi,

On Wed, Sep 24, 2014 at 01:08:15PM -0500, Felipe Balbi wrote:
> On Wed, Sep 24, 2014 at 01:53:31PM -0400, Alan Stern wrote:
> > On Wed, 24 Sep 2014, Felipe Balbi wrote:
> > 
> > > > I'll capture usbmon and send here shortly.
> > > 
> > > here it is... Interesting part starts at line 73 (114 on this email)
> > > where the data transport received EPIPE (due to Stall). This time
> > > however, I was eventually able to talk to the device and managed to
> > > issue quite a few writes to it.
> > 
> > Here's where the unexpected stuff begins:
> > 
> > > ed2541c0 1237463240 S Bo:003:01 -115 31 = 55534243 06000000 c0000000 8000061a 003f00c0 00000000 00000000 000000
> > > ed2541c0 1237463431 C Bo:003:01 0 31 >
> > > ec1a8540 1237463873 S Bi:003:01 -115 192 <
> > > ec1a8540 1237464053 C Bi:003:01 -32 0
> > > ed2541c0 1237464158 S Co:003:00 s 02 01 0000 0081 0000 0
> > > ed2541c0 1237464359 C Co:003:00 0 0
> > > ed2541c0 1237468607 S Bi:003:01 -115 13 <
> > > ed2541c0 1237468802 C Bi:003:01 -75 0
> > 
> > This is the first MODE SENSE command.  The gadget should send as much
> > data as it can before halting the bulk-IN endpoint.  Instead, the
> > endpoint was halted first.
> > 
> > Then, after the host cleared the halt, the gadget apparently sent the
> > data that _should_ have been sent previously.  The host was expecting
> > to receive the CSW at this point, so there was an overflow error.
> > That's what caused the host to perform a reset.
> > 
> > Evidently this UDC implements the set_halt method incorrectly.  
> > According to the kerneldoc for usb_ep_set_halt:
> > 
> >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> >  * transfer requests are still queued, or if the controller hardware
> >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> 
> damn old bugs :-) I'll fix that up and Cc stable.

alright fixed. Below you can see a combined diff which I'll still split
into patches so they can be applied properly.

Comments

Felipe Balbi Sept. 24, 2014, 7:27 p.m. | #1
Hi,

On Wed, Sep 24, 2014 at 02:18:14PM -0500, Felipe Balbi wrote:
> On Wed, Sep 24, 2014 at 01:08:15PM -0500, Felipe Balbi wrote:
> > On Wed, Sep 24, 2014 at 01:53:31PM -0400, Alan Stern wrote:
> > > On Wed, 24 Sep 2014, Felipe Balbi wrote:
> > > 
> > > > > I'll capture usbmon and send here shortly.
> > > > 
> > > > here it is... Interesting part starts at line 73 (114 on this email)
> > > > where the data transport received EPIPE (due to Stall). This time
> > > > however, I was eventually able to talk to the device and managed to
> > > > issue quite a few writes to it.
> > > 
> > > Here's where the unexpected stuff begins:
> > > 
> > > > ed2541c0 1237463240 S Bo:003:01 -115 31 = 55534243 06000000 c0000000 8000061a 003f00c0 00000000 00000000 000000
> > > > ed2541c0 1237463431 C Bo:003:01 0 31 >
> > > > ec1a8540 1237463873 S Bi:003:01 -115 192 <
> > > > ec1a8540 1237464053 C Bi:003:01 -32 0
> > > > ed2541c0 1237464158 S Co:003:00 s 02 01 0000 0081 0000 0
> > > > ed2541c0 1237464359 C Co:003:00 0 0
> > > > ed2541c0 1237468607 S Bi:003:01 -115 13 <
> > > > ed2541c0 1237468802 C Bi:003:01 -75 0
> > > 
> > > This is the first MODE SENSE command.  The gadget should send as much
> > > data as it can before halting the bulk-IN endpoint.  Instead, the
> > > endpoint was halted first.
> > > 
> > > Then, after the host cleared the halt, the gadget apparently sent the
> > > data that _should_ have been sent previously.  The host was expecting
> > > to receive the CSW at this point, so there was an overflow error.
> > > That's what caused the host to perform a reset.
> > > 
> > > Evidently this UDC implements the set_halt method incorrectly.  
> > > According to the kerneldoc for usb_ep_set_halt:
> > > 
> > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > >  * transfer requests are still queued, or if the controller hardware
> > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > 
> > damn old bugs :-) I'll fix that up and Cc stable.
> 
> alright fixed. Below you can see a combined diff which I'll still split
> into patches so they can be applied properly.

OTOH, there's really no way to split it. It's all needed to fix a single
bug. Do you want me to add Reported-by: Alan Stern ?
Alan Stern Sept. 24, 2014, 7:31 p.m. | #2
On Wed, 24 Sep 2014, Felipe Balbi wrote:

> > > According to the kerneldoc for usb_ep_set_halt:
> > > 
> > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > >  * transfer requests are still queued, or if the controller hardware
> > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > 
> > damn old bugs :-) I'll fix that up and Cc stable.
> 
> alright fixed. Below you can see a combined diff which I'll still split
> into patches so they can be applied properly.

And this eliminates the problems you saw with g_mass_storage?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Felipe Balbi Sept. 24, 2014, 7:40 p.m. | #3
On Wed, Sep 24, 2014 at 03:31:13PM -0400, Alan Stern wrote:
> On Wed, 24 Sep 2014, Felipe Balbi wrote:
> 
> > > > According to the kerneldoc for usb_ep_set_halt:
> > > > 
> > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > >  * transfer requests are still queued, or if the controller hardware
> > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > 
> > > damn old bugs :-) I'll fix that up and Cc stable.
> > 
> > alright fixed. Below you can see a combined diff which I'll still split
> > into patches so they can be applied properly.
> 
> And this eliminates the problems you saw with g_mass_storage?

yup, working with and without stall=0, with and without debugging on. On
all three systems I tested before ;-)
Alan Stern Sept. 24, 2014, 7:48 p.m. | #4
On Wed, 24 Sep 2014, Felipe Balbi wrote:

> > alright fixed. Below you can see a combined diff which I'll still split
> > into patches so they can be applied properly.
> 
> OTOH, there's really no way to split it. It's all needed to fix a single
> bug. Do you want me to add Reported-by: Alan Stern ?

What we really need is a "Diagnosed-by:" tag.  :-)

I'll settle for Suggested-by:.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Felipe Balbi Sept. 24, 2014, 7:52 p.m. | #5
On Wed, Sep 24, 2014 at 03:48:29PM -0400, Alan Stern wrote:
> On Wed, 24 Sep 2014, Felipe Balbi wrote:
> 
> > > alright fixed. Below you can see a combined diff which I'll still split
> > > into patches so they can be applied properly.
> > 
> > OTOH, there's really no way to split it. It's all needed to fix a single
> > bug. Do you want me to add Reported-by: Alan Stern ?
> 
> What we really need is a "Diagnosed-by:" tag.  :-)

heh, that's right.

> I'll settle for Suggested-by:.

alright, I'll add that :-)
Felipe Balbi Sept. 24, 2014, 8:06 p.m. | #6
On Wed, Sep 24, 2014 at 02:40:12PM -0500, Felipe Balbi wrote:
> On Wed, Sep 24, 2014 at 03:31:13PM -0400, Alan Stern wrote:
> > On Wed, 24 Sep 2014, Felipe Balbi wrote:
> > 
> > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > > 
> > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > >  * transfer requests are still queued, or if the controller hardware
> > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > > 
> > > > damn old bugs :-) I'll fix that up and Cc stable.
> > > 
> > > alright fixed. Below you can see a combined diff which I'll still split
> > > into patches so they can be applied properly.
> > 
> > And this eliminates the problems you saw with g_mass_storage?
> 
> yup, working with and without stall=0, with and without debugging on. On
> all three systems I tested before ;-)

there is still one detail which I just caught and not sure if it's
something we should care. When I run my msc.sh/msc.c tests [1], after
each test I see a new "sdX: unknown partition table". This doesn't
happen with my USB stick.

I'll fire up my sniffer again and see if I find anything peculiar.

[1] https://gitorious.org/usb/usb-tools/source/7eb7ef21de6cd124e0e0d0e7df9ddfff0e2f548e:msc.sh
Felipe Balbi Sept. 24, 2014, 8:13 p.m. | #7
On Wed, Sep 24, 2014 at 03:06:15PM -0500, Felipe Balbi wrote:
> On Wed, Sep 24, 2014 at 02:40:12PM -0500, Felipe Balbi wrote:
> > On Wed, Sep 24, 2014 at 03:31:13PM -0400, Alan Stern wrote:
> > > On Wed, 24 Sep 2014, Felipe Balbi wrote:
> > > 
> > > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > > > 
> > > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > > >  * transfer requests are still queued, or if the controller hardware
> > > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > > > 
> > > > > damn old bugs :-) I'll fix that up and Cc stable.
> > > > 
> > > > alright fixed. Below you can see a combined diff which I'll still split
> > > > into patches so they can be applied properly.
> > > 
> > > And this eliminates the problems you saw with g_mass_storage?
> > 
> > yup, working with and without stall=0, with and without debugging on. On
> > all three systems I tested before ;-)
> 
> there is still one detail which I just caught and not sure if it's
> something we should care. When I run my msc.sh/msc.c tests [1], after
> each test I see a new "sdX: unknown partition table". This doesn't
> happen with my USB stick.

it happens with the USB stick after I destroy the partition table. So I
guess that's normal.
Paul Zimmerman Sept. 25, 2014, 1:37 a.m. | #8
> From: linux-usb-owner@vger.kernel.org [mailto:linux-usb-owner@vger.kernel.org] On Behalf Of Felipe Balbi
> Sent: Wednesday, September 24, 2014 12:18 PM
> 
> On Wed, Sep 24, 2014 at 01:08:15PM -0500, Felipe Balbi wrote:
> > On Wed, Sep 24, 2014 at 01:53:31PM -0400, Alan Stern wrote:
> > > On Wed, 24 Sep 2014, Felipe Balbi wrote:
> > >
> > > > > I'll capture usbmon and send here shortly.
> > > >
> > > > here it is... Interesting part starts at line 73 (114 on this email)
> > > > where the data transport received EPIPE (due to Stall). This time
> > > > however, I was eventually able to talk to the device and managed to
> > > > issue quite a few writes to it.
> > >
> > > Here's where the unexpected stuff begins:
> > >
> > > > ed2541c0 1237463240 S Bo:003:01 -115 31 = 55534243 06000000 c0000000 8000061a 003f00c0 00000000
> 00000000 000000
> > > > ed2541c0 1237463431 C Bo:003:01 0 31 >
> > > > ec1a8540 1237463873 S Bi:003:01 -115 192 <
> > > > ec1a8540 1237464053 C Bi:003:01 -32 0
> > > > ed2541c0 1237464158 S Co:003:00 s 02 01 0000 0081 0000 0
> > > > ed2541c0 1237464359 C Co:003:00 0 0
> > > > ed2541c0 1237468607 S Bi:003:01 -115 13 <
> > > > ed2541c0 1237468802 C Bi:003:01 -75 0
> > >
> > > This is the first MODE SENSE command.  The gadget should send as much
> > > data as it can before halting the bulk-IN endpoint.  Instead, the
> > > endpoint was halted first.
> > >
> > > Then, after the host cleared the halt, the gadget apparently sent the
> > > data that _should_ have been sent previously.  The host was expecting
> > > to receive the CSW at this point, so there was an overflow error.
> > > That's what caused the host to perform a reset.
> > >
> > > Evidently this UDC implements the set_halt method incorrectly.
> > > According to the kerneldoc for usb_ep_set_halt:
> > >
> > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > >  * transfer requests are still queued, or if the controller hardware
> > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> >
> > damn old bugs :-) I'll fix that up and Cc stable.
> 
> alright fixed. Below you can see a combined diff which I'll still split
> into patches so they can be applied properly.

[ snipped patch which grubs around in the GDBGFIFOSPACE registers ]

Ick. That shouldn't be necessary. The Synopsys vendor driver works
fine with the mass-storage gadget (with stall=1) without doing anything
like that.

When did the dwc3 driver start showing this problem? Wouldn't it be
better to find the change which caused this? Or has dwc3 always had
this problem, but you never tested with stall=1 before so didn't see
it?
Felipe Balbi Sept. 25, 2014, 2:40 a.m. | #9
Hi,

On Thu, Sep 25, 2014 at 01:37:05AM +0000, Paul Zimmerman wrote:
> > > > Then, after the host cleared the halt, the gadget apparently sent the
> > > > data that _should_ have been sent previously.  The host was expecting
> > > > to receive the CSW at this point, so there was an overflow error.
> > > > That's what caused the host to perform a reset.
> > > >
> > > > Evidently this UDC implements the set_halt method incorrectly.
> > > > According to the kerneldoc for usb_ep_set_halt:
> > > >
> > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > >  * transfer requests are still queued, or if the controller hardware
> > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > >
> > > damn old bugs :-) I'll fix that up and Cc stable.
> > 
> > alright fixed. Below you can see a combined diff which I'll still split
> > into patches so they can be applied properly.
> 
> [ snipped patch which grubs around in the GDBGFIFOSPACE registers ]
> 
> Ick. That shouldn't be necessary. The Synopsys vendor driver works

you need to make sure that both there are no pending transfers and FIFO
is empty in case of IN endpoints. Databook itself says (sorry, forgot
the section and no access to the docs right now) that to figure out if
the FIFO is empty, application can use GDBGFIFOSPACE. Do you have any
better ideas ?

> fine with the mass-storage gadget (with stall=1) without doing anything
> like that.
> 
> When did the dwc3 driver start showing this problem? Wouldn't it be
> better to find the change which caused this? Or has dwc3 always had

it's probably been like that since ever :-) I just figured that my
scripts always had stall=0 and ended up never testing the other way.

> this problem, but you never tested with stall=1 before so didn't see
> it?

yup. Note that it's not enough to checked for TRB completion because
there could still be data in the FIFO which the host hasn't moved yet,
unless it's 100% guaranteed that the core won't fire XferComplete until
every single bit of data has been moved by the host.

But the way things are, I'd expect core to be firing transfer complete
as soon as all data has reached the FIFO (in case of TX).

cheers
Felipe Balbi Sept. 25, 2014, 2:46 a.m. | #10
Hi again,

On Wed, Sep 24, 2014 at 09:40:37PM -0500, Felipe Balbi wrote:
> On Thu, Sep 25, 2014 at 01:37:05AM +0000, Paul Zimmerman wrote:
> > > > > Then, after the host cleared the halt, the gadget apparently sent the
> > > > > data that _should_ have been sent previously.  The host was expecting
> > > > > to receive the CSW at this point, so there was an overflow error.
> > > > > That's what caused the host to perform a reset.
> > > > >
> > > > > Evidently this UDC implements the set_halt method incorrectly.
> > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > >
> > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > >  * transfer requests are still queued, or if the controller hardware
> > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > >
> > > > damn old bugs :-) I'll fix that up and Cc stable.
> > > 
> > > alright fixed. Below you can see a combined diff which I'll still split
> > > into patches so they can be applied properly.
> > 
> > [ snipped patch which grubs around in the GDBGFIFOSPACE registers ]
> > 
> > Ick. That shouldn't be necessary. The Synopsys vendor driver works
> 
> you need to make sure that both there are no pending transfers and FIFO
> is empty in case of IN endpoints. Databook itself says (sorry, forgot
> the section and no access to the docs right now) that to figure out if
> the FIFO is empty, application can use GDBGFIFOSPACE. Do you have any
> better ideas ?

oh, and btw... I only noticed the failure with Linux. I mentioned that
it, though flakey during start, worked very stably against Mac OS X.
Perhaps Windows is also more forgiving ?

Can you share a little more details on what you guys did to get it to
work without making sure FIFO is empty ? I can't really understand how
you'd cover all cases otherwise...
Paul Zimmerman Sept. 25, 2014, 5:50 p.m. | #11
> From: Felipe Balbi [mailto:balbi@ti.com]
> Sent: Wednesday, September 24, 2014 7:41 PM
> 
> On Thu, Sep 25, 2014 at 01:37:05AM +0000, Paul Zimmerman wrote:
> > > > > Then, after the host cleared the halt, the gadget apparently sent the
> > > > > data that _should_ have been sent previously.  The host was expecting
> > > > > to receive the CSW at this point, so there was an overflow error.
> > > > > That's what caused the host to perform a reset.
> > > > >
> > > > > Evidently this UDC implements the set_halt method incorrectly.
> > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > >
> > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > >  * transfer requests are still queued, or if the controller hardware
> > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > >
> > > > damn old bugs :-) I'll fix that up and Cc stable.
> > >
> > > alright fixed. Below you can see a combined diff which I'll still split
> > > into patches so they can be applied properly.
> >
> > [ snipped patch which grubs around in the GDBGFIFOSPACE registers ]
> >
> > Ick. That shouldn't be necessary. The Synopsys vendor driver works
> 
> you need to make sure that both there are no pending transfers and FIFO
> is empty in case of IN endpoints. Databook itself says (sorry, forgot
> the section and no access to the docs right now) that to figure out if
> the FIFO is empty, application can use GDBGFIFOSPACE. Do you have any
> better ideas ?

I guess I'm just afraid this may just be papering over the real cause.

> > fine with the mass-storage gadget (with stall=1) without doing anything
> > like that.
> >
> > When did the dwc3 driver start showing this problem? Wouldn't it be
> > better to find the change which caused this? Or has dwc3 always had
> 
> it's probably been like that since ever :-) I just figured that my
> scripts always had stall=0 and ended up never testing the other way.
> 
> > this problem, but you never tested with stall=1 before so didn't see
> > it?
> 
> yup. Note that it's not enough to checked for TRB completion because
> there could still be data in the FIFO which the host hasn't moved yet,
> unless it's 100% guaranteed that the core won't fire XferComplete until
> every single bit of data has been moved by the host.

That is supposed to be 100% guaranteed. The IN XferInProgress and
XferComplete events are only delivered after the host has ACKed the
packet.

> But the way things are, I'd expect core to be firing transfer complete
> as soon as all data has reached the FIFO (in case of TX).

Nope.

That's why I don't understand how this can happen for IN. AFAIK, a STALL
is only sent in response to something the host sent in the CBW. At that
point, there shouldn't be any IN transfers active.

BTW, about a year ago, I fixed a similar issue in our vendor driver.
But I don't remember what the cause was, I will search the Git log to
see if I can find the commit that fixed it.
Paul Zimmerman Sept. 25, 2014, 5:54 p.m. | #12
> From: Felipe Balbi [mailto:balbi@ti.com]
> 
> On Wed, Sep 24, 2014 at 09:40:37PM -0500, Felipe Balbi wrote:
> > On Thu, Sep 25, 2014 at 01:37:05AM +0000, Paul Zimmerman wrote:
> > > > > > Then, after the host cleared the halt, the gadget apparently sent the
> > > > > > data that _should_ have been sent previously.  The host was expecting
> > > > > > to receive the CSW at this point, so there was an overflow error.
> > > > > > That's what caused the host to perform a reset.
> > > > > >
> > > > > > Evidently this UDC implements the set_halt method incorrectly.
> > > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > > >
> > > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > > >  * transfer requests are still queued, or if the controller hardware
> > > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > > >
> > > > > damn old bugs :-) I'll fix that up and Cc stable.
> > > >
> > > > alright fixed. Below you can see a combined diff which I'll still split
> > > > into patches so they can be applied properly.
> > >
> > > [ snipped patch which grubs around in the GDBGFIFOSPACE registers ]
> > >
> > > Ick. That shouldn't be necessary. The Synopsys vendor driver works
> >
> > you need to make sure that both there are no pending transfers and FIFO
> > is empty in case of IN endpoints. Databook itself says (sorry, forgot
> > the section and no access to the docs right now) that to figure out if
> > the FIFO is empty, application can use GDBGFIFOSPACE. Do you have any
> > better ideas ?
> 
> oh, and btw... I only noticed the failure with Linux. I mentioned that
> it, though flakey during start, worked very stably against Mac OS X.
> Perhaps Windows is also more forgiving ?

Yeah, I see you said it only fails with 3.17-rc on an embedded platform
you have. Can you say which platform that is? Do you know whose xHCI
controller it uses?

> Can you share a little more details on what you guys did to get it to
> work without making sure FIFO is empty ? I can't really understand how
> you'd cover all cases otherwise...

See my other email, I will try to find the fix for the similar issue
that we saw.
Alan Stern Sept. 25, 2014, 6:07 p.m. | #13
On Thu, 25 Sep 2014, Paul Zimmerman wrote:

> That's why I don't understand how this can happen for IN. AFAIK, a STALL
> is only sent in response to something the host sent in the CBW. At that
> point, there shouldn't be any IN transfers active.

The gadget may send a partial response to the CBW.  The end of the 
response is marked with a STALL.  The mass-storage gadget driver 
submits the partial response and then calls usb_ep_set_halt() without 
waiting for the IN data to be delivered.  It relies on the UDC driver 
returning -EAGAIN if any data is still pending.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Zimmerman Sept. 25, 2014, 7:08 p.m. | #14
> From: Alan Stern [mailto:stern@rowland.harvard.edu]
> Sent: Thursday, September 25, 2014 11:08 AM
> 
> On Thu, 25 Sep 2014, Paul Zimmerman wrote:
> 
> > That's why I don't understand how this can happen for IN. AFAIK, a STALL
> > is only sent in response to something the host sent in the CBW. At that
> > point, there shouldn't be any IN transfers active.
> 
> The gadget may send a partial response to the CBW.  The end of the
> response is marked with a STALL.  The mass-storage gadget driver
> submits the partial response and then calls usb_ep_set_halt() without
> waiting for the IN data to be delivered.  It relies on the UDC driver
> returning -EAGAIN if any data is still pending.

I guess you're referring to the code under the DATA_DIR_TO_HOST case
in finish_reply().

Felipe, in our vendor driver, the ep_set_halt() and ep_set_wedge()
functions check the request queue for the endpoint, and if it is not
empty, they return -EAGAIN.

I see your patch for the dwc3 driver does add that, in addition to the
FIFO empty check. Does it still work OK if you remove the FIFO empty
check portion?
Paul Zimmerman Sept. 25, 2014, 7:43 p.m. | #15
> From: Paul Zimmerman
> 
> > From: Felipe Balbi [mailto:balbi@ti.com]
> >
> > On Wed, Sep 24, 2014 at 09:40:37PM -0500, Felipe Balbi wrote:
> > > On Thu, Sep 25, 2014 at 01:37:05AM +0000, Paul Zimmerman wrote:
> > > > > > > Then, after the host cleared the halt, the gadget apparently sent the
> > > > > > > data that _should_ have been sent previously.  The host was expecting
> > > > > > > to receive the CSW at this point, so there was an overflow error.
> > > > > > > That's what caused the host to perform a reset.
> > > > > > >
> > > > > > > Evidently this UDC implements the set_halt method incorrectly.
> > > > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > > > >
> > > > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > > > >  * transfer requests are still queued, or if the controller hardware
> > > > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > > > >
> > > > > > damn old bugs :-) I'll fix that up and Cc stable.
> > > > >
> > > > > alright fixed. Below you can see a combined diff which I'll still split
> > > > > into patches so they can be applied properly.
> > > >
> > > > [ snipped patch which grubs around in the GDBGFIFOSPACE registers ]
> > > >
> > > > Ick. That shouldn't be necessary. The Synopsys vendor driver works
> > >
> > > you need to make sure that both there are no pending transfers and FIFO
> > > is empty in case of IN endpoints. Databook itself says (sorry, forgot
> > > the section and no access to the docs right now) that to figure out if
> > > the FIFO is empty, application can use GDBGFIFOSPACE. Do you have any
> > > better ideas ?
> >
> > oh, and btw... I only noticed the failure with Linux. I mentioned that
> > it, though flakey during start, worked very stably against Mac OS X.
> > Perhaps Windows is also more forgiving ?
> 
> Yeah, I see you said it only fails with 3.17-rc on an embedded platform
> you have. Can you say which platform that is? Do you know whose xHCI
> controller it uses?
> 
> > Can you share a little more details on what you guys did to get it to
> > work without making sure FIFO is empty ? I can't really understand how
> > you'd cover all cases otherwise...
> 
> See my other email, I will try to find the fix for the similar issue
> that we saw.

OK, it looks like that issue is not related to this one. We were
accidentally setting the Stream Capable bit in the endpoint
configuration for all bulk EPs, not just stream-enabled ones. That
caused several strange behaviors, including the Set Stall command
not working.
Felipe Balbi Sept. 25, 2014, 9:03 p.m. | #16
On Thu, Sep 25, 2014 at 05:54:10PM +0000, Paul Zimmerman wrote:
> > From: Felipe Balbi [mailto:balbi@ti.com]
> > 
> > On Wed, Sep 24, 2014 at 09:40:37PM -0500, Felipe Balbi wrote:
> > > On Thu, Sep 25, 2014 at 01:37:05AM +0000, Paul Zimmerman wrote:
> > > > > > > Then, after the host cleared the halt, the gadget apparently sent the
> > > > > > > data that _should_ have been sent previously.  The host was expecting
> > > > > > > to receive the CSW at this point, so there was an overflow error.
> > > > > > > That's what caused the host to perform a reset.
> > > > > > >
> > > > > > > Evidently this UDC implements the set_halt method incorrectly.
> > > > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > > > >
> > > > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > > > >  * transfer requests are still queued, or if the controller hardware
> > > > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > > > >
> > > > > > damn old bugs :-) I'll fix that up and Cc stable.
> > > > >
> > > > > alright fixed. Below you can see a combined diff which I'll still split
> > > > > into patches so they can be applied properly.
> > > >
> > > > [ snipped patch which grubs around in the GDBGFIFOSPACE registers ]
> > > >
> > > > Ick. That shouldn't be necessary. The Synopsys vendor driver works
> > >
> > > you need to make sure that both there are no pending transfers and FIFO
> > > is empty in case of IN endpoints. Databook itself says (sorry, forgot
> > > the section and no access to the docs right now) that to figure out if
> > > the FIFO is empty, application can use GDBGFIFOSPACE. Do you have any
> > > better ideas ?
> > 
> > oh, and btw... I only noticed the failure with Linux. I mentioned that
> > it, though flakey during start, worked very stably against Mac OS X.
> > Perhaps Windows is also more forgiving ?
> 
> Yeah, I see you said it only fails with 3.17-rc on an embedded platform
> you have. Can you say which platform that is? Do you know whose xHCI

AM473x with DWC3 2.40a (2 instances of it).

> controller it uses?

yours :-)

> > Can you share a little more details on what you guys did to get it to
> > work without making sure FIFO is empty ? I can't really understand how
> > you'd cover all cases otherwise...
> 
> See my other email, I will try to find the fix for the similar issue
> that we saw.

alright, tks
Felipe Balbi Sept. 25, 2014, 9:03 p.m. | #17
On Thu, Sep 25, 2014 at 07:43:35PM +0000, Paul Zimmerman wrote:
> > From: Paul Zimmerman
> > 
> > > From: Felipe Balbi [mailto:balbi@ti.com]
> > >
> > > On Wed, Sep 24, 2014 at 09:40:37PM -0500, Felipe Balbi wrote:
> > > > On Thu, Sep 25, 2014 at 01:37:05AM +0000, Paul Zimmerman wrote:
> > > > > > > > Then, after the host cleared the halt, the gadget apparently sent the
> > > > > > > > data that _should_ have been sent previously.  The host was expecting
> > > > > > > > to receive the CSW at this point, so there was an overflow error.
> > > > > > > > That's what caused the host to perform a reset.
> > > > > > > >
> > > > > > > > Evidently this UDC implements the set_halt method incorrectly.
> > > > > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > > > > >
> > > > > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > > > > >  * transfer requests are still queued, or if the controller hardware
> > > > > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > > > > >
> > > > > > > damn old bugs :-) I'll fix that up and Cc stable.
> > > > > >
> > > > > > alright fixed. Below you can see a combined diff which I'll still split
> > > > > > into patches so they can be applied properly.
> > > > >
> > > > > [ snipped patch which grubs around in the GDBGFIFOSPACE registers ]
> > > > >
> > > > > Ick. That shouldn't be necessary. The Synopsys vendor driver works
> > > >
> > > > you need to make sure that both there are no pending transfers and FIFO
> > > > is empty in case of IN endpoints. Databook itself says (sorry, forgot
> > > > the section and no access to the docs right now) that to figure out if
> > > > the FIFO is empty, application can use GDBGFIFOSPACE. Do you have any
> > > > better ideas ?
> > >
> > > oh, and btw... I only noticed the failure with Linux. I mentioned that
> > > it, though flakey during start, worked very stably against Mac OS X.
> > > Perhaps Windows is also more forgiving ?
> > 
> > Yeah, I see you said it only fails with 3.17-rc on an embedded platform
> > you have. Can you say which platform that is? Do you know whose xHCI
> > controller it uses?
> > 
> > > Can you share a little more details on what you guys did to get it to
> > > work without making sure FIFO is empty ? I can't really understand how
> > > you'd cover all cases otherwise...
> > 
> > See my other email, I will try to find the fix for the similar issue
> > that we saw.
> 
> OK, it looks like that issue is not related to this one. We were
> accidentally setting the Stream Capable bit in the endpoint
> configuration for all bulk EPs, not just stream-enabled ones. That
> caused several strange behaviors, including the Set Stall command
> not working.

ok, this it not about Set Stall not working, however, it's about making
sure we don't stall at the wrong time :-)
Felipe Balbi Sept. 25, 2014, 9:06 p.m. | #18
Hi,

On Thu, Sep 25, 2014 at 05:50:59PM +0000, Paul Zimmerman wrote:
> > From: Felipe Balbi [mailto:balbi@ti.com]
> > Sent: Wednesday, September 24, 2014 7:41 PM
> > 
> > On Thu, Sep 25, 2014 at 01:37:05AM +0000, Paul Zimmerman wrote:
> > > > > > Then, after the host cleared the halt, the gadget apparently sent the
> > > > > > data that _should_ have been sent previously.  The host was expecting
> > > > > > to receive the CSW at this point, so there was an overflow error.
> > > > > > That's what caused the host to perform a reset.
> > > > > >
> > > > > > Evidently this UDC implements the set_halt method incorrectly.
> > > > > > According to the kerneldoc for usb_ep_set_halt:
> > > > > >
> > > > > >  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
> > > > > >  * transfer requests are still queued, or if the controller hardware
> > > > > >  * (usually a FIFO) still holds bytes that the host hasn't collected.
> > > > >
> > > > > damn old bugs :-) I'll fix that up and Cc stable.
> > > >
> > > > alright fixed. Below you can see a combined diff which I'll still split
> > > > into patches so they can be applied properly.
> > >
> > > [ snipped patch which grubs around in the GDBGFIFOSPACE registers ]
> > >
> > > Ick. That shouldn't be necessary. The Synopsys vendor driver works
> > 
> > you need to make sure that both there are no pending transfers and FIFO
> > is empty in case of IN endpoints. Databook itself says (sorry, forgot
> > the section and no access to the docs right now) that to figure out if
> > the FIFO is empty, application can use GDBGFIFOSPACE. Do you have any
> > better ideas ?
> 
> I guess I'm just afraid this may just be papering over the real cause.

Well, the API documentation, as pointed out by Alan, calls for:

380  * Attempts to halt IN endpoints will fail (returning -EAGAIN) if any
381  * transfer requests are still queued, or if the controller hardware
382  * (usually a FIFO) still holds bytes that the host hasn't collected.

There's no other way to ensuring FIFO is empty with this core.

> > > fine with the mass-storage gadget (with stall=1) without doing anything
> > > like that.
> > >
> > > When did the dwc3 driver start showing this problem? Wouldn't it be
> > > better to find the change which caused this? Or has dwc3 always had
> > 
> > it's probably been like that since ever :-) I just figured that my
> > scripts always had stall=0 and ended up never testing the other way.
> > 
> > > this problem, but you never tested with stall=1 before so didn't see
> > > it?
> > 
> > yup. Note that it's not enough to checked for TRB completion because
> > there could still be data in the FIFO which the host hasn't moved yet,
> > unless it's 100% guaranteed that the core won't fire XferComplete until
> > every single bit of data has been moved by the host.
> 
> That is supposed to be 100% guaranteed. The IN XferInProgress and
> XferComplete events are only delivered after the host has ACKed the
> packet.
> 
> > But the way things are, I'd expect core to be firing transfer complete
> > as soon as all data has reached the FIFO (in case of TX).
> 
> Nope.
> 
> That's why I don't understand how this can happen for IN. AFAIK, a STALL
> is only sent in response to something the host sent in the CBW. At that
> point, there shouldn't be any IN transfers active.

you could race with the HW, right ? You just issued Start Transfer for
that IN transfer and control returns to the gadget driver which can, at
that very moment, issue a Set Stall. SW *must* make sure that there's
nothing pending yet. We haven't received XferComplete for that IN we
just started.

> BTW, about a year ago, I fixed a similar issue in our vendor driver.
> But I don't remember what the cause was, I will search the Git log to
> see if I can find the commit that fixed it.

ok, thanks.
Felipe Balbi Sept. 25, 2014, 9:10 p.m. | #19
Hi,

On Thu, Sep 25, 2014 at 07:08:12PM +0000, Paul Zimmerman wrote:
> > From: Alan Stern [mailto:stern@rowland.harvard.edu]
> > Sent: Thursday, September 25, 2014 11:08 AM
> > 
> > On Thu, 25 Sep 2014, Paul Zimmerman wrote:
> > 
> > > That's why I don't understand how this can happen for IN. AFAIK, a STALL
> > > is only sent in response to something the host sent in the CBW. At that
> > > point, there shouldn't be any IN transfers active.
> > 
> > The gadget may send a partial response to the CBW.  The end of the
> > response is marked with a STALL.  The mass-storage gadget driver
> > submits the partial response and then calls usb_ep_set_halt() without
> > waiting for the IN data to be delivered.  It relies on the UDC driver
> > returning -EAGAIN if any data is still pending.
> 
> I guess you're referring to the code under the DATA_DIR_TO_HOST case
> in finish_reply().
> 
> Felipe, in our vendor driver, the ep_set_halt() and ep_set_wedge()
> functions check the request queue for the endpoint, and if it is not
> empty, they return -EAGAIN.

is that really enough ? The request is deleted (rather, moved to another
list) as soon as StartTransfer is issued. I can certainly check both
lists (and already do) but I fear that we might still race with the HW
because there's no documentation to guarantee that I won't :-)

> I see your patch for the dwc3 driver does add that, in addition to the
> FIFO empty check. Does it still work OK if you remove the FIFO empty
> check portion?

It probably does, but I can't be sure there won't be a corner case where
the list is empty (Xfercomplete was issued and request given back) but
host hasn't moved all the data. Synopsys databook doesn't guarantee that
will never be the case.

Can you confirm, without a shadow of a doubt, that XferComplete will
only be issued after host has moved every single bit of data ? If that
can be updated to the databook, then sure, I can remove the FIFO space
check; otherwise we might leave a corner case that will take us a few
days to debug again because it will be very difficult to trigger :-)

cheers
Paul Zimmerman Sept. 25, 2014, 9:57 p.m. | #20
> From: Felipe Balbi [mailto:balbi@ti.com]
> 
> On Thu, Sep 25, 2014 at 07:08:12PM +0000, Paul Zimmerman wrote:
> > > From: Alan Stern [mailto:stern@rowland.harvard.edu]
> > > Sent: Thursday, September 25, 2014 11:08 AM
> > >
> > > On Thu, 25 Sep 2014, Paul Zimmerman wrote:
> > >
> > > > That's why I don't understand how this can happen for IN. AFAIK, a STALL
> > > > is only sent in response to something the host sent in the CBW. At that
> > > > point, there shouldn't be any IN transfers active.
> > >
> > > The gadget may send a partial response to the CBW.  The end of the
> > > response is marked with a STALL.  The mass-storage gadget driver
> > > submits the partial response and then calls usb_ep_set_halt() without
> > > waiting for the IN data to be delivered.  It relies on the UDC driver
> > > returning -EAGAIN if any data is still pending.
> >
> > I guess you're referring to the code under the DATA_DIR_TO_HOST case
> > in finish_reply().
> >
> > Felipe, in our vendor driver, the ep_set_halt() and ep_set_wedge()
> > functions check the request queue for the endpoint, and if it is not
> > empty, they return -EAGAIN.
> 
> is that really enough ? The request is deleted (rather, moved to another
> list) as soon as StartTransfer is issued. I can certainly check both
> lists (and already do) but I fear that we might still race with the HW
> because there's no documentation to guarantee that I won't :-)
> 
> > I see your patch for the dwc3 driver does add that, in addition to the
> > FIFO empty check. Does it still work OK if you remove the FIFO empty
> > check portion?
> 
> It probably does, but I can't be sure there won't be a corner case where
> the list is empty (Xfercomplete was issued and request given back) but
> host hasn't moved all the data. Synopsys databook doesn't guarantee that
> will never be the case.
> 
> Can you confirm, without a shadow of a doubt, that XferComplete will
> only be issued after host has moved every single bit of data ? If that
> can be updated to the databook, then sure, I can remove the FIFO space
> check; otherwise we might leave a corner case that will take us a few
> days to debug again because it will be very difficult to trigger :-)

From the 2.70a databook, section 8.2.2:

"While processing TRBs, two conditions may cause the core to write out an
 event and raise an interrupt line:
	- TRB Complete:
		- For OUT endpoints, a packet is received which reduces the
		  remaining byte count in the TRB buffer to zero.
		- For IN endpoints, an acknowledgement is received for a
		  transmitted packet which reduces the remaining byte count
		  in the TRB buffer to zero.
	- Short Packet Received:
		- For OUT endpoints only. While writing to a TRB buffer,
		  the endpoint receives a packet that is smaller than the
		  endpoint's MaxPacketSize.
"

So for an IN, the completion event doesn't come until the TRB has been
ACKed, which means all the data has been sent.

But, I'm not saying you *have* to remove the FIFO space check. I think
we now know the problem was caused by the missing -EAGAIN. So having
the FIFO check there shouldn't hurt anything, it's not masking some
other problem. And you may need it in the future for one of those
other cases the databook talks about.
Paul Zimmerman Sept. 25, 2014, 10 p.m. | #21
> From the 2.70a databook, section 8.2.2:

Sorry, it's databook section 8.2.3, not 8.2.2.
Felipe Balbi Sept. 25, 2014, 11:12 p.m. | #22
Hi,

On Thu, Sep 25, 2014 at 09:57:59PM +0000, Paul Zimmerman wrote:
> > From: Felipe Balbi [mailto:balbi@ti.com]
> > 
> > On Thu, Sep 25, 2014 at 07:08:12PM +0000, Paul Zimmerman wrote:
> > > > From: Alan Stern [mailto:stern@rowland.harvard.edu]
> > > > Sent: Thursday, September 25, 2014 11:08 AM
> > > >
> > > > On Thu, 25 Sep 2014, Paul Zimmerman wrote:
> > > >
> > > > > That's why I don't understand how this can happen for IN. AFAIK, a STALL
> > > > > is only sent in response to something the host sent in the CBW. At that
> > > > > point, there shouldn't be any IN transfers active.
> > > >
> > > > The gadget may send a partial response to the CBW.  The end of the
> > > > response is marked with a STALL.  The mass-storage gadget driver
> > > > submits the partial response and then calls usb_ep_set_halt() without
> > > > waiting for the IN data to be delivered.  It relies on the UDC driver
> > > > returning -EAGAIN if any data is still pending.
> > >
> > > I guess you're referring to the code under the DATA_DIR_TO_HOST case
> > > in finish_reply().
> > >
> > > Felipe, in our vendor driver, the ep_set_halt() and ep_set_wedge()
> > > functions check the request queue for the endpoint, and if it is not
> > > empty, they return -EAGAIN.
> > 
> > is that really enough ? The request is deleted (rather, moved to another
> > list) as soon as StartTransfer is issued. I can certainly check both
> > lists (and already do) but I fear that we might still race with the HW
> > because there's no documentation to guarantee that I won't :-)
> > 
> > > I see your patch for the dwc3 driver does add that, in addition to the
> > > FIFO empty check. Does it still work OK if you remove the FIFO empty
> > > check portion?
> > 
> > It probably does, but I can't be sure there won't be a corner case where
> > the list is empty (Xfercomplete was issued and request given back) but
> > host hasn't moved all the data. Synopsys databook doesn't guarantee that
> > will never be the case.
> > 
> > Can you confirm, without a shadow of a doubt, that XferComplete will
> > only be issued after host has moved every single bit of data ? If that
> > can be updated to the databook, then sure, I can remove the FIFO space
> > check; otherwise we might leave a corner case that will take us a few
> > days to debug again because it will be very difficult to trigger :-)
> 
> From the 2.70a databook, section 8.2.2:
> 
> "While processing TRBs, two conditions may cause the core to write out an
>  event and raise an interrupt line:
> 	- TRB Complete:
> 		- For OUT endpoints, a packet is received which reduces the
> 		  remaining byte count in the TRB buffer to zero.
> 		- For IN endpoints, an acknowledgement is received for a
> 		  transmitted packet which reduces the remaining byte count
> 		  in the TRB buffer to zero.
> 	- Short Packet Received:
> 		- For OUT endpoints only. While writing to a TRB buffer,
> 		  the endpoint receives a packet that is smaller than the
> 		  endpoint's MaxPacketSize.
> "
> 
> So for an IN, the completion event doesn't come until the TRB has been
> ACKed, which means all the data has been sent.

a very good point indeed. I completely missed that. Well, the patch can
become a lot smaller, which makes my life easier when backporting :-)

I'll test that out tomorrow, for sure :-)

> But, I'm not saying you *have* to remove the FIFO space check. I think
> we now know the problem was caused by the missing -EAGAIN. So having

sure, that's certainly what was missing.

> the FIFO check there shouldn't hurt anything, it's not masking some

it's also completely useless. A few register accesses for no reason :-)

> other problem. And you may need it in the future for one of those
> other cases the databook talks about.

I'll keep that in the back of my head. Perhaps I can repurpose these
FIFO space accessors to implement usb_ep_fifo_status(). I'll consider
that.

cheers
Felipe Balbi Sept. 26, 2014, 3:45 p.m. | #23
On Thu, Sep 25, 2014 at 06:12:51PM -0500, Felipe Balbi wrote:
> Hi,
> 
> On Thu, Sep 25, 2014 at 09:57:59PM +0000, Paul Zimmerman wrote:
> > > From: Felipe Balbi [mailto:balbi@ti.com]
> > > 
> > > On Thu, Sep 25, 2014 at 07:08:12PM +0000, Paul Zimmerman wrote:
> > > > > From: Alan Stern [mailto:stern@rowland.harvard.edu]
> > > > > Sent: Thursday, September 25, 2014 11:08 AM
> > > > >
> > > > > On Thu, 25 Sep 2014, Paul Zimmerman wrote:
> > > > >
> > > > > > That's why I don't understand how this can happen for IN. AFAIK, a STALL
> > > > > > is only sent in response to something the host sent in the CBW. At that
> > > > > > point, there shouldn't be any IN transfers active.
> > > > >
> > > > > The gadget may send a partial response to the CBW.  The end of the
> > > > > response is marked with a STALL.  The mass-storage gadget driver
> > > > > submits the partial response and then calls usb_ep_set_halt() without
> > > > > waiting for the IN data to be delivered.  It relies on the UDC driver
> > > > > returning -EAGAIN if any data is still pending.
> > > >
> > > > I guess you're referring to the code under the DATA_DIR_TO_HOST case
> > > > in finish_reply().
> > > >
> > > > Felipe, in our vendor driver, the ep_set_halt() and ep_set_wedge()
> > > > functions check the request queue for the endpoint, and if it is not
> > > > empty, they return -EAGAIN.
> > > 
> > > is that really enough ? The request is deleted (rather, moved to another
> > > list) as soon as StartTransfer is issued. I can certainly check both
> > > lists (and already do) but I fear that we might still race with the HW
> > > because there's no documentation to guarantee that I won't :-)
> > > 
> > > > I see your patch for the dwc3 driver does add that, in addition to the
> > > > FIFO empty check. Does it still work OK if you remove the FIFO empty
> > > > check portion?
> > > 
> > > It probably does, but I can't be sure there won't be a corner case where
> > > the list is empty (Xfercomplete was issued and request given back) but
> > > host hasn't moved all the data. Synopsys databook doesn't guarantee that
> > > will never be the case.
> > > 
> > > Can you confirm, without a shadow of a doubt, that XferComplete will
> > > only be issued after host has moved every single bit of data ? If that
> > > can be updated to the databook, then sure, I can remove the FIFO space
> > > check; otherwise we might leave a corner case that will take us a few
> > > days to debug again because it will be very difficult to trigger :-)
> > 
> > From the 2.70a databook, section 8.2.2:
> > 
> > "While processing TRBs, two conditions may cause the core to write out an
> >  event and raise an interrupt line:
> > 	- TRB Complete:
> > 		- For OUT endpoints, a packet is received which reduces the
> > 		  remaining byte count in the TRB buffer to zero.
> > 		- For IN endpoints, an acknowledgement is received for a
> > 		  transmitted packet which reduces the remaining byte count
> > 		  in the TRB buffer to zero.
> > 	- Short Packet Received:
> > 		- For OUT endpoints only. While writing to a TRB buffer,
> > 		  the endpoint receives a packet that is smaller than the
> > 		  endpoint's MaxPacketSize.
> > "
> > 
> > So for an IN, the completion event doesn't come until the TRB has been
> > ACKed, which means all the data has been sent.
> 
> a very good point indeed. I completely missed that. Well, the patch can
> become a lot smaller, which makes my life easier when backporting :-)
> 
> I'll test that out tomorrow, for sure :-)
> 
> > But, I'm not saying you *have* to remove the FIFO space check. I think
> > we now know the problem was caused by the missing -EAGAIN. So having
> 
> sure, that's certainly what was missing.
> 
> > the FIFO check there shouldn't hurt anything, it's not masking some
> 
> it's also completely useless. A few register accesses for no reason :-)
> 
> > other problem. And you may need it in the future for one of those
> > other cases the databook talks about.
> 
> I'll keep that in the back of my head. Perhaps I can repurpose these
> FIFO space accessors to implement usb_ep_fifo_status(). I'll consider
> that.

alright, removed all that and it still works fine. Even passes Halt
Endpoint test on usb20cv.

Patch

diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
index 66f6256..834f524 100644
--- a/drivers/usb/dwc3/core.h
+++ b/drivers/usb/dwc3/core.h
@@ -210,6 +210,14 @@ 
 #define DWC3_GHWPARAMS4_HIBER_SCRATCHBUFS(n)	(((n) & (0x0f << 13)) >> 13)
 #define DWC3_MAX_HIBER_SCRATCHBUFS		15
 
+/* Global FIFO Space Register */
+#define DWC3_GDBGFIFOSPACE_TXFIFO		(0 << 5)
+#define DWC3_GDBGFIFOSPACE_RXFIFO		(1 << 5)
+#define DWC3_GDBGFIFOSPACE_TXREQ_Q		(2 << 5)
+#define DWC3_GDBGFIFOSPACE_RXREQ_Q		(3 << 5)
+
+#define DWC3_GDBGFIFOSPACE_SPACE_AVAIL(num)	(((num) & 0xffff0000) >> 16)
+
 /* Device Configuration Register */
 #define DWC3_DCFG_DEVADDR(addr)	((addr) << 3)
 #define DWC3_DCFG_DEVADDR_MASK	DWC3_DCFG_DEVADDR(0x7f)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index de53361..5e89913 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -145,6 +145,75 @@  int dwc3_gadget_set_link_state(struct dwc3 *dwc, enum dwc3_link_state state)
 }
 
 /**
+ * dwc3_gadget_ep_fifo_space - returns currently available space on FIFO
+ * @dwc: pointer to our context struct
+ * @dep: the endpoint to fetch FIFO space
+ *
+ * This function will return the currently available FIFO space
+ */
+static int dwc3_gadget_ep_fifo_space(struct dwc3 *dwc, struct dwc3_ep *dep)
+{
+	u32 current_fifo;
+	u32 reg;
+
+	if (dep->direction)
+		reg = DWC3_GDBGFIFOSPACE_TXFIFO;
+	else
+		reg = DWC3_GDBGFIFOSPACE_RXFIFO;
+
+	/* remove direction bit */
+	reg |= dep->number >> 1;
+
+	dwc3_writel(dwc->regs, DWC3_GDBGFIFOSPACE, reg);
+	reg = dwc3_readl(dwc->regs, DWC3_GDBGFIFOSPACE);
+	current_fifo = DWC3_GDBGFIFOSPACE_SPACE_AVAIL(reg);
+
+	return current_fifo;
+}
+
+/**
+ * dwc3_gadget_ep_fifo_size - returns allocated FIFO size
+ * @dwc: pointer to our context struct
+ * @dep: TX endpoint to fetch allocated FIFO size
+ *
+ * This function will read the correct TX FIFO register and
+ * return the FIFO size
+ */
+static int dwc3_gadget_ep_fifo_size(struct dwc3 *dwc, struct dwc3_ep *dep)
+{
+	if (WARN_ON(!dep->direction))
+		return 0;
+
+	return dwc3_readl(dwc->regs, DWC3_GTXFIFOSIZ(dep->number)) & 0xffff;
+}
+
+/**
+ * dwc3_gadget_ep_fifo_empty - returns true when FIFO is empty
+ * @dwc: pointer to our context struct
+ * @dep: endpoint to request FIFO space
+ *
+ * This function will return a TRUE when FIFO is empty and FALSE
+ * otherwise.
+ */
+static int dwc3_gadget_ep_fifo_empty(struct dwc3 *dwc, struct dwc3_ep *dep)
+{
+	u32 allocated_fifo;
+	u32 current_fifo;
+
+	/* should not be called for RX endpoints */
+	if (WARN_ON(!dep->direction))
+		return false;
+
+	current_fifo = dwc3_gadget_ep_fifo_space(dwc, dep);
+	allocated_fifo = dwc3_gadget_ep_fifo_size(dwc, dep);
+
+	dev_vdbg(dwc->dev, "%s: FIFO space %u/%u\n", dep->name,
+			current_fifo, allocated_fifo);
+
+	return current_fifo == allocated_fifo;
+}
+
+/**
  * dwc3_gadget_resize_tx_fifos - reallocate fifo spaces for current use-case
  * @dwc: pointer to our context structure
  *
@@ -1216,6 +1285,20 @@  int __dwc3_gadget_ep_set_halt(struct dwc3_ep *dep, int value)
 	memset(&params, 0x00, sizeof(params));
 
 	if (value) {
+		if (dep->flags & DWC3_EP_BUSY ||
+				(!list_empty(&dep->req_queued) ||
+				 !list_empty(&dep->request_list))) {
+			dev_dbg(dwc->dev, "%s: pending request, cannot halt\n",
+					dep->name);
+			return -EAGAIN;
+		}
+
+		if (dep->direction && !dwc3_gadget_ep_fifo_empty(dwc, dep)) {
+			dev_dbg(dwc->dev, "%s: FIFO not empty, cannot halt\n",
+					dep->name);
+			return -EAGAIN;
+		}
+
 		ret = dwc3_send_gadget_ep_cmd(dwc, dep->number,
 			DWC3_DEPCMD_SETSTALL, &params);
 		if (ret)