[2/2] usb: host: xhci: Handle the right timeout command

Message ID	0c39bfed1cf6f7b747e702aa841f82c9d2140f27.1480922249.git.baolin.wang@linaro.org
State	New
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: best guess record for domain of linux-usb-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; From: Baolin Wang <baolin.wang@linaro.org> To: mathias.nyman@intel.com, gregkh@linuxfoundation.org Cc: baolu.lu@linux.intel.com, linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org, broonie@kernel.org, baolin.wang@linaro.org Subject: [PATCH 2/2] usb: host: xhci: Handle the right timeout command Date: Mon, 5 Dec 2016 15:51:50 +0800 Message-Id: <0c39bfed1cf6f7b747e702aa841f82c9d2140f27.1480922249.git.baolin.wang@linaro.org> In-Reply-To: <b0157454dce55d1288c7e097ddffc3d6c81351b2.1480922249.git.baolin.wang@linaro.org> References: <b0157454dce55d1288c7e097ddffc3d6c81351b2.1480922249.git.baolin.wang@linaro.org> In-Reply-To: <b0157454dce55d1288c7e097ddffc3d6c81351b2.1480922249.git.baolin.wang@linaro.org> References: <b0157454dce55d1288c7e097ddffc3d6c81351b2.1480922249.git.baolin.wang@linaro.org> Sender: linux-usb-owner@vger.kernel.org Precedence: bulk

(Exiting) Baolin Wang Dec. 5, 2016, 7:51 a.m. UTC

If a command event is found on the event ring during an interrupt,
we need to stop the command timer with del_timer(). Since del_timer()
can fail if the timer is running and waiting on the xHCI lock, then
it maybe get the wrong timeout command in xhci_handle_command_timeout()
if host fetched a new command and updated the xhci->current_cmd in
handle_cmd_completion(). For this situation, we need a way to signal
to the command timer that everything is fine and it should exit.

We should introduce a counter (xhci->current_cmd_pending) for the number
of pending commands. If we need to cancel the command timer and del_timer()
succeeds, we decrement the number of pending commands. If del_timer() fails,
we leave the number of pending commands alone.

For handling timeout command, in xhci_handle_command_timeout() we will check
the counter after decrementing it, if the counter (xhci->current_cmd_pending)
is 0, which means xhci->current_cmd is the right timeout command. If the
counter (xhci->current_cmd_pending) is greater than 0, which means current
timeout command has been handled by host and host has fetched new command as
xhci->current_cmd, then just return and wait for new current command.

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>

---
 drivers/usb/host/xhci-ring.c |   29 ++++++++++++++++++++++++++++-
 drivers/usb/host/xhci.h      |    1 +
 2 files changed, 29 insertions(+), 1 deletion(-)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

(Exiting) Baolin Wang Dec. 13, 2016, 3:21 a.m. UTC | #1

Hi Mathias,

On 12 December 2016 at 23:52, Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
> On 05.12.2016 09:51, Baolin Wang wrote:

>>

>> If a command event is found on the event ring during an interrupt,

>> we need to stop the command timer with del_timer(). Since del_timer()

>> can fail if the timer is running and waiting on the xHCI lock, then

>> it maybe get the wrong timeout command in xhci_handle_command_timeout()

>> if host fetched a new command and updated the xhci->current_cmd in

>> handle_cmd_completion(). For this situation, we need a way to signal

>> to the command timer that everything is fine and it should exit.

>

>

> Ah, right, this could actually happen.

>

>>

>>

>> We should introduce a counter (xhci->current_cmd_pending) for the number

>> of pending commands. If we need to cancel the command timer and

>> del_timer()

>> succeeds, we decrement the number of pending commands. If del_timer()

>> fails,

>> we leave the number of pending commands alone.

>>

>> For handling timeout command, in xhci_handle_command_timeout() we will

>> check

>> the counter after decrementing it, if the counter

>> (xhci->current_cmd_pending)

>> is 0, which means xhci->current_cmd is the right timeout command. If the

>> counter (xhci->current_cmd_pending) is greater than 0, which means current

>> timeout command has been handled by host and host has fetched new command

>> as

>> xhci->current_cmd, then just return and wait for new current command.

>

>

> A counter like this could work.

>

> Writing the abort bit can generate either ABORT+STOP, or just STOP

> event, this seems to cover both.

>

> quick check, case 1: timeout and cmd completion at the same time.

>

> cpu1                                    cpu2

>

> queue_command(first), p++ (=1)

> queue_command(more),

> --completion irq fires--                -- timer times out at same time--

> handle_cmd_completion()                 handle_cmd_timeout(),)

> lock(xhci_lock  )                       spin_on(xhci_lock)

> del_timer() fail, p (=1, nochange)

> cur_cmd = list_next(), p++ (=2)

> unlock(xhci_lock)

>                                         lock(xhci_lock)

>                                         p-- (=1)

>                                         if (p > 0), exit

> OK works

>

> case 2: normal timeout case with ABORT+STOP, no race.

>

> cpu1                                    cpu2

>

> queue_command(first), p++ (=1)

> queue_command(more),

>                                         handle_cmd_timeout()

>                                         p-- (P=0), don't exit

>                                         mod_timer(), p++ (P=1)

>                                         write_abort_bit()

> handle_cmd_comletion(ABORT)

> del_timer(), ok, p-- (p = 0)

> handle_cmd_completion(STOP)

> del_timer(), fail, (P=0)

> handle_stopped_cmd_ring()

> cur_cmd = list_next(), p++ (=1)

> mod_timer()

>

> OK, works, and same for just STOP case, with the only difference that

> during handle_cmd_completion(STOP) p is decremented (p--)


Yes, that's the cases what I want to handle, thanks for your explicit
explanation.

>

> So unless there is a way to find out if cur_cmd is valid in command timeout

> in command timeout with the help of existing flags and lists this would be a

> working

> solution.

>

> -Mathias

>




-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mathias Nyman Dec. 19, 2016, 12:13 p.m. UTC | #2

On 19.12.2016 13:34, Baolin Wang wrote:
> Hi Mathias,

>

> On 19 December 2016 at 18:33, Mathias Nyman

> <mathias.nyman@linux.intel.com> wrote:

>> On 13.12.2016 05:21, Baolin Wang wrote:

>>>

>>> Hi Mathias,

>>>

>>> On 12 December 2016 at 23:52, Mathias Nyman

>>> <mathias.nyman@linux.intel.com> wrote:

>>>>

>>>> On 05.12.2016 09:51, Baolin Wang wrote:

>>>>>

>>>>>

>>>>> If a command event is found on the event ring during an interrupt,

>>>>> we need to stop the command timer with del_timer(). Since del_timer()

>>>>> can fail if the timer is running and waiting on the xHCI lock, then

>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()

>>>>> if host fetched a new command and updated the xhci->current_cmd in

>>>>> handle_cmd_completion(). For this situation, we need a way to signal

>>>>> to the command timer that everything is fine and it should exit.

>>>>

>>>>

>>>>

>>>> Ah, right, this could actually happen.

>>>>

>>>>>

>>>>>

>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number

>>>>> of pending commands. If we need to cancel the command timer and

>>>>> del_timer()

>>>>> succeeds, we decrement the number of pending commands. If del_timer()

>>>>> fails,

>>>>> we leave the number of pending commands alone.

>>>>>

>>>>> For handling timeout command, in xhci_handle_command_timeout() we will

>>>>> check

>>>>> the counter after decrementing it, if the counter

>>>>> (xhci->current_cmd_pending)

>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the

>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means

>>>>> current

>>>>> timeout command has been handled by host and host has fetched new

>>>>> command

>>>>> as

>>>>> xhci->current_cmd, then just return and wait for new current command.

>>>>

>>>>

>>>>

>>>> A counter like this could work.

>>>>

>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP

>>>> event, this seems to cover both.

>>>>

>>>> quick check, case 1: timeout and cmd completion at the same time.

>>>>

>>>> cpu1                                    cpu2

>>>>

>>>> queue_command(first), p++ (=1)

>>>> queue_command(more),

>>>> --completion irq fires--                -- timer times out at same time--

>>>> handle_cmd_completion()                 handle_cmd_timeout(),)

>>>> lock(xhci_lock  )                       spin_on(xhci_lock)

>>>> del_timer() fail, p (=1, nochange)

>>>> cur_cmd = list_next(), p++ (=2)

>>>> unlock(xhci_lock)

>>>>                                           lock(xhci_lock)

>>>>                                           p-- (=1)

>>>>                                           if (p > 0), exit

>>>> OK works

>>>>

>>>> case 2: normal timeout case with ABORT+STOP, no race.

>>>>

>>>> cpu1                                    cpu2

>>>>

>>>> queue_command(first), p++ (=1)

>>>> queue_command(more),

>>>>                                           handle_cmd_timeout()

>>>>                                           p-- (P=0), don't exit

>>>>                                           mod_timer(), p++ (P=1)

>>>>                                           write_abort_bit()

>>>> handle_cmd_comletion(ABORT)

>>>> del_timer(), ok, p-- (p = 0)

>>>> handle_cmd_completion(STOP)

>>>> del_timer(), fail, (P=0)

>>>> handle_stopped_cmd_ring()

>>>> cur_cmd = list_next(), p++ (=1)

>>>> mod_timer()

>>>>

>>>> OK, works, and same for just STOP case, with the only difference that

>>>> during handle_cmd_completion(STOP) p is decremented (p--)

>>>

>>>

>>> Yes, that's the cases what I want to handle, thanks for your explicit

>>> explanation.

>>>

>>

>> Gave this some more thought over the weekend, and this implementation

>> doesn't solve the case when the last command times out and races with the

>> completion handler:

>>

>> cpu1                                    cpu2

>>

>> queue_command(first), p++ (=1)

>> --completion irq fires--                -- timer times out at same time--

>> handle_cmd_completion()                 handle_cmd_timeout(),)

>> lock(xhci_lock )                        spin_on(xhci_lock)

>> del_timer() fail, p (=1, nochange)

>> no more commands, P (=1, nochange)

>> unlock(xhci_lock)

>>                                          lock(xhci_lock)

>>                                          p-- (=0)

>>                                          p == 0, continue, even if we should

>> not.

>>                                            For this we still need to rely on

>> checking cur_cmd == NULL in the timeout function.

>> (Baolus patch sets it to NULL if there are no more commands pending)

>

> As I pointed out in patch 1 of this patchset, this patchset is based

> on Lu Baolu's new fix patch:

> usb: xhci: fix possible wild pointer

> https://www.spinics.net/lists/linux-usb/msg150219.html

>

> After applying Baolu's patch, after decrement the counter, we will

> check the xhci->cur_command if is NULL. So in this situation:

> cpu1                                    cpu2

>

>   queue_command(first), p++ (=1)

>   --completion irq fires--                -- timer times out at same time--

>   handle_cmd_completion()                 handle_cmd_timeout(),)

>   lock(xhci_lock )                        spin_on(xhci_lock)

>   del_timer() fail, p (=1, nochange)

>   no more commands, P (=1, nochange)

>   unlock(xhci_lock)

>                                           lock(xhci_lock)

>                                           p-- (=0)

>                                           no current command, return

>                                           if (!xhci->current_cmd) {

>                                                unlock(xhci_lock);

>                                                return;

>                                           }

>

> It can work.


Yes,

What I wanted to say is that as it relies on Baolus patch for that one case
it seems that patch 2/2 can be replaced by a single line change:

if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))

Right?

-Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

(Exiting) Baolin Wang Dec. 20, 2016, 3:23 a.m. UTC | #3

On 19 December 2016 at 20:13, Mathias Nyman <mathias.nyman@intel.com> wrote:
> On 19.12.2016 13:34, Baolin Wang wrote:

>>

>> Hi Mathias,

>>

>> On 19 December 2016 at 18:33, Mathias Nyman

>> <mathias.nyman@linux.intel.com> wrote:

>>>

>>> On 13.12.2016 05:21, Baolin Wang wrote:

>>>>

>>>>

>>>> Hi Mathias,

>>>>

>>>> On 12 December 2016 at 23:52, Mathias Nyman

>>>> <mathias.nyman@linux.intel.com> wrote:

>>>>>

>>>>>

>>>>> On 05.12.2016 09:51, Baolin Wang wrote:

>>>>>>

>>>>>>

>>>>>>

>>>>>> If a command event is found on the event ring during an interrupt,

>>>>>> we need to stop the command timer with del_timer(). Since del_timer()

>>>>>> can fail if the timer is running and waiting on the xHCI lock, then

>>>>>> it maybe get the wrong timeout command in

>>>>>> xhci_handle_command_timeout()

>>>>>> if host fetched a new command and updated the xhci->current_cmd in

>>>>>> handle_cmd_completion(). For this situation, we need a way to signal

>>>>>> to the command timer that everything is fine and it should exit.

>>>>>

>>>>>

>>>>>

>>>>>

>>>>> Ah, right, this could actually happen.

>>>>>

>>>>>>

>>>>>>

>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the

>>>>>> number

>>>>>> of pending commands. If we need to cancel the command timer and

>>>>>> del_timer()

>>>>>> succeeds, we decrement the number of pending commands. If del_timer()

>>>>>> fails,

>>>>>> we leave the number of pending commands alone.

>>>>>>

>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will

>>>>>> check

>>>>>> the counter after decrementing it, if the counter

>>>>>> (xhci->current_cmd_pending)

>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If

>>>>>> the

>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means

>>>>>> current

>>>>>> timeout command has been handled by host and host has fetched new

>>>>>> command

>>>>>> as

>>>>>> xhci->current_cmd, then just return and wait for new current command.

>>>>>

>>>>>

>>>>>

>>>>>

>>>>> A counter like this could work.

>>>>>

>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP

>>>>> event, this seems to cover both.

>>>>>

>>>>> quick check, case 1: timeout and cmd completion at the same time.

>>>>>

>>>>> cpu1                                    cpu2

>>>>>

>>>>> queue_command(first), p++ (=1)

>>>>> queue_command(more),

>>>>> --completion irq fires--                -- timer times out at same

>>>>> time--

>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)

>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)

>>>>> del_timer() fail, p (=1, nochange)

>>>>> cur_cmd = list_next(), p++ (=2)

>>>>> unlock(xhci_lock)

>>>>>                                           lock(xhci_lock)

>>>>>                                           p-- (=1)

>>>>>                                           if (p > 0), exit

>>>>> OK works

>>>>>

>>>>> case 2: normal timeout case with ABORT+STOP, no race.

>>>>>

>>>>> cpu1                                    cpu2

>>>>>

>>>>> queue_command(first), p++ (=1)

>>>>> queue_command(more),

>>>>>                                           handle_cmd_timeout()

>>>>>                                           p-- (P=0), don't exit

>>>>>                                           mod_timer(), p++ (P=1)

>>>>>                                           write_abort_bit()

>>>>> handle_cmd_comletion(ABORT)

>>>>> del_timer(), ok, p-- (p = 0)

>>>>> handle_cmd_completion(STOP)

>>>>> del_timer(), fail, (P=0)

>>>>> handle_stopped_cmd_ring()

>>>>> cur_cmd = list_next(), p++ (=1)

>>>>> mod_timer()

>>>>>

>>>>> OK, works, and same for just STOP case, with the only difference that

>>>>> during handle_cmd_completion(STOP) p is decremented (p--)

>>>>

>>>>

>>>>

>>>> Yes, that's the cases what I want to handle, thanks for your explicit

>>>> explanation.

>>>>

>>>

>>> Gave this some more thought over the weekend, and this implementation

>>> doesn't solve the case when the last command times out and races with the

>>> completion handler:

>>>

>>> cpu1                                    cpu2

>>>

>>> queue_command(first), p++ (=1)

>>> --completion irq fires--                -- timer times out at same time--

>>> handle_cmd_completion()                 handle_cmd_timeout(),)

>>> lock(xhci_lock )                        spin_on(xhci_lock)

>>> del_timer() fail, p (=1, nochange)

>>> no more commands, P (=1, nochange)

>>> unlock(xhci_lock)

>>>                                          lock(xhci_lock)

>>>                                          p-- (=0)

>>>                                          p == 0, continue, even if we

>>> should

>>> not.

>>>                                            For this we still need to rely

>>> on

>>> checking cur_cmd == NULL in the timeout function.

>>> (Baolus patch sets it to NULL if there are no more commands pending)

>>

>>

>> As I pointed out in patch 1 of this patchset, this patchset is based

>> on Lu Baolu's new fix patch:

>> usb: xhci: fix possible wild pointer

>> https://www.spinics.net/lists/linux-usb/msg150219.html

>>

>> After applying Baolu's patch, after decrement the counter, we will

>> check the xhci->cur_command if is NULL. So in this situation:

>> cpu1                                    cpu2

>>

>>   queue_command(first), p++ (=1)

>>   --completion irq fires--                -- timer times out at same

>> time--

>>   handle_cmd_completion()                 handle_cmd_timeout(),)

>>   lock(xhci_lock )                        spin_on(xhci_lock)

>>   del_timer() fail, p (=1, nochange)

>>   no more commands, P (=1, nochange)

>>   unlock(xhci_lock)

>>                                           lock(xhci_lock)

>>                                           p-- (=0)

>>                                           no current command, return

>>                                           if (!xhci->current_cmd) {

>>                                                unlock(xhci_lock);

>>                                                return;

>>                                           }

>>

>> It can work.

>

>

> Yes,

>

> What I wanted to say is that as it relies on Baolus patch for that one case

> it seems that patch 2/2 can be replaced by a single line change:

>

> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))

>

> Right?


After checking the code again, I am agree with you. I will resend the
patch with fixing this issue. Thanks.

-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

(Exiting) Baolin Wang Dec. 20, 2016, 6:06 a.m. UTC | #4

Hi,

On 20 December 2016 at 12:29, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> Hi Mathias,

>

> On 12/19/2016 08:13 PM, Mathias Nyman wrote:

>> On 19.12.2016 13:34, Baolin Wang wrote:

>>> Hi Mathias,

>>>

>>> On 19 December 2016 at 18:33, Mathias Nyman

>>> <mathias.nyman@linux.intel.com> wrote:

>>>> On 13.12.2016 05:21, Baolin Wang wrote:

>>>>>

>>>>> Hi Mathias,

>>>>>

>>>>> On 12 December 2016 at 23:52, Mathias Nyman

>>>>> <mathias.nyman@linux.intel.com> wrote:

>>>>>>

>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:

>>>>>>>

>>>>>>>

>>>>>>> If a command event is found on the event ring during an interrupt,

>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()

>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then

>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()

>>>>>>> if host fetched a new command and updated the xhci->current_cmd in

>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal

>>>>>>> to the command timer that everything is fine and it should exit.

>>>>>>

>>>>>>

>>>>>>

>>>>>> Ah, right, this could actually happen.

>>>>>>

>>>>>>>

>>>>>>>

>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number

>>>>>>> of pending commands. If we need to cancel the command timer and

>>>>>>> del_timer()

>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()

>>>>>>> fails,

>>>>>>> we leave the number of pending commands alone.

>>>>>>>

>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will

>>>>>>> check

>>>>>>> the counter after decrementing it, if the counter

>>>>>>> (xhci->current_cmd_pending)

>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the

>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means

>>>>>>> current

>>>>>>> timeout command has been handled by host and host has fetched new

>>>>>>> command

>>>>>>> as

>>>>>>> xhci->current_cmd, then just return and wait for new current command.

>>>>>>

>>>>>>

>>>>>>

>>>>>> A counter like this could work.

>>>>>>

>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP

>>>>>> event, this seems to cover both.

>>>>>>

>>>>>> quick check, case 1: timeout and cmd completion at the same time.

>>>>>>

>>>>>> cpu1                                    cpu2

>>>>>>

>>>>>> queue_command(first), p++ (=1)

>>>>>> queue_command(more),

>>>>>> --completion irq fires--                -- timer times out at same time--

>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)

>>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)

>>>>>> del_timer() fail, p (=1, nochange)

>>>>>> cur_cmd = list_next(), p++ (=2)

>>>>>> unlock(xhci_lock)

>>>>>>                                           lock(xhci_lock)

>>>>>>                                           p-- (=1)

>>>>>>                                           if (p > 0), exit

>>>>>> OK works

>>>>>>

>>>>>> case 2: normal timeout case with ABORT+STOP, no race.

>>>>>>

>>>>>> cpu1                                    cpu2

>>>>>>

>>>>>> queue_command(first), p++ (=1)

>>>>>> queue_command(more),

>>>>>>                                           handle_cmd_timeout()

>>>>>>                                           p-- (P=0), don't exit

>>>>>>                                           mod_timer(), p++ (P=1)

>>>>>>                                           write_abort_bit()

>>>>>> handle_cmd_comletion(ABORT)

>>>>>> del_timer(), ok, p-- (p = 0)

>>>>>> handle_cmd_completion(STOP)

>>>>>> del_timer(), fail, (P=0)

>>>>>> handle_stopped_cmd_ring()

>>>>>> cur_cmd = list_next(), p++ (=1)

>>>>>> mod_timer()

>>>>>>

>>>>>> OK, works, and same for just STOP case, with the only difference that

>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)

>>>>>

>>>>>

>>>>> Yes, that's the cases what I want to handle, thanks for your explicit

>>>>> explanation.

>>>>>

>>>>

>>>> Gave this some more thought over the weekend, and this implementation

>>>> doesn't solve the case when the last command times out and races with the

>>>> completion handler:

>>>>

>>>> cpu1                                    cpu2

>>>>

>>>> queue_command(first), p++ (=1)

>>>> --completion irq fires--                -- timer times out at same time--

>>>> handle_cmd_completion()                 handle_cmd_timeout(),)

>>>> lock(xhci_lock )                        spin_on(xhci_lock)

>>>> del_timer() fail, p (=1, nochange)

>>>> no more commands, P (=1, nochange)

>>>> unlock(xhci_lock)

>>>>                                          lock(xhci_lock)

>>>>                                          p-- (=0)

>>>>                                          p == 0, continue, even if we should

>>>> not.

>>>>                                            For this we still need to rely on

>>>> checking cur_cmd == NULL in the timeout function.

>>>> (Baolus patch sets it to NULL if there are no more commands pending)

>>>

>>> As I pointed out in patch 1 of this patchset, this patchset is based

>>> on Lu Baolu's new fix patch:

>>> usb: xhci: fix possible wild pointer

>>> https://www.spinics.net/lists/linux-usb/msg150219.html

>>>

>>> After applying Baolu's patch, after decrement the counter, we will

>>> check the xhci->cur_command if is NULL. So in this situation:

>>> cpu1                                    cpu2

>>>

>>>   queue_command(first), p++ (=1)

>>>   --completion irq fires--                -- timer times out at same time--

>>>   handle_cmd_completion()                 handle_cmd_timeout(),)

>>>   lock(xhci_lock )                        spin_on(xhci_lock)

>>>   del_timer() fail, p (=1, nochange)

>>>   no more commands, P (=1, nochange)

>>>   unlock(xhci_lock)

>>>                                           lock(xhci_lock)

>>>                                           p-- (=0)

>>>                                           no current command, return

>>>                                           if (!xhci->current_cmd) {

>>>                                                unlock(xhci_lock);

>>>                                                return;

>>>                                           }

>>>

>>> It can work.

>>

>> Yes,

>>

>> What I wanted to say is that as it relies on Baolus patch for that one case

>> it seems that patch 2/2 can be replaced by a single line change:

>>

>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))

>>

>> Right?

>>

>> -Mathias

>>

>

> It seems that the watch dog algorithm for command queue becomes

> more and more complicated and hard for maintain. I am also seeing

> another case where a command may lose the chance to be tracked by

> the watch dog timer.

>

> Say,

>

> queue_command(the only command in queue)

>   - completion irq fires--                - timer times out at same time--      - another command enqueue--

>   - lock(xhci_lock )                         - spin_on(xhci_lock)                           - spin_on(xhci_lock)

>   - del_timer() fail

>   - free the command and

>     set current_cmd to NULL

>   - unlock(xhci_lock)

>                                                                                                                 - lock(xhci_lock)

>                                                                                                                 - queue_command()(timer will

>                                                                                                                    not rescheduled since the timer

>                                                                                                                    is pending)


In this case, since the command timer was fired and you did not re-add
the command timer, why here timer is pending? Maybe I missed
something? Thanks.

>                                                      - lock(xhci_lock)

>                                                      - no current command

>                                                      - return

>

> As the result, the later command isn't under track of the watch dog.

> If hardware fails to response to this command, kernel will hang in

> the thread which is waiting for the completion of the command.

>

> I can write a patch to fix this and cc stable kernel as well. For long

> term, in order to make it simple and easy to maintain, how about

> allocating a watch dog timer for each command? It could be part

> of the command structure and be managed just like the life cycle

> of a command structure.

>

> I can write a patch for review and discussion, if you think this

> change is possible.

>

> Best regards,

> Lu Baolu




-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

(Exiting) Baolin Wang Dec. 20, 2016, 6:46 a.m. UTC | #5

On 20 December 2016 at 14:39, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> Hi,

>

> On 12/20/2016 02:06 PM, Baolin Wang wrote:

>> Hi,

>>

>> On 20 December 2016 at 12:29, Lu Baolu <baolu.lu@linux.intel.com> wrote:

>>> Hi Mathias,

>>>

>>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:

>>>> On 19.12.2016 13:34, Baolin Wang wrote:

>>>>> Hi Mathias,

>>>>>

>>>>> On 19 December 2016 at 18:33, Mathias Nyman

>>>>> <mathias.nyman@linux.intel.com> wrote:

>>>>>> On 13.12.2016 05:21, Baolin Wang wrote:

>>>>>>> Hi Mathias,

>>>>>>>

>>>>>>> On 12 December 2016 at 23:52, Mathias Nyman

>>>>>>> <mathias.nyman@linux.intel.com> wrote:

>>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:

>>>>>>>>>

>>>>>>>>> If a command event is found on the event ring during an interrupt,

>>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()

>>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then

>>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()

>>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in

>>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal

>>>>>>>>> to the command timer that everything is fine and it should exit.

>>>>>>>>

>>>>>>>>

>>>>>>>> Ah, right, this could actually happen.

>>>>>>>>

>>>>>>>>>

>>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number

>>>>>>>>> of pending commands. If we need to cancel the command timer and

>>>>>>>>> del_timer()

>>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()

>>>>>>>>> fails,

>>>>>>>>> we leave the number of pending commands alone.

>>>>>>>>>

>>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will

>>>>>>>>> check

>>>>>>>>> the counter after decrementing it, if the counter

>>>>>>>>> (xhci->current_cmd_pending)

>>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the

>>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means

>>>>>>>>> current

>>>>>>>>> timeout command has been handled by host and host has fetched new

>>>>>>>>> command

>>>>>>>>> as

>>>>>>>>> xhci->current_cmd, then just return and wait for new current command.

>>>>>>>>

>>>>>>>>

>>>>>>>> A counter like this could work.

>>>>>>>>

>>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP

>>>>>>>> event, this seems to cover both.

>>>>>>>>

>>>>>>>> quick check, case 1: timeout and cmd completion at the same time.

>>>>>>>>

>>>>>>>> cpu1                                    cpu2

>>>>>>>>

>>>>>>>> queue_command(first), p++ (=1)

>>>>>>>> queue_command(more),

>>>>>>>> --completion irq fires--                -- timer times out at same time--

>>>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)

>>>>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)

>>>>>>>> del_timer() fail, p (=1, nochange)

>>>>>>>> cur_cmd = list_next(), p++ (=2)

>>>>>>>> unlock(xhci_lock)

>>>>>>>>                                           lock(xhci_lock)

>>>>>>>>                                           p-- (=1)

>>>>>>>>                                           if (p > 0), exit

>>>>>>>> OK works

>>>>>>>>

>>>>>>>> case 2: normal timeout case with ABORT+STOP, no race.

>>>>>>>>

>>>>>>>> cpu1                                    cpu2

>>>>>>>>

>>>>>>>> queue_command(first), p++ (=1)

>>>>>>>> queue_command(more),

>>>>>>>>                                           handle_cmd_timeout()

>>>>>>>>                                           p-- (P=0), don't exit

>>>>>>>>                                           mod_timer(), p++ (P=1)

>>>>>>>>                                           write_abort_bit()

>>>>>>>> handle_cmd_comletion(ABORT)

>>>>>>>> del_timer(), ok, p-- (p = 0)

>>>>>>>> handle_cmd_completion(STOP)

>>>>>>>> del_timer(), fail, (P=0)

>>>>>>>> handle_stopped_cmd_ring()

>>>>>>>> cur_cmd = list_next(), p++ (=1)

>>>>>>>> mod_timer()

>>>>>>>>

>>>>>>>> OK, works, and same for just STOP case, with the only difference that

>>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)

>>>>>>>

>>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit

>>>>>>> explanation.

>>>>>>>

>>>>>> Gave this some more thought over the weekend, and this implementation

>>>>>> doesn't solve the case when the last command times out and races with the

>>>>>> completion handler:

>>>>>>

>>>>>> cpu1                                    cpu2

>>>>>>

>>>>>> queue_command(first), p++ (=1)

>>>>>> --completion irq fires--                -- timer times out at same time--

>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)

>>>>>> lock(xhci_lock )                        spin_on(xhci_lock)

>>>>>> del_timer() fail, p (=1, nochange)

>>>>>> no more commands, P (=1, nochange)

>>>>>> unlock(xhci_lock)

>>>>>>                                          lock(xhci_lock)

>>>>>>                                          p-- (=0)

>>>>>>                                          p == 0, continue, even if we should

>>>>>> not.

>>>>>>                                            For this we still need to rely on

>>>>>> checking cur_cmd == NULL in the timeout function.

>>>>>> (Baolus patch sets it to NULL if there are no more commands pending)

>>>>> As I pointed out in patch 1 of this patchset, this patchset is based

>>>>> on Lu Baolu's new fix patch:

>>>>> usb: xhci: fix possible wild pointer

>>>>> https://www.spinics.net/lists/linux-usb/msg150219.html

>>>>>

>>>>> After applying Baolu's patch, after decrement the counter, we will

>>>>> check the xhci->cur_command if is NULL. So in this situation:

>>>>> cpu1                                    cpu2

>>>>>

>>>>>   queue_command(first), p++ (=1)

>>>>>   --completion irq fires--                -- timer times out at same time--

>>>>>   handle_cmd_completion()                 handle_cmd_timeout(),)

>>>>>   lock(xhci_lock )                        spin_on(xhci_lock)

>>>>>   del_timer() fail, p (=1, nochange)

>>>>>   no more commands, P (=1, nochange)

>>>>>   unlock(xhci_lock)

>>>>>                                           lock(xhci_lock)

>>>>>                                           p-- (=0)

>>>>>                                           no current command, return

>>>>>                                           if (!xhci->current_cmd) {

>>>>>                                                unlock(xhci_lock);

>>>>>                                                return;

>>>>>                                           }

>>>>>

>>>>> It can work.

>>>> Yes,

>>>>

>>>> What I wanted to say is that as it relies on Baolus patch for that one case

>>>> it seems that patch 2/2 can be replaced by a single line change:

>>>>

>>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))

>>>>

>>>> Right?

>>>>

>>>> -Mathias

>>>>

>>> It seems that the watch dog algorithm for command queue becomes

>>> more and more complicated and hard for maintain. I am also seeing

>>> another case where a command may lose the chance to be tracked by

>>> the watch dog timer.

>>>

>>> Say,

>>>

>>> queue_command(the only command in queue)

>>>   - completion irq fires--                - timer times out at same time--      - another command enqueue--

>>>   - lock(xhci_lock )                         - spin_on(xhci_lock)                           - spin_on(xhci_lock)

>>>   - del_timer() fail

>>>   - free the command and

>>>     set current_cmd to NULL

>>>   - unlock(xhci_lock)

>>>                                                                                                                 - lock(xhci_lock)

>>>                                                                                                                 - queue_command()(timer will

>>>                                                                                                                    not rescheduled since the timer

>>>                                                                                                                    is pending)

>> In this case, since the command timer was fired and you did not re-add

>> the command timer, why here timer is pending? Maybe I missed

>> something? Thanks.

>

> In queue_command(),

>

>         /* if there are no other commands queued we start the timeout timer */

>         if (list_is_singular(&xhci->cmd_list) &&

>             !timer_pending(&xhci->cmd_timer)) {

>                 xhci->current_cmd = cmd;

>                 mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);

>         }

>

> timer_pending() will return true if the timer is fired, but the function is still

> running on another CPU. Do I understand it right?


From my understanding, if the timer was fired, no matter the timeout
function is running or finished, timer_pending() will return false.
Please correct me if I made mistakes. Thanks.

>

> Best regards,

> Lu Baolu

>

>>>                                                      - lock(xhci_lock)

>>>                                                      - no current command

>>>                                                      - return

>>>

>>> As the result, the later command isn't under track of the watch dog.

>>> If hardware fails to response to this command, kernel will hang in

>>> the thread which is waiting for the completion of the command.

>>>

>>> I can write a patch to fix this and cc stable kernel as well. For long

>>> term, in order to make it simple and easy to maintain, how about

>>> allocating a watch dog timer for each command? It could be part

>>> of the command structure and be managed just like the life cycle

>>> of a command structure.

>>>

>>> I can write a patch for review and discussion, if you think this

>>> change is possible.

>>>

>>> Best regards,

>>> Lu Baolu

>>

>>

>




-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mathias Nyman Dec. 20, 2016, 3:13 p.m. UTC | #6

On 20.12.2016 09:30, Baolin Wang wrote:
...

Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes

Its based on 4.9
It includes a few other patches just to avoid conflicts and  make my life easier

Interesting patches are:

ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9

The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.

Still completely untested, (well it compiles)

Felipe gave instructions how to modify dwc3 driver to timeout on address
devicecommands to test these, I'll try to set that up.

All additional testing is welcome, especially if you can trigger timeouts
and races

-Mathias

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

(Exiting) Baolin Wang Dec. 21, 2016, 2:22 a.m. UTC | #7

Hi Mathias,

On 20 December 2016 at 23:13, Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:

> ...

>

> Alright, I gathered all current work related to xhci races and timeouts

> and put them into a branch:

>

> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git

> timeout_race_fixes

>

> Its based on 4.9

> It includes a few other patches just to avoid conflicts and  make my life

> easier

>

> Interesting patches are:

>

> ee4eb91 xhci: remove unnecessary check for pending timer

> 0cba67d xhci: detect stop endpoint race using pending timer instead of

> counter.

> 4f2535f xhci: Handle command completion and timeout race

> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort

> command

> 529a5a0 usb: xhci: fix possible wild pointer

> 4766555 xhci: Fix race related to abort operation

> de834a3 xhci: Use delayed_work instead of timer for command timeout

> 69973b8 Linux 4.9

>

> The fixes for command queue races will go to usb-linus and stable, the

> reworks for stop ep watchdog timer will go to usb-next.


How about applying below patch in your 'timeout_race_fixes' branch?
[PATCH] usb: host: xhci: Clean up commands when stop endpoint command is timeout
https://lkml.org/lkml/2016/12/13/94

>

> Still completely untested, (well it compiles)

>

> Felipe gave instructions how to modify dwc3 driver to timeout on address

> devicecommands to test these, I'll try to set that up.

>

> All additional testing is welcome, especially if you can trigger timeouts

> and races


I will try to test these patches.

-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Baolu Lu Dec. 21, 2016, 6:17 a.m. UTC | #8

Hi Mathias,

I have some comments for the implementation of xhci_abort_cmd_ring() below.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:

> ...

>

> Alright, I gathered all current work related to xhci races and timeouts

> and put them into a branch:

>

> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes

>

> Its based on 4.9

> It includes a few other patches just to avoid conflicts and  make my life easier

>

> Interesting patches are:

>

> ee4eb91 xhci: remove unnecessary check for pending timer

> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.

> 4f2535f xhci: Handle command completion and timeout race

> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command

> 529a5a0 usb: xhci: fix possible wild pointer

> 4766555 xhci: Fix race related to abort operation

> de834a3 xhci: Use delayed_work instead of timer for command timeout

> 69973b8 Linux 4.9

>

> The fixes for command queue races will go to usb-linus and stable, the

> reworks for stop ep watchdog timer will go to usb-next.

>

> Still completely untested, (well it compiles)

>

> Felipe gave instructions how to modify dwc3 driver to timeout on address

> devicecommands to test these, I'll try to set that up.

>

> All additional testing is welcome, especially if you can trigger timeouts

> and races

>

> -Mathias

>

>


Below is the latest code. I put my comments in line.

 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
 323 {
 324         u64 temp_64;
 325         int ret;
 326
 327         xhci_dbg(xhci, "Abort command ring\n");
 328
 329         reinit_completion(&xhci->cmd_ring_stop_completion);
 330
 331         temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
 332         xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
 333                         &xhci->op_regs->cmd_ring);

We should hold xhci->lock when we are modifying xhci registers
at runtime.


 334
 335         /* Section 4.6.1.2 of xHCI 1.0 spec says software should
 336          * time the completion od all xHCI commands, including

s/od/of/g

 337          * the Command Abort operation. If software doesn't see
 338          * CRR negated in a timely manner (e.g. longer than 5
 339          * seconds), then it should assume that the there are
 340          * larger problems with the xHC and assert HCRST.
 341          */
 342         ret = xhci_handshake(&xhci->op_regs->cmd_ring,
 343                         CMD_RING_RUNNING, 0, 5 * 1000 * 1000);
 344         if (ret < 0) {
 345                 /* we are about to kill xhci, give it one more chance */

The retry of setting CMD_RING_ABORT is not necessary according to
previous discussion. We have cleaned code for second try in
xhci_handle_command_timeout(). Need to clean up here as well.

 346                 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
 347                               &xhci->op_regs->cmd_ring);
 348                 udelay(1000);
 349                 ret = xhci_handshake(&xhci->op_regs->cmd_ring,
 350                                      CMD_RING_RUNNING, 0, 3 * 1000 * 1000);
 351                 if (ret < 0) {
 352                         xhci_err(xhci, "Stopped the command ring failed, "
 353                                  "maybe the host is dead\n");
 354                         xhci->xhc_state |= XHCI_STATE_DYING;
 355                         xhci_halt(xhci);
 356                         return -ESHUTDOWN;
 357                 }
 358         }
 359         /*
 360          * Writing the CMD_RING_ABORT bit should cause a cmd completion event,
 361          * however on some host hw the CMD_RING_RUNNING bit is correctly cleared
 362          * but the completion event in never sent. Wait 2 secs (arbitrary

s/in never sent/is never sent/g

 363          * number) to handle those cases after negation of CMD_RING_RUNNING.
 364          */

This should be implemented with a quirk bit. It's not common for all host controllers.

 365         if (!wait_for_completion_timeout(&xhci->cmd_ring_stop_completion,
 366                                          2 * HZ)) {
 367                 xhci_dbg(xhci, "No stop event for abort, ring start fail?\n");
 368                 xhci_cleanup_command_queue(xhci);
 369         } else {
 370                 unsigned long flags;
 371
 372                 spin_lock_irqsave(&xhci->lock, flags);
 373                 xhci_handle_stopped_cmd_ring(xhci, xhci_next_queued_cmd(xhci));
 374                 spin_unlock_irqrestore(&xhci->lock, flags);
 375         }
 376         return 0;
 377 }

Best regards,
Lu Baolu
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Baolu Lu Dec. 21, 2016, 6:57 a.m. UTC | #9

Hi Mathias,

I have some comments for the implementation of
xhci_handle_command_timeout() as well.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:

> ...

>

> Alright, I gathered all current work related to xhci races and timeouts

> and put them into a branch:

>

> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes

>

> Its based on 4.9

> It includes a few other patches just to avoid conflicts and  make my life easier

>

> Interesting patches are:

>

> ee4eb91 xhci: remove unnecessary check for pending timer

> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.

> 4f2535f xhci: Handle command completion and timeout race

> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command

> 529a5a0 usb: xhci: fix possible wild pointer

> 4766555 xhci: Fix race related to abort operation

> de834a3 xhci: Use delayed_work instead of timer for command timeout

> 69973b8 Linux 4.9

>

> The fixes for command queue races will go to usb-linus and stable, the

> reworks for stop ep watchdog timer will go to usb-next.

>

> Still completely untested, (well it compiles)

>

> Felipe gave instructions how to modify dwc3 driver to timeout on address

> devicecommands to test these, I'll try to set that up.

>

> All additional testing is welcome, especially if you can trigger timeouts

> and races

>

> -Mathias

>

>


I post the code below and add my comments in line.

1276 void xhci_handle_command_timeout(struct work_struct *work)
1277 {
1278         struct xhci_hcd *xhci;
1279         int ret;
1280         unsigned long flags;
1281         u64 hw_ring_state;
1282
1283         xhci = container_of(to_delayed_work(work), struct xhci_hcd, cmd_timer);
1284
1285         spin_lock_irqsave(&xhci->lock, flags);
1286
1287         /*
1288          * If timeout work is pending, or current_cmd is NULL, it means we
1289          * raced with command completion. Command is handled so just return.
1290          */
1291         if (!xhci->current_cmd || delayed_work_pending(&xhci->cmd_timer)) {
1292                 spin_unlock_irqrestore(&xhci->lock, flags);
1293                 return;
1294         }
1295         /* mark this command to be cancelled */
1296         xhci->current_cmd->status = COMP_CMD_ABORT;
1297
1298         /* Make sure command ring is running before aborting it */
1299         hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
1300         if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
1301             (hw_ring_state & CMD_RING_RUNNING))  {
1302                 /* Prevent new doorbell, and start command abort */
1303                 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
1304                 spin_unlock_irqrestore(&xhci->lock, flags);
1305                 xhci_dbg(xhci, "Command timeout\n");
1306                 ret = xhci_abort_cmd_ring(xhci);
1307                 if (unlikely(ret == -ESHUTDOWN)) {
1308                         xhci_err(xhci, "Abort command ring failed\n");
1309                         xhci_cleanup_command_queue(xhci);
1310                         usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
1311                         xhci_dbg(xhci, "xHCI host controller is dead.\n");
1312                 }
1313                 return;
1314         }
1315
1316         /* host removed. Bail out */
1317         if (xhci->xhc_state & XHCI_STATE_REMOVING) {
1318                 spin_unlock_irqrestore(&xhci->lock, flags);
1319                 xhci_dbg(xhci, "host removed, ring start fail?\n");
1320                 xhci_cleanup_command_queue(xhci);
1321                 return;
1322         }

I think this part of code should be moved up to line 1295.

1323
1324         /* command timeout on stopped ring, ring can't be aborted */
1325         xhci_dbg(xhci, "Command timeout on stopped ring\n");
1326         xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
1327         spin_unlock_irqrestore(&xhci->lock, flags);

This part of code is tricky. I have no idea about in which case should this
code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
here, right?

1328         return;
1329 }

Best regards,
Lu Baolu
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

OGAWA Hirofumi Dec. 21, 2016, 2:10 p.m. UTC | #10

Mathias Nyman <mathias.nyman@linux.intel.com> writes:

>> Below is the latest code. I put my comments in line.

>>

>>   322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)

>>   323 {

>>   324         u64 temp_64;

>>   325         int ret;

>>   326

>>   327         xhci_dbg(xhci, "Abort command ring\n");

>>   328

>>   329         reinit_completion(&xhci->cmd_ring_stop_completion);

>>   330

>>   331         temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);

>>   332         xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,

>>   333                         &xhci->op_regs->cmd_ring);

>>

>> We should hold xhci->lock when we are modifying xhci registers

>> at runtime.

>>

>

> Makes sense, but we need to unlock it before sleeping or waiting for

> completion.  I need to look into that in more detail.

>

> But this was an issue already before these changes.


We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
is for taking lock for register though, I guess it should be enough just
lock around of read=>write of ->cmd_ring if need lock.

[Rather ->cmd_ring user should check CMD_RING_STATE_ABORTED state.]

> But then again I really like OGAWA Hiroumi's solution that separates the

> command ring stopping from aborting commands and restarting the ring.

>

> The current way of always restarting the command ring as a response to

> a stop command ring command really limits its usage.

>

> So, with this in mind most reasonable would be to

> 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable

> 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only

> 3. remove unnecessary second abort try as a separate patch, send to usb-next

> 4. remove polling for the Command ring running (CRR), waiting for completion

>     is enough, if completion times out then we can check CRR. for usb-next


I think we should check both of CRR and even of stop completion. Because
CRR and stop completion was not same time (both can be first).

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mathias Nyman Dec. 23, 2016, 12:54 p.m. UTC | #11

On 22.12.2016 03:46, Lu Baolu wrote:
> Hi,

>

> On 12/21/2016 11:18 PM, OGAWA Hirofumi wrote:

>> Mathias Nyman <mathias.nyman@linux.intel.com> writes:

>>

>>>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what

>>>> is for taking lock for register though, I guess it should be enough just

>>>> lock around of read=>write of ->cmd_ring if need lock.

>>> After your patch it should be enough to have the lock only while

>>> reading and writing the cmd_ring register.

>>>

>>> If we want a locking fix that applies more easily to older stable

>>> releases before your change then the lock needs to cover set

>>> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop

>>> checking CRR bit.  Otherwise the stop cmd ring interrupt handler may

>>> restart the ring just before we start checing CRR. The stop cmd ring

>>> interrupt will set the CMD_RING_STATE_ABORTED to

>>> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt

>>> handler.

>> Just for record (no chance to make patch I myself for now, sorry), while

>> checking locking slightly, I noticed unrelated missing locking.

>>

>> 	xhci_cleanup_command_queue()

>>

>> We are calling it without locking, but we need to lock for accessing list.

>

> Yeah. I can make the patch.

>


Force updated timeout_race_fixes branch.

I'll be out for a few days during xmas, continue testing after that

-Mathias

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

(Exiting) Baolin Wang Jan. 3, 2017, 6:20 a.m. UTC | #12

On 2 January 2017 at 22:57, Mathias Nyman <mathias.nyman@linux.intel.com> wrote:
> On 27.12.2016 05:07, Baolin Wang wrote:

>>

>> Hi,

>>

>> On 21 December 2016 at 21:00, Mathias Nyman

>> <mathias.nyman@linux.intel.com> wrote:

>>>

>>> On 21.12.2016 04:22, Baolin Wang wrote:

>>>>

>>>>

>>>> Hi Mathias,

>>>>

>>>> On 20 December 2016 at 23:13, Mathias Nyman

>>>> <mathias.nyman@linux.intel.com> wrote:

>>>>>

>>>>>

>>>>> On 20.12.2016 09:30, Baolin Wang wrote:

>>>>> ...

>>>>>

>>>>> Alright, I gathered all current work related to xhci races and timeouts

>>>>> and put them into a branch:

>>>>>

>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git

>>>>> timeout_race_fixes

>>>>>

>>>>> Its based on 4.9

>>>>> It includes a few other patches just to avoid conflicts and  make my

>>>>> life

>>>>> easier

>>>>>

>>>>> Interesting patches are:

>>>>>

>>>>> ee4eb91 xhci: remove unnecessary check for pending timer

>>>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of

>>>>> counter.

>>>>> 4f2535f xhci: Handle command completion and timeout race

>>>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort

>>>>> command

>>>>> 529a5a0 usb: xhci: fix possible wild pointer

>>>>> 4766555 xhci: Fix race related to abort operation

>>>>> de834a3 xhci: Use delayed_work instead of timer for command timeout

>>>>> 69973b8 Linux 4.9

>>>>>

>>>>> The fixes for command queue races will go to usb-linus and stable, the

>>>>> reworks for stop ep watchdog timer will go to usb-next.

>>>>

>>>>

>>>>

>>>> How about applying below patch in your 'timeout_race_fixes' branch?

>>>> [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is

>>>> timeout

>>>> https://lkml.org/lkml/2016/12/13/94

>>>>

>>>

>>> usb_hc_died() should eventyally end up calling xhci_mem_cleanup()

>>> which will cleanup the command queue. But I need to look into that

>>

>>

>> usb_hc_died() did not call xhci_mem_cleanup() to clean up command

>> queue, it just disconnects all children devices attached on the dying

>> hub. I did not find where it will clean up the command queue when

>> issuing usb_hc_died(). Also before issuing usb_hc_died() in

>> xhci_handle_command_timeout(), we will call

>> xhci_cleanup_command_queue(). I think it should same as in

>> xhci_stop_endpoint_command_watchdog().

>>

>

> You're right, it doesn't call xhci_mem_cleanup.

> Need to look at this after getting first series of fixes to usb-linus


Thanks.

-- 
Baolin.wang
Best Regards
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[2/2] usb: host: xhci: Handle the right timeout command

Commit Message

Comments

Patch