bus: mhi: Command completion workaround

Message ID 1615376308-1941-1-git-send-email-loic.poulain@linaro.org
State Superseded

Commit Message

Loic Poulain March 10, 2021, 11:38 a.m. UTC
Some buggy hardware (e.g. sdx24) may report the current command
ring write pointer (wp) instead of the command completion pointer.
This is obviously wrong and causes a completion timeout. We can
however deal with that situation by completing the cmd n-1 element,
which is what the device actually completes.

Signed-off-by: Loic Poulain <loic.poulain@linaro.org>

---
 drivers/bus/mhi/core/main.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

-- 
2.7.4

Comments

Jeffrey Hugo March 10, 2021, 4:19 p.m. UTC | #1
On 3/10/2021 4:38 AM, Loic Poulain wrote:
> Some buggy hardwares (e.g sdx24) may report the current command
> ring wp pointer instead of the command completion pointer. It's
> obviously wrong, causing completion timeout. We can however deal
> with that situation by completing the cmd n-1 element, which is
> what the device actually completes.
>
> Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> ---
>   drivers/bus/mhi/core/main.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
>
> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> index 16b9640..3e3c520 100644
> --- a/drivers/bus/mhi/core/main.c
> +++ b/drivers/bus/mhi/core/main.c
> @@ -707,6 +707,7 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
>   {
>   	dma_addr_t ptr = MHI_TRE_GET_EV_PTR(tre);
>   	struct mhi_cmd *cmd_ring = &mhi_cntrl->mhi_cmd[PRIMARY_CMD_RING];
> +	struct device *dev = &mhi_cntrl->mhi_dev->dev;
>   	struct mhi_ring *mhi_ring = &cmd_ring->ring;
>   	struct mhi_tre *cmd_pkt;
>   	struct mhi_chan *mhi_chan;
> @@ -714,6 +715,23 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
>
>   	cmd_pkt = mhi_to_virtual(mhi_ring, ptr);
>
> +	if (unlikely(cmd_pkt == mhi_ring->wp)) {
> +		/* Some buggy hardwares (e.g sdx24) sometimes report the current
> +		 * command ring wp pointer instead of the command completion
> +		 * pointer. It's obviously wrong, causing completion timeout. We
> +		 * can however deal with that situation by completing the cmd
> +		 * n-1 element.
> +		 */
> +		void *ring_ptr = (void *)cmd_pkt - mhi_ring->el_size;
> +
> +		if (ring_ptr < mhi_ring->base)
> +			ring_ptr += mhi_ring->len;
> +
> +		cmd_pkt = ring_ptr;
> +
> +		dev_warn(dev, "Bad completion pointer (ptr == ring_wp)\n");

Is there value in having this warning every time?  I wonder if a _once 
version would be better to not flood the kernel log.  Although this is 
only for commands, which shouldn't be frequent, so maybe that is the 
implicit rate limiter.

What do you think?

> +	}
> +
>   	chan = MHI_TRE_GET_CMD_CHID(cmd_pkt);
>   	mhi_chan = &mhi_cntrl->mhi_chan[chan];
>   	write_lock_bh(&mhi_chan->lock);
>



-- 
Jeffrey Hugo
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
Hemant Kumar March 10, 2021, 8:43 p.m. UTC | #2
Hi Loic,

On 3/10/21 3:38 AM, Loic Poulain wrote:
> Some buggy hardwares (e.g sdx24) may report the current command
> ring wp pointer instead of the command completion pointer. It's
> obviously wrong, causing completion timeout. We can however deal
> with that situation by completing the cmd n-1 element, which is
> what the device actually completes.
>
> Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> ---
>   drivers/bus/mhi/core/main.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
>
> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> index 16b9640..3e3c520 100644
> --- a/drivers/bus/mhi/core/main.c
> +++ b/drivers/bus/mhi/core/main.c
> @@ -707,6 +707,7 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
>   {
>   	dma_addr_t ptr = MHI_TRE_GET_EV_PTR(tre);
>   	struct mhi_cmd *cmd_ring = &mhi_cntrl->mhi_cmd[PRIMARY_CMD_RING];
> +	struct device *dev = &mhi_cntrl->mhi_dev->dev;
>   	struct mhi_ring *mhi_ring = &cmd_ring->ring;
>   	struct mhi_tre *cmd_pkt;
>   	struct mhi_chan *mhi_chan;
> @@ -714,6 +715,23 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
>
>   	cmd_pkt = mhi_to_virtual(mhi_ring, ptr);
>
> +	if (unlikely(cmd_pkt == mhi_ring->wp)) {

As per the spec: The location of the command ring read pointer is
reported to the host on the command completion events in the primary
event ring.

If the device is buggy and reports the WP instead of the RP, we should
not work around it by processing WP - 1. We can print a warning if
cmd_pkt != mhi_ring->rp and let the command completion time out. This
needs to be fixed on the device. We cannot accommodate a device-side bug
on the host side.
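
For illustration, the alternative suggested here (validate the reported pointer against the read pointer, warn on a mismatch, and skip the completion so the command times out) could be sketched in userspace as follows. This is only a model under stated assumptions: struct cmd_ring_model and completion_ptr_valid() are hypothetical names, not part of the MHI driver, and fprintf() stands in for dev_warn().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified view of a command ring (not struct mhi_ring). */
struct cmd_ring_model {
	const void *rp;	/* element the host expects to be completed next */
};

/*
 * Return true when the completion event points at the expected element.
 * On a mismatch, log a warning and return false so the caller skips the
 * completion and the pending command is left to time out.
 */
static bool completion_ptr_valid(const struct cmd_ring_model *ring,
				 const void *reported)
{
	if (reported != ring->rp) {
		fprintf(stderr, "Bad completion pointer: got %p, expected %p\n",
			reported, ring->rp);
		return false;
	}

	return true;
}
```

With this approach the host never completes an element the device did not explicitly report, at the cost of a timeout on affected devices.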

> +		/* Some buggy hardwares (e.g sdx24) sometimes report the current
> +		 * command ring wp pointer instead of the command completion
> +		 * pointer. It's obviously wrong, causing completion timeout. We
> +		 * can however deal with that situation by completing the cmd
> +		 * n-1 element.
> +		 */
> +		void *ring_ptr = (void *)cmd_pkt - mhi_ring->el_size;
> +
> +		if (ring_ptr < mhi_ring->base)
> +			ring_ptr += mhi_ring->len;
> +
> +		cmd_pkt = ring_ptr;
> +
> +		dev_warn(dev, "Bad completion pointer (ptr == ring_wp)\n");
> +	}
> +
>   	chan = MHI_TRE_GET_CMD_CHID(cmd_pkt);
>   	mhi_chan = &mhi_cntrl->mhi_chan[chan];
>   	write_lock_bh(&mhi_chan->lock);
>


Hi Mani,

What do you think about this workaround?

Thanks,
Hemant
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Loic Poulain March 11, 2021, 7:56 a.m. UTC | #3
Hi Hemant,

On Wed, 10 Mar 2021 at 21:43, Hemant Kumar <hemantk@codeaurora.org> wrote:
>
> Hi Loic,
>
> On 3/10/21 3:38 AM, Loic Poulain wrote:
> > Some buggy hardwares (e.g sdx24) may report the current command
> > ring wp pointer instead of the command completion pointer. It's
> > obviously wrong, causing completion timeout. We can however deal
> > with that situation by completing the cmd n-1 element, which is
> > what the device actually completes.
> >
> > Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> > ---
> >   drivers/bus/mhi/core/main.c | 18 ++++++++++++++++++
> >   1 file changed, 18 insertions(+)
> >
> > diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> > index 16b9640..3e3c520 100644
> > --- a/drivers/bus/mhi/core/main.c
> > +++ b/drivers/bus/mhi/core/main.c
> > @@ -707,6 +707,7 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
> >   {
> >       dma_addr_t ptr = MHI_TRE_GET_EV_PTR(tre);
> >       struct mhi_cmd *cmd_ring = &mhi_cntrl->mhi_cmd[PRIMARY_CMD_RING];
> > +     struct device *dev = &mhi_cntrl->mhi_dev->dev;
> >       struct mhi_ring *mhi_ring = &cmd_ring->ring;
> >       struct mhi_tre *cmd_pkt;
> >       struct mhi_chan *mhi_chan;
> > @@ -714,6 +715,23 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
> >
> >       cmd_pkt = mhi_to_virtual(mhi_ring, ptr);
> >
> > +     if (unlikely(cmd_pkt == mhi_ring->wp)) {
> As per spec : The location of the command ring read pointer is reported
> to the host on the command completion events in the primary event ring.
>
> If device is buggy and updates with WP instead of Rp, we should not
> workaround it by processing Wp - 1. We can print a warning if cmd_pkt !=
> mhi_ring->rp and let the command completion timeout. This needs to be
> fixed by device. We can not accommodate device side bug in host side.


I see your point, but here the goal is not to accommodate the device
but the users of such a 'buggy' device. The kernel has a ton of
'quirks' in various drivers. I'm not a fan of this, but my arguments
are that:
- It captures a behavior that was not captured until now
- It works around an issue without any impact on non-'buggy' devices
- It clearly prints a warning to highlight that it's a known issue
  that should be fixed
- Fixing devices in the wild is quite complex, and we may have to live
  with it.

Regards,
Loic
Loic Poulain March 11, 2021, 8:05 a.m. UTC | #4
Hi Jeffrey,

On Wed, 10 Mar 2021 at 17:19, Jeffrey Hugo <jhugo@codeaurora.org> wrote:
>
> On 3/10/2021 4:38 AM, Loic Poulain wrote:
> > Some buggy hardwares (e.g sdx24) may report the current command
> > ring wp pointer instead of the command completion pointer. It's
> > obviously wrong, causing completion timeout. We can however deal
> > with that situation by completing the cmd n-1 element, which is
> > what the device actually completes.
> >
> > Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> > ---
> >   drivers/bus/mhi/core/main.c | 18 ++++++++++++++++++
> >   1 file changed, 18 insertions(+)
> >
> > diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> > index 16b9640..3e3c520 100644
> > --- a/drivers/bus/mhi/core/main.c
> > +++ b/drivers/bus/mhi/core/main.c
> > @@ -707,6 +707,7 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
> >   {
> >       dma_addr_t ptr = MHI_TRE_GET_EV_PTR(tre);
> >       struct mhi_cmd *cmd_ring = &mhi_cntrl->mhi_cmd[PRIMARY_CMD_RING];
> > +     struct device *dev = &mhi_cntrl->mhi_dev->dev;
> >       struct mhi_ring *mhi_ring = &cmd_ring->ring;
> >       struct mhi_tre *cmd_pkt;
> >       struct mhi_chan *mhi_chan;
> > @@ -714,6 +715,23 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
> >
> >       cmd_pkt = mhi_to_virtual(mhi_ring, ptr);
> >
> > +     if (unlikely(cmd_pkt == mhi_ring->wp)) {
> > +             /* Some buggy hardwares (e.g sdx24) sometimes report the current
> > +              * command ring wp pointer instead of the command completion
> > +              * pointer. It's obviously wrong, causing completion timeout. We
> > +              * can however deal with that situation by completing the cmd
> > +              * n-1 element.
> > +              */
> > +             void *ring_ptr = (void *)cmd_pkt - mhi_ring->el_size;
> > +
> > +             if (ring_ptr < mhi_ring->base)
> > +                     ring_ptr += mhi_ring->len;
> > +
> > +             cmd_pkt = ring_ptr;
> > +
> > +             dev_warn(dev, "Bad completion pointer (ptr == ring_wp)\n");
>
> Is there value in having this warning every time?  I wonder if a _once
> version would be better to not flood the kernel log.  Although this is
> only for commands, which shouldn't be frequent, so maybe that is the
> implicit rate limiter.
>
> What do you think?


As you said, it's kind of self rate-limited because command operations
are infrequent, mostly for starting and stopping channels. A _once
variant would hide the issue a bit, and would probably not be annoying
enough to raise curiosity.
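
To make the trade-off concrete, here is a small userspace sketch of the two logging policies being compared. It is a model only: warn_always() and warn_once() are hypothetical helpers that use fprintf(), standing in for the kernel's dev_warn() and dev_warn_once(), and the counters exist purely so the difference can be observed.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counters exist only so the behaviour of the two variants can be observed. */
static unsigned int always_count;
static unsigned int once_count;

/* Log on every call, as the plain dev_warn() in the patch does. */
static void warn_always(const char *msg)
{
	fprintf(stderr, "warning: %s\n", msg);
	always_count++;
}

/* Log on the first call only, modelling a _once variant. */
static void warn_once(const char *msg)
{
	static bool warned;

	if (warned)
		return;

	warned = true;
	fprintf(stderr, "warning: %s\n", msg);
	once_count++;
}
```

If the bad pointer is seen three times, warn_always() logs three lines and warn_once() logs one. The kernel also provides dev_warn_ratelimited() as a middle ground, which is not modelled here.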

>
> > +     }
> > +
> >       chan = MHI_TRE_GET_CMD_CHID(cmd_pkt);
> >       mhi_chan = &mhi_cntrl->mhi_chan[chan];
> >       write_lock_bh(&mhi_chan->lock);
> >
>
>
> --
> Jeffrey Hugo
> Qualcomm Technologies, Inc. is a member of the
> Code Aurora Forum, a Linux Foundation Collaborative Project.
Jeffrey Hugo March 11, 2021, 2:46 p.m. UTC | #5
On 3/11/2021 1:05 AM, Loic Poulain wrote:
> Hi Jeffrey,
>
> On Wed, 10 Mar 2021 at 17:19, Jeffrey Hugo <jhugo@codeaurora.org> wrote:
>>
>> On 3/10/2021 4:38 AM, Loic Poulain wrote:
>>> Some buggy hardwares (e.g sdx24) may report the current command
>>> ring wp pointer instead of the command completion pointer. It's
>>> obviously wrong, causing completion timeout. We can however deal
>>> with that situation by completing the cmd n-1 element, which is
>>> what the device actually completes.
>>>
>>> Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
>>> ---
>>>    drivers/bus/mhi/core/main.c | 18 ++++++++++++++++++
>>>    1 file changed, 18 insertions(+)
>>>
>>> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
>>> index 16b9640..3e3c520 100644
>>> --- a/drivers/bus/mhi/core/main.c
>>> +++ b/drivers/bus/mhi/core/main.c
>>> @@ -707,6 +707,7 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
>>>    {
>>>        dma_addr_t ptr = MHI_TRE_GET_EV_PTR(tre);
>>>        struct mhi_cmd *cmd_ring = &mhi_cntrl->mhi_cmd[PRIMARY_CMD_RING];
>>> +     struct device *dev = &mhi_cntrl->mhi_dev->dev;
>>>        struct mhi_ring *mhi_ring = &cmd_ring->ring;
>>>        struct mhi_tre *cmd_pkt;
>>>        struct mhi_chan *mhi_chan;
>>> @@ -714,6 +715,23 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
>>>
>>>        cmd_pkt = mhi_to_virtual(mhi_ring, ptr);
>>>
>>> +     if (unlikely(cmd_pkt == mhi_ring->wp)) {
>>> +             /* Some buggy hardwares (e.g sdx24) sometimes report the current
>>> +              * command ring wp pointer instead of the command completion
>>> +              * pointer. It's obviously wrong, causing completion timeout. We
>>> +              * can however deal with that situation by completing the cmd
>>> +              * n-1 element.
>>> +              */
>>> +             void *ring_ptr = (void *)cmd_pkt - mhi_ring->el_size;
>>> +
>>> +             if (ring_ptr < mhi_ring->base)
>>> +                     ring_ptr += mhi_ring->len;
>>> +
>>> +             cmd_pkt = ring_ptr;
>>> +
>>> +             dev_warn(dev, "Bad completion pointer (ptr == ring_wp)\n");
>>
>> Is there value in having this warning every time?  I wonder if a _once
>> version would be better to not flood the kernel log.  Although this is
>> only for commands, which shouldn't be frequent, so maybe that is the
>> implicit rate limiter.
>>
>> What do you think?
>
> As you said it's kind of self rate-limited because of the unfrequent
> command operations, mostly for starting and stopping channels. A _once
> variant would hide the issue a bit, and probably not annoying enough
> to raise curiosity.


That's fair.

I happened to notice just now that the block comment you have above is 
not the proper style.  That looks like the netdev style, but we are not 
in the netdev area.
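
For readers unfamiliar with the distinction, the two multi-line comment layouts look roughly like this. The functions are hypothetical placeholders that exist only to carry the comments; the layouts follow the kernel coding-style document as it read at the time of this thread, which is an assumption about the guidance being referred to.

```c
#include <stddef.h>

/* Netdev layout (net/ and drivers/net/): the text starts on the same
 * line as the opening comment marker.
 */
static size_t netdev_style_example(size_t n)
{
	return n;
}

/*
 * Preferred layout elsewhere in the kernel: the opening line holds only
 * the comment marker, and the text starts on the following line.
 */
static size_t preferred_style_example(size_t n)
{
	return n;
}
```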

I'm curious to see where you and Hemant land on his comment.

-- 
Jeffrey Hugo
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

Patch

diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
index 16b9640..3e3c520 100644
--- a/drivers/bus/mhi/core/main.c
+++ b/drivers/bus/mhi/core/main.c
@@ -707,6 +707,7 @@  static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
 {
 	dma_addr_t ptr = MHI_TRE_GET_EV_PTR(tre);
 	struct mhi_cmd *cmd_ring = &mhi_cntrl->mhi_cmd[PRIMARY_CMD_RING];
+	struct device *dev = &mhi_cntrl->mhi_dev->dev;
 	struct mhi_ring *mhi_ring = &cmd_ring->ring;
 	struct mhi_tre *cmd_pkt;
 	struct mhi_chan *mhi_chan;
@@ -714,6 +715,23 @@  static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
 
 	cmd_pkt = mhi_to_virtual(mhi_ring, ptr);
 
+	if (unlikely(cmd_pkt == mhi_ring->wp)) {
+		/* Some buggy hardwares (e.g sdx24) sometimes report the current
+		 * command ring wp pointer instead of the command completion
+		 * pointer. It's obviously wrong, causing completion timeout. We
+		 * can however deal with that situation by completing the cmd
+		 * n-1 element.
+		 */
+		void *ring_ptr = (void *)cmd_pkt - mhi_ring->el_size;
+
+		if (ring_ptr < mhi_ring->base)
+			ring_ptr += mhi_ring->len;
+
+		cmd_pkt = ring_ptr;
+
+		dev_warn(dev, "Bad completion pointer (ptr == ring_wp)\n");
+	}
+
 	chan = MHI_TRE_GET_CMD_CHID(cmd_pkt);
 	mhi_chan = &mhi_cntrl->mhi_chan[chan];
 	write_lock_bh(&mhi_chan->lock);
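
The "n-1 with wrap-around" step in the hunk above can be tried outside the kernel with a minimal model. This is a sketch under assumptions: struct ring_model and ring_prev_element() are hypothetical names that only mirror the base, el_size and len fields used by the patch. Unlike the patch, the sketch tests for the first element before subtracting, so it never forms a pointer below the start of the buffer, which plain ISO C does not permit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the ring bookkeeping (not struct mhi_ring). */
struct ring_model {
	uint8_t *base;	/* first byte of the ring buffer */
	size_t el_size;	/* size of one ring element, in bytes */
	size_t len;	/* total ring length in bytes, a multiple of el_size */
};

/*
 * Return the element immediately before 'el'. The element before the
 * first one is the last element of the ring (wrap-around).
 */
static void *ring_prev_element(const struct ring_model *ring, void *el)
{
	uint8_t *p = el;

	if (p == ring->base)
		return ring->base + ring->len - ring->el_size;

	return p - ring->el_size;
}
```

For a 64-byte ring of 16-byte elements, the element before offset 32 is at offset 16, and the element before offset 0 wraps to offset 48.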