diff mbox series

mmc: Try power cycling card if command request times out

Message ID 20210216224252.22187-1-marten.lindahl@axis.com
State New
Headers show
Series mmc: Try power cycling card if command request times out | expand

Commit Message

Mårten Lindahl Feb. 16, 2021, 10:42 p.m. UTC
Sometimes SD cards that has been run for a long time enters a state
where it cannot by itself be recovered, but needs a power cycle to be
operational again. Card status analysis has indicated that the card can
end up in a state where all external commands are ignored by the card
since it is halted by data timeouts.

If the card has been heavily used for a long time it can be weared out,
and should typically be replaced. But on some tests, it shows that the
card can still be functional after a power cycle, but as it requires an
operator to do it, the card can remain in a non-operational state for a
long time until the problem has been observed by the operator.

This patch adds function to power cycle the card in case it does not
respond to a command, and then resend the command if the power cycle
was successful. This procedure will be tested 1 time before giving up,
and resuming host operation as normal.

Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>
---
Please note: This might not be the way we want to handle these cases,
but at least it lets us start the discussion. In which cases should the
mmc framework deal with error messages like ETIMEDOUT, and in which
cases should it be handled by userspace?
The mmc framework tries to recover a failed block request
(mmc_blk_mq_rw_recovery) which may end up in a HW reset of the card.
Would it be an idea to act in a similar way when an ioctl times out?

 drivers/mmc/core/block.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Comments

Ulf Hansson March 1, 2021, 8:50 a.m. UTC | #1
+ Adrian

On Tue, 16 Feb 2021 at 23:43, Mårten Lindahl <marten.lindahl@axis.com> wrote:
>

> Sometimes SD cards that has been run for a long time enters a state

> where it cannot by itself be recovered, but needs a power cycle to be

> operational again. Card status analysis has indicated that the card can

> end up in a state where all external commands are ignored by the card

> since it is halted by data timeouts.

>

> If the card has been heavily used for a long time it can be weared out,

> and should typically be replaced. But on some tests, it shows that the

> card can still be functional after a power cycle, but as it requires an

> operator to do it, the card can remain in a non-operational state for a

> long time until the problem has been observed by the operator.

>

> This patch adds function to power cycle the card in case it does not

> respond to a command, and then resend the command if the power cycle

> was successful. This procedure will be tested 1 time before giving up,

> and resuming host operation as normal.


I assume the context above is all about the ioctl interface?

So, when the card enters this non functional state, have you tried
just reading a block through the regular I/O interface. Does it
trigger a power cycle of the card - and then makes it functional
again?

>

> Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>

> ---

> Please note: This might not be the way we want to handle these cases,

> but at least it lets us start the discussion. In which cases should the

> mmc framework deal with error messages like ETIMEDOUT, and in which

> cases should it be handled by userspace?

> The mmc framework tries to recover a failed block request

> (mmc_blk_mq_rw_recovery) which may end up in a HW reset of the card.

> Would it be an idea to act in a similar way when an ioctl times out?


Maybe, it's a good idea to allow the similar reset for ioctls as we do
for regular I/O requests. My concern with this though, is that we
might allow user space to trigger a HW resets a bit too easily - and
that could damage the card.

Did you consider this?

>

>  drivers/mmc/core/block.c | 20 ++++++++++++++++++--

>  1 file changed, 18 insertions(+), 2 deletions(-)

>

> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c

> index 42e27a298218..d007b2af64d6 100644

> --- a/drivers/mmc/core/block.c

> +++ b/drivers/mmc/core/block.c

> @@ -976,6 +976,7 @@ static inline void mmc_blk_reset_success(struct mmc_blk_data *md, int type)

>   */

>  static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)

>  {

> +       int type = rq_data_dir(req) == READ ? MMC_BLK_READ : MMC_BLK_WRITE;

>         struct mmc_queue_req *mq_rq;

>         struct mmc_card *card = mq->card;

>         struct mmc_blk_data *md = mq->blkdata;

> @@ -983,7 +984,7 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)

>         bool rpmb_ioctl;

>         u8 **ext_csd;

>         u32 status;

> -       int ret;

> +       int ret, retry = 1;

>         int i;

>

>         mq_rq = req_to_mmc_queue_req(req);

> @@ -994,9 +995,24 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)

>         case MMC_DRV_OP_IOCTL_RPMB:

>                 idata = mq_rq->drv_op_data;

>                 for (i = 0, ret = 0; i < mq_rq->ioc_count; i++) {

> +cmd_do:

>                         ret = __mmc_blk_ioctl_cmd(card, md, idata[i]);

> -                       if (ret)

> +                       if (ret == -ETIMEDOUT) {

> +                               dev_warn(mmc_dev(card->host),

> +                                        "error %d sending command\n", ret);

> +cmd_reset:

> +                               mmc_blk_reset_success(md, type);

> +                               if (retry--) {

> +                                       dev_warn(mmc_dev(card->host),

> +                                                "power cycling card\n");

> +                                       if (mmc_blk_reset

> +                                           (md, card->host, type))

> +                                               goto cmd_reset;

> +                                       mmc_blk_reset_success(md, type);

> +                                       goto cmd_do;

> +                               }

>                                 break;

> +                       }

>                 }

>                 /* Always switch back to main area after RPMB access */

>                 if (rpmb_ioctl)

> --

> 2.11.0

>


Kind regards
Uffe
Adrian Hunter March 1, 2021, 10:40 a.m. UTC | #2
On 1/03/21 10:50 am, Ulf Hansson wrote:
> + Adrian

> 

> On Tue, 16 Feb 2021 at 23:43, Mårten Lindahl <marten.lindahl@axis.com> wrote:

>>

>> Sometimes SD cards that has been run for a long time enters a state

>> where it cannot by itself be recovered, but needs a power cycle to be

>> operational again. Card status analysis has indicated that the card can

>> end up in a state where all external commands are ignored by the card

>> since it is halted by data timeouts.

>>

>> If the card has been heavily used for a long time it can be weared out,

>> and should typically be replaced. But on some tests, it shows that the

>> card can still be functional after a power cycle, but as it requires an

>> operator to do it, the card can remain in a non-operational state for a

>> long time until the problem has been observed by the operator.

>>

>> This patch adds function to power cycle the card in case it does not

>> respond to a command, and then resend the command if the power cycle

>> was successful. This procedure will be tested 1 time before giving up,

>> and resuming host operation as normal.

> 

> I assume the context above is all about the ioctl interface?

> 

> So, when the card enters this non functional state, have you tried

> just reading a block through the regular I/O interface. Does it

> trigger a power cycle of the card - and then makes it functional

> again?

> 

>>

>> Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>

>> ---

>> Please note: This might not be the way we want to handle these cases,

>> but at least it lets us start the discussion. In which cases should the

>> mmc framework deal with error messages like ETIMEDOUT, and in which

>> cases should it be handled by userspace?

>> The mmc framework tries to recover a failed block request

>> (mmc_blk_mq_rw_recovery) which may end up in a HW reset of the card.

>> Would it be an idea to act in a similar way when an ioctl times out?

> 

> Maybe, it's a good idea to allow the similar reset for ioctls as we do

> for regular I/O requests. My concern with this though, is that we

> might allow user space to trigger a HW resets a bit too easily - and

> that could damage the card.

> 

> Did you consider this?

> 

>>

>>  drivers/mmc/core/block.c | 20 ++++++++++++++++++--

>>  1 file changed, 18 insertions(+), 2 deletions(-)

>>

>> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c

>> index 42e27a298218..d007b2af64d6 100644

>> --- a/drivers/mmc/core/block.c

>> +++ b/drivers/mmc/core/block.c

>> @@ -976,6 +976,7 @@ static inline void mmc_blk_reset_success(struct mmc_blk_data *md, int type)

>>   */

>>  static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)

>>  {

>> +       int type = rq_data_dir(req) == READ ? MMC_BLK_READ : MMC_BLK_WRITE;

>>         struct mmc_queue_req *mq_rq;

>>         struct mmc_card *card = mq->card;

>>         struct mmc_blk_data *md = mq->blkdata;

>> @@ -983,7 +984,7 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)

>>         bool rpmb_ioctl;

>>         u8 **ext_csd;

>>         u32 status;

>> -       int ret;

>> +       int ret, retry = 1;

>>         int i;

>>

>>         mq_rq = req_to_mmc_queue_req(req);

>> @@ -994,9 +995,24 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)

>>         case MMC_DRV_OP_IOCTL_RPMB:


SD cards do not have RPMB.  Did you mean eMMC?


>>                 idata = mq_rq->drv_op_data;

>>                 for (i = 0, ret = 0; i < mq_rq->ioc_count; i++) {

>> +cmd_do:

>>                         ret = __mmc_blk_ioctl_cmd(card, md, idata[i]);

>> -                       if (ret)

>> +                       if (ret == -ETIMEDOUT) {

>> +                               dev_warn(mmc_dev(card->host),

>> +                                        "error %d sending command\n", ret);

>> +cmd_reset:

>> +                               mmc_blk_reset_success(md, type);


mmc_blk_reset_success() is called upon success, not failure.  The reset will
not be attempted twice in a row, for a given type, without a "success" in
between.

>> +                               if (retry--) {

>> +                                       dev_warn(mmc_dev(card->host),

>> +                                                "power cycling card\n");

>> +                                       if (mmc_blk_reset

>> +                                           (md, card->host, type))

>> +                                               goto cmd_reset;

>> +                                       mmc_blk_reset_success(md, type);

>> +                                       goto cmd_do;

>> +                               }

>>                                 break;

>> +                       }

>>                 }

>>                 /* Always switch back to main area after RPMB access */

>>                 if (rpmb_ioctl)

>> --

>> 2.11.0

>>

> 

> Kind regards

> Uffe

>
Marten Lindahl March 1, 2021, 9:59 p.m. UTC | #3
Hi Ulf!

Thank you for your comments!

On Mon, Mar 01, 2021 at 09:50:56AM +0100, Ulf Hansson wrote:
> + Adrian

> 

> On Tue, 16 Feb 2021 at 23:43, Mårten Lindahl <marten.lindahl@axis.com> wrote:

> >

> > Sometimes SD cards that has been run for a long time enters a state

> > where it cannot by itself be recovered, but needs a power cycle to be

> > operational again. Card status analysis has indicated that the card can

> > end up in a state where all external commands are ignored by the card

> > since it is halted by data timeouts.

> >

> > If the card has been heavily used for a long time it can be weared out,

> > and should typically be replaced. But on some tests, it shows that the

> > card can still be functional after a power cycle, but as it requires an

> > operator to do it, the card can remain in a non-operational state for a

> > long time until the problem has been observed by the operator.

> >

> > This patch adds function to power cycle the card in case it does not

> > respond to a command, and then resend the command if the power cycle

> > was successful. This procedure will be tested 1 time before giving up,

> > and resuming host operation as normal.

> 

> I assume the context above is all about the ioctl interface?

> 


Yes, that's correct. The problem we have seen is triggered by ioctls.

> So, when the card enters this non functional state, have you tried

> just reading a block through the regular I/O interface. Does it

> trigger a power cycle of the card - and then makes it functional

> again?

> 


Yes, we have tried that, and it does trigger a power cycle, making the card
operational again. But as it requires an operator to trigger it, I thought
it might be something that could be automated here. At least once.

> >

> > Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>

> > ---

> > Please note: This might not be the way we want to handle these cases,

> > but at least it lets us start the discussion. In which cases should the

> > mmc framework deal with error messages like ETIMEDOUT, and in which

> > cases should it be handled by userspace?

> > The mmc framework tries to recover a failed block request

> > (mmc_blk_mq_rw_recovery) which may end up in a HW reset of the card.

> > Would it be an idea to act in a similar way when an ioctl times out?

> 

> Maybe, it's a good idea to allow the similar reset for ioctls as we do

> for regular I/O requests. My concern with this though, is that we

> might allow user space to trigger a HW resets a bit too easily - and

> that could damage the card.

> 

> Did you consider this?

> 


Yes, that is a valid point, and that is why the power cycle is only tried
once. But the conditon for this reset is a -ETIMEDOUT, and this is the part of
this patch where I am myself not sure of if it is enough to check for. Would
this be an error that you could expect to happen with ioctl requests in other
situations also, but not necessarily cause by a stalled card?

Kind regards
Mårten

> >

> >  drivers/mmc/core/block.c | 20 ++++++++++++++++++--

> >  1 file changed, 18 insertions(+), 2 deletions(-)

> >

> > diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c

> > index 42e27a298218..d007b2af64d6 100644

> > --- a/drivers/mmc/core/block.c

> > +++ b/drivers/mmc/core/block.c

> > @@ -976,6 +976,7 @@ static inline void mmc_blk_reset_success(struct mmc_blk_data *md, int type)

> >   */

> >  static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)

> >  {

> > +       int type = rq_data_dir(req) == READ ? MMC_BLK_READ : MMC_BLK_WRITE;

> >         struct mmc_queue_req *mq_rq;

> >         struct mmc_card *card = mq->card;

> >         struct mmc_blk_data *md = mq->blkdata;

> > @@ -983,7 +984,7 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)

> >         bool rpmb_ioctl;

> >         u8 **ext_csd;

> >         u32 status;

> > -       int ret;

> > +       int ret, retry = 1;

> >         int i;

> >

> >         mq_rq = req_to_mmc_queue_req(req);

> > @@ -994,9 +995,24 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)

> >         case MMC_DRV_OP_IOCTL_RPMB:

> >                 idata = mq_rq->drv_op_data;

> >                 for (i = 0, ret = 0; i < mq_rq->ioc_count; i++) {

> > +cmd_do:

> >                         ret = __mmc_blk_ioctl_cmd(card, md, idata[i]);

> > -                       if (ret)

> > +                       if (ret == -ETIMEDOUT) {

> > +                               dev_warn(mmc_dev(card->host),

> > +                                        "error %d sending command\n", ret);

> > +cmd_reset:

> > +                               mmc_blk_reset_success(md, type);

> > +                               if (retry--) {

> > +                                       dev_warn(mmc_dev(card->host),

> > +                                                "power cycling card\n");

> > +                                       if (mmc_blk_reset

> > +                                           (md, card->host, type))

> > +                                               goto cmd_reset;

> > +                                       mmc_blk_reset_success(md, type);

> > +                                       goto cmd_do;

> > +                               }

> >                                 break;

> > +                       }

> >                 }

> >                 /* Always switch back to main area after RPMB access */

> >                 if (rpmb_ioctl)

> > --

> > 2.11.0

> >

> 

> Kind regards

> Uffe
Marten Lindahl March 1, 2021, 10:30 p.m. UTC | #4
Hi Adrian!

Thank you for your comments!

On Mon, Mar 01, 2021 at 11:40:03AM +0100, Adrian Hunter wrote:
> On 1/03/21 10:50 am, Ulf Hansson wrote:
> > + Adrian
> > 
> > On Tue, 16 Feb 2021 at 23:43, Mårten Lindahl <marten.lindahl@axis.com> wrote:
> >>
> >> Sometimes SD cards that has been run for a long time enters a state
> >> where it cannot by itself be recovered, but needs a power cycle to be
> >> operational again. Card status analysis has indicated that the card can
> >> end up in a state where all external commands are ignored by the card
> >> since it is halted by data timeouts.
> >>
> >> If the card has been heavily used for a long time it can be weared out,
> >> and should typically be replaced. But on some tests, it shows that the
> >> card can still be functional after a power cycle, but as it requires an
> >> operator to do it, the card can remain in a non-operational state for a
> >> long time until the problem has been observed by the operator.
> >>
> >> This patch adds function to power cycle the card in case it does not
> >> respond to a command, and then resend the command if the power cycle
> >> was successful. This procedure will be tested 1 time before giving up,
> >> and resuming host operation as normal.
> > 
> > I assume the context above is all about the ioctl interface?
> > 
> > So, when the card enters this non functional state, have you tried
> > just reading a block through the regular I/O interface. Does it
> > trigger a power cycle of the card - and then makes it functional
> > again?
> > 
> >>
> >> Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>
> >> ---
> >> Please note: This might not be the way we want to handle these cases,
> >> but at least it lets us start the discussion. In which cases should the
> >> mmc framework deal with error messages like ETIMEDOUT, and in which
> >> cases should it be handled by userspace?
> >> The mmc framework tries to recover a failed block request
> >> (mmc_blk_mq_rw_recovery) which may end up in a HW reset of the card.
> >> Would it be an idea to act in a similar way when an ioctl times out?
> > 
> > Maybe, it's a good idea to allow the similar reset for ioctls as we do
> > for regular I/O requests. My concern with this though, is that we
> > might allow user space to trigger a HW resets a bit too easily - and
> > that could damage the card.
> > 
> > Did you consider this?
> > 
> >>
> >>  drivers/mmc/core/block.c | 20 ++++++++++++++++++--
> >>  1 file changed, 18 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> >> index 42e27a298218..d007b2af64d6 100644
> >> --- a/drivers/mmc/core/block.c
> >> +++ b/drivers/mmc/core/block.c
> >> @@ -976,6 +976,7 @@ static inline void mmc_blk_reset_success(struct mmc_blk_data *md, int type)
> >>   */
> >>  static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)
> >>  {
> >> +       int type = rq_data_dir(req) == READ ? MMC_BLK_READ : MMC_BLK_WRITE;
> >>         struct mmc_queue_req *mq_rq;
> >>         struct mmc_card *card = mq->card;
> >>         struct mmc_blk_data *md = mq->blkdata;
> >> @@ -983,7 +984,7 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)
> >>         bool rpmb_ioctl;
> >>         u8 **ext_csd;
> >>         u32 status;
> >> -       int ret;
> >> +       int ret, retry = 1;
> >>         int i;
> >>
> >>         mq_rq = req_to_mmc_queue_req(req);
> >> @@ -994,9 +995,24 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)
> >>         case MMC_DRV_OP_IOCTL_RPMB:
> 
> SD cards do not have RPMB.  Did you mean eMMC?
> 

No, you are right. This action should be excluded from 'case MMC_DRV_OP_IOCTL_RPMB'.

> 
> >>                 idata = mq_rq->drv_op_data;
> >>                 for (i = 0, ret = 0; i < mq_rq->ioc_count; i++) {
> >> +cmd_do:
> >>                         ret = __mmc_blk_ioctl_cmd(card, md, idata[i]);
> >> -                       if (ret)
> >> +                       if (ret == -ETIMEDOUT) {
> >> +                               dev_warn(mmc_dev(card->host),
> >> +                                        "error %d sending command\n", ret);
> >> +cmd_reset:
> >> +                               mmc_blk_reset_success(md, type);
> 
> mmc_blk_reset_success() is called upon success, not failure.  The reset will
> not be attempted twice in a row, for a given type, without a "success" in
> between.
> 

Ok, yes I see. This line and the cmd_reset label should be removed, and if
mmc_blk_reset fails we should break, not retry.

Kind regards
Mårten

> >> +                               if (retry--) {
> >> +                                       dev_warn(mmc_dev(card->host),
> >> +                                                "power cycling card\n");
> >> +                                       if (mmc_blk_reset
> >> +                                           (md, card->host, type))
> >> +                                               goto cmd_reset;
> >> +                                       mmc_blk_reset_success(md, type);
> >> +                                       goto cmd_do;
> >> +                               }
> >>                                 break;
> >> +                       }
> >>                 }
> >>                 /* Always switch back to main area after RPMB access */
> >>                 if (rpmb_ioctl)
> >> --
> >> 2.11.0
> >>
> > 
> > Kind regards
> > Uffe
> > 
>
Ulf Hansson March 2, 2021, 8:45 a.m. UTC | #5
On Mon, 1 Mar 2021 at 22:59, Marten Lindahl <martenli@axis.com> wrote:
>
> Hi Ulf!
>
> Thank you for your comments!
>
> On Mon, Mar 01, 2021 at 09:50:56AM +0100, Ulf Hansson wrote:
> > + Adrian
> >
> > On Tue, 16 Feb 2021 at 23:43, Mårten Lindahl <marten.lindahl@axis.com> wrote:
> > >
> > > Sometimes SD cards that has been run for a long time enters a state
> > > where it cannot by itself be recovered, but needs a power cycle to be
> > > operational again. Card status analysis has indicated that the card can
> > > end up in a state where all external commands are ignored by the card
> > > since it is halted by data timeouts.
> > >
> > > If the card has been heavily used for a long time it can be weared out,
> > > and should typically be replaced. But on some tests, it shows that the
> > > card can still be functional after a power cycle, but as it requires an
> > > operator to do it, the card can remain in a non-operational state for a
> > > long time until the problem has been observed by the operator.
> > >
> > > This patch adds function to power cycle the card in case it does not
> > > respond to a command, and then resend the command if the power cycle
> > > was successful. This procedure will be tested 1 time before giving up,
> > > and resuming host operation as normal.
> >
> > I assume the context above is all about the ioctl interface?
> >
>
> Yes, that's correct. The problem we have seen is triggered by ioctls.
>
> > So, when the card enters this non functional state, have you tried
> > just reading a block through the regular I/O interface. Does it
> > trigger a power cycle of the card - and then makes it functional
> > again?
> >
>
> Yes, we have tried that, and it does trigger a power cycle, making the card
> operational again. But as it requires an operator to trigger it, I thought
> it might be something that could be automated here. At least once.

Not sure what you mean by operator here? In the end it's a userspace
program running and I assume it can deal with error paths. :-)

In any case, I understand your point.

>
> > >
> > > Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>
> > > ---
> > > Please note: This might not be the way we want to handle these cases,
> > > but at least it lets us start the discussion. In which cases should the
> > > mmc framework deal with error messages like ETIMEDOUT, and in which
> > > cases should it be handled by userspace?
> > > The mmc framework tries to recover a failed block request
> > > (mmc_blk_mq_rw_recovery) which may end up in a HW reset of the card.
> > > Would it be an idea to act in a similar way when an ioctl times out?
> >
> > Maybe, it's a good idea to allow the similar reset for ioctls as we do
> > for regular I/O requests. My concern with this though, is that we
> > might allow user space to trigger a HW resets a bit too easily - and
> > that could damage the card.
> >
> > Did you consider this?
> >
>
> Yes, that is a valid point, and that is why the power cycle is only tried
> once. But the conditon for this reset is a -ETIMEDOUT, and this is the part of
> this patch where I am myself not sure of if it is enough to check for. Would
> this be an error that you could expect to happen with ioctl requests in other
> situations also, but not necessarily cause by a stalled card?

Exactly.

Many different commands can get pushed down to the card through the
mmc ioctl interface. It's difficult to know what error path we should
pick, other than reporting and propagating the error codes.

[...]

Kind regards
Uffe
Marten Lindahl March 4, 2021, 1:48 p.m. UTC | #6
Hi Ulf! My apologies for the delay.

On Tue, Mar 02, 2021 at 09:45:02AM +0100, Ulf Hansson wrote:
> On Mon, 1 Mar 2021 at 22:59, Marten Lindahl <martenli@axis.com> wrote:

> >

> > Hi Ulf!

> >

> > Thank you for your comments!

> >

> > On Mon, Mar 01, 2021 at 09:50:56AM +0100, Ulf Hansson wrote:

> > > + Adrian

> > >

> > > On Tue, 16 Feb 2021 at 23:43, Mårten Lindahl <marten.lindahl@axis.com> wrote:

> > > >

> > > > Sometimes SD cards that has been run for a long time enters a state

> > > > where it cannot by itself be recovered, but needs a power cycle to be

> > > > operational again. Card status analysis has indicated that the card can

> > > > end up in a state where all external commands are ignored by the card

> > > > since it is halted by data timeouts.

> > > >

> > > > If the card has been heavily used for a long time it can be weared out,

> > > > and should typically be replaced. But on some tests, it shows that the

> > > > card can still be functional after a power cycle, but as it requires an

> > > > operator to do it, the card can remain in a non-operational state for a

> > > > long time until the problem has been observed by the operator.

> > > >

> > > > This patch adds function to power cycle the card in case it does not

> > > > respond to a command, and then resend the command if the power cycle

> > > > was successful. This procedure will be tested 1 time before giving up,

> > > > and resuming host operation as normal.

> > >

> > > I assume the context above is all about the ioctl interface?

> > >

> >

> > Yes, that's correct. The problem we have seen is triggered by ioctls.

> >

> > > So, when the card enters this non functional state, have you tried

> > > just reading a block through the regular I/O interface. Does it

> > > trigger a power cycle of the card - and then makes it functional

> > > again?

> > >

> >

> > Yes, we have tried that, and it does trigger a power cycle, making the card

> > operational again. But as it requires an operator to trigger it, I thought

> > it might be something that could be automated here. At least once.

> 

> Not sure what you mean by operator here? In the end it's a userspace

> program running and I assume it can deal with error paths. :-)

> 

> In any case, I understand your point.

> 


Yes, we have a userspace program. So if the userspace program will try to
restore the card in a situation such as the one we are trying to solve
here, how shall it perform it? Is it expected that a ioctl CMD0 request
should be enough, or is there any other support for a userspace program to
reset the card?

If it falls on a ioctl command to reset the card, how do we handle the case
where the ioctl times out anyway? Or is the only way for a userspace program
to restore the card, to make a block transfer that fails?

Kind regards
Mårten

> >

> > > >

> > > > Signed-off-by: Mårten Lindahl <marten.lindahl@axis.com>

> > > > ---

> > > > Please note: This might not be the way we want to handle these cases,

> > > > but at least it lets us start the discussion. In which cases should the

> > > > mmc framework deal with error messages like ETIMEDOUT, and in which

> > > > cases should it be handled by userspace?

> > > > The mmc framework tries to recover a failed block request

> > > > (mmc_blk_mq_rw_recovery) which may end up in a HW reset of the card.

> > > > Would it be an idea to act in a similar way when an ioctl times out?

> > >

> > > Maybe, it's a good idea to allow the similar reset for ioctls as we do

> > > for regular I/O requests. My concern with this though, is that we

> > > might allow user space to trigger a HW resets a bit too easily - and

> > > that could damage the card.

> > >

> > > Did you consider this?

> > >

> >

> > Yes, that is a valid point, and that is why the power cycle is only tried

> > once. But the conditon for this reset is a -ETIMEDOUT, and this is the part of

> > this patch where I am myself not sure of if it is enough to check for. Would

> > this be an error that you could expect to happen with ioctl requests in other

> > situations also, but not necessarily cause by a stalled card?

> 

> Exactly.

> 

> Many different commands can get pushed down to the card through the

> mmc ioctl interface. It's difficult to know what error path we should

> pick, other than reporting and propagating the error codes.

> 

> [...]

> 

> Kind regards

> Uffe
Ulf Hansson March 4, 2021, 2:06 p.m. UTC | #7
On Thu, 4 Mar 2021 at 14:48, Marten Lindahl <martenli@axis.com> wrote:
>

> Hi Ulf! My apologies for the delay.

>

> On Tue, Mar 02, 2021 at 09:45:02AM +0100, Ulf Hansson wrote:

> > On Mon, 1 Mar 2021 at 22:59, Marten Lindahl <martenli@axis.com> wrote:

> > >

> > > Hi Ulf!

> > >

> > > Thank you for your comments!

> > >

> > > On Mon, Mar 01, 2021 at 09:50:56AM +0100, Ulf Hansson wrote:

> > > > + Adrian

> > > >

> > > > On Tue, 16 Feb 2021 at 23:43, Mårten Lindahl <marten.lindahl@axis.com> wrote:

> > > > >

> > > > > Sometimes SD cards that has been run for a long time enters a state

> > > > > where it cannot by itself be recovered, but needs a power cycle to be

> > > > > operational again. Card status analysis has indicated that the card can

> > > > > end up in a state where all external commands are ignored by the card

> > > > > since it is halted by data timeouts.

> > > > >

> > > > > If the card has been heavily used for a long time it can be weared out,

> > > > > and should typically be replaced. But on some tests, it shows that the

> > > > > card can still be functional after a power cycle, but as it requires an

> > > > > operator to do it, the card can remain in a non-operational state for a

> > > > > long time until the problem has been observed by the operator.

> > > > >

> > > > > This patch adds function to power cycle the card in case it does not

> > > > > respond to a command, and then resend the command if the power cycle

> > > > > was successful. This procedure will be tested 1 time before giving up,

> > > > > and resuming host operation as normal.

> > > >

> > > > I assume the context above is all about the ioctl interface?

> > > >

> > >

> > > Yes, that's correct. The problem we have seen is triggered by ioctls.

> > >

> > > > So, when the card enters this non functional state, have you tried

> > > > just reading a block through the regular I/O interface. Does it

> > > > trigger a power cycle of the card - and then makes it functional

> > > > again?

> > > >

> > >

> > > Yes, we have tried that, and it does trigger a power cycle, making the card

> > > operational again. But as it requires an operator to trigger it, I thought

> > > it might be something that could be automated here. At least once.

> >

> > Not sure what you mean by operator here? In the end it's a userspace

> > program running and I assume it can deal with error paths. :-)

> >

> > In any case, I understand your point.

> >

>

> Yes, we have a userspace program. So if the userspace program will try to

> restore the card in a situation such as the one we are trying to solve

> here, how shall it perform it? Is it expected that a ioctl CMD0 request

> should be enough, or is there any other support for a userspace program to

> reset the card?


Correct, there is no way for userspace to reset cards through an ioctl.

>

> If it falls on a ioctl command to reset the card, how do we handle the case

> where the ioctl times out anyway? Or is the only way for a userspace program

> to restore the card, to make a block transfer that fails?


Yes, that is what I was thinking. According to the use case you have
described, this should be possible for you to implement as a part of
your userspace program, no?

[...]

Kind regards
Uffe
Marten Lindahl March 4, 2021, 2:59 p.m. UTC | #8
On Thu, Mar 04, 2021 at 03:06:54PM +0100, Ulf Hansson wrote:
> On Thu, 4 Mar 2021 at 14:48, Marten Lindahl <martenli@axis.com> wrote:

> >

> > Hi Ulf! My apologies for the delay.

> >

> > On Tue, Mar 02, 2021 at 09:45:02AM +0100, Ulf Hansson wrote:

> > > On Mon, 1 Mar 2021 at 22:59, Marten Lindahl <martenli@axis.com> wrote:

> > > >

> > > > Hi Ulf!

> > > >

> > > > Thank you for your comments!

> > > >

> > > > On Mon, Mar 01, 2021 at 09:50:56AM +0100, Ulf Hansson wrote:

> > > > > + Adrian

> > > > >

> > > > > On Tue, 16 Feb 2021 at 23:43, Mårten Lindahl <marten.lindahl@axis.com> wrote:

> > > > > >

> > > > > > Sometimes SD cards that has been run for a long time enters a state

> > > > > > where it cannot by itself be recovered, but needs a power cycle to be

> > > > > > operational again. Card status analysis has indicated that the card can

> > > > > > end up in a state where all external commands are ignored by the card

> > > > > > since it is halted by data timeouts.

> > > > > >

> > > > > > If the card has been heavily used for a long time it can be weared out,

> > > > > > and should typically be replaced. But on some tests, it shows that the

> > > > > > card can still be functional after a power cycle, but as it requires an

> > > > > > operator to do it, the card can remain in a non-operational state for a

> > > > > > long time until the problem has been observed by the operator.

> > > > > >

> > > > > > This patch adds function to power cycle the card in case it does not

> > > > > > respond to a command, and then resend the command if the power cycle

> > > > > > was successful. This procedure will be tested 1 time before giving up,

> > > > > > and resuming host operation as normal.

> > > > >

> > > > > I assume the context above is all about the ioctl interface?

> > > > >

> > > >

> > > > Yes, that's correct. The problem we have seen is triggered by ioctls.

> > > >

> > > > > So, when the card enters this non functional state, have you tried

> > > > > just reading a block through the regular I/O interface. Does it

> > > > > trigger a power cycle of the card - and then makes it functional

> > > > > again?

> > > > >

> > > >

> > > > Yes, we have tried that, and it does trigger a power cycle, making the card

> > > > operational again. But as it requires an operator to trigger it, I thought

> > > > it might be something that could be automated here. At least once.

> > >

> > > Not sure what you mean by operator here? In the end it's a userspace

> > > program running and I assume it can deal with error paths. :-)

> > >

> > > In any case, I understand your point.

> > >

> >

> > Yes, we have a userspace program. So if the userspace program will try to

> > restore the card in a situation such as the one we are trying to solve

> > here, how shall it perform it? Is it expected that a ioctl CMD0 request

> > should be enough, or is there any other support for a userspace program to

> > reset the card?

> 

> Correct, there is no way for userspace to reset cards through an ioctl.

> 

> >

> > If it falls on a ioctl command to reset the card, how do we handle the case

> > where the ioctl times out anyway? Or is the only way for a userspace program

> > to restore the card, to make a block transfer that fails?

> 

> Yes, that is what I was thinking. According to the use case you have

> described, this should be possible for you to implement as a part of

> your userspace program, no?


Ok, I will discuss that with the people maintaining the userspace program :)

But would it be of interest to review a patch introducing a more clean card
reset request, without block transfers?

Kind regards
Mårten

> 

> [...]

> 

> Kind regards

> Uffe
Ulf Hansson March 4, 2021, 6:02 p.m. UTC | #9
On Thu, 4 Mar 2021 at 15:59, Marten Lindahl <martenli@axis.com> wrote:
>

> On Thu, Mar 04, 2021 at 03:06:54PM +0100, Ulf Hansson wrote:

> > On Thu, 4 Mar 2021 at 14:48, Marten Lindahl <martenli@axis.com> wrote:

> > >

> > > Hi Ulf! My apologies for the delay.

> > >

> > > On Tue, Mar 02, 2021 at 09:45:02AM +0100, Ulf Hansson wrote:

> > > > On Mon, 1 Mar 2021 at 22:59, Marten Lindahl <martenli@axis.com> wrote:

> > > > >

> > > > > Hi Ulf!

> > > > >

> > > > > Thank you for your comments!

> > > > >

> > > > > On Mon, Mar 01, 2021 at 09:50:56AM +0100, Ulf Hansson wrote:

> > > > > > + Adrian

> > > > > >

> > > > > > On Tue, 16 Feb 2021 at 23:43, Mårten Lindahl <marten.lindahl@axis.com> wrote:

> > > > > > >

> > > > > > > Sometimes SD cards that has been run for a long time enters a state

> > > > > > > where it cannot by itself be recovered, but needs a power cycle to be

> > > > > > > operational again. Card status analysis has indicated that the card can

> > > > > > > end up in a state where all external commands are ignored by the card

> > > > > > > since it is halted by data timeouts.

> > > > > > >

> > > > > > > If the card has been heavily used for a long time it can be weared out,

> > > > > > > and should typically be replaced. But on some tests, it shows that the

> > > > > > > card can still be functional after a power cycle, but as it requires an

> > > > > > > operator to do it, the card can remain in a non-operational state for a

> > > > > > > long time until the problem has been observed by the operator.

> > > > > > >

> > > > > > > This patch adds function to power cycle the card in case it does not

> > > > > > > respond to a command, and then resend the command if the power cycle

> > > > > > > was successful. This procedure will be tested 1 time before giving up,

> > > > > > > and resuming host operation as normal.

> > > > > >

> > > > > > I assume the context above is all about the ioctl interface?

> > > > > >

> > > > >

> > > > > Yes, that's correct. The problem we have seen is triggered by ioctls.

> > > > >

> > > > > > So, when the card enters this non functional state, have you tried

> > > > > > just reading a block through the regular I/O interface. Does it

> > > > > > trigger a power cycle of the card - and then makes it functional

> > > > > > again?

> > > > > >

> > > > >

> > > > > Yes, we have tried that, and it does trigger a power cycle, making the card

> > > > > operational again. But as it requires an operator to trigger it, I thought

> > > > > it might be something that could be automated here. At least once.

> > > >

> > > > Not sure what you mean by operator here? In the end it's a userspace

> > > > program running and I assume it can deal with error paths. :-)

> > > >

> > > > In any case, I understand your point.

> > > >

> > >

> > > Yes, we have a userspace program. So if the userspace program will try to

> > > restore the card in a situation such as the one we are trying to solve

> > > here, how shall it perform it? Is it expected that a ioctl CMD0 request

> > > should be enough, or is there any other support for a userspace program to

> > > reset the card?

> >

> > Correct, there is no way for userspace to reset cards through an ioctl.

> >

> > >

> > > If it falls on a ioctl command to reset the card, how do we handle the case

> > > where the ioctl times out anyway? Or is the only way for a userspace program

> > > to restore the card, to make a block transfer that fails?

> >

> > Yes, that is what I was thinking. According to the use case you have

> > described, this should be possible for you to implement as a part of

> > your userspace program, no?

>

> Ok, I will discuss that with the people maintaining the userspace program :)

>

> But would it be of interest to review a patch introducing a more clean card

> reset request, without block transfers?


Well, if you can solve it with block transfers that's the preferred
option, in my opinion.

As I stated earlier, my main issue with the HW reset through the ioctl
interface, is that we don't know what combination of
request/command/response we should be doing a reset for.

Kind regards
Uffe
diff mbox series

Patch

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 42e27a298218..d007b2af64d6 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -976,6 +976,7 @@  static inline void mmc_blk_reset_success(struct mmc_blk_data *md, int type)
  */
 static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)
 {
+	int type = rq_data_dir(req) == READ ? MMC_BLK_READ : MMC_BLK_WRITE;
 	struct mmc_queue_req *mq_rq;
 	struct mmc_card *card = mq->card;
 	struct mmc_blk_data *md = mq->blkdata;
@@ -983,7 +984,7 @@  static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)
 	bool rpmb_ioctl;
 	u8 **ext_csd;
 	u32 status;
-	int ret;
+	int ret, retry = 1;
 	int i;
 
 	mq_rq = req_to_mmc_queue_req(req);
@@ -994,9 +995,24 @@  static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)
 	case MMC_DRV_OP_IOCTL_RPMB:
 		idata = mq_rq->drv_op_data;
 		for (i = 0, ret = 0; i < mq_rq->ioc_count; i++) {
+cmd_do:
 			ret = __mmc_blk_ioctl_cmd(card, md, idata[i]);
-			if (ret)
+			if (ret == -ETIMEDOUT) {
+				dev_warn(mmc_dev(card->host),
+					 "error %d sending command\n", ret);
+cmd_reset:
+				mmc_blk_reset_success(md, type);
+				if (retry--) {
+					dev_warn(mmc_dev(card->host),
+						 "power cycling card\n");
+					if (mmc_blk_reset
+					    (md, card->host, type))
+						goto cmd_reset;
+					mmc_blk_reset_success(md, type);
+					goto cmd_do;
+				}
 				break;
+			}
 		}
 		/* Always switch back to main area after RPMB access */
 		if (rpmb_ioctl)