diff mbox series

mmc: block: add reset workaround for partition switch failures

Message ID 20250224045918.3321394-1-guan.wang.jy@renesas.com
State New
Series mmc: block: add reset workaround for partition switch failures

Commit Message

Guan Wang Feb. 24, 2025, 4:59 a.m. UTC
Some eMMC devices (e.g., BGSD4R and AIM20F) may enter an unresponsive state
after encountering CRC errors during RPMB writes (CMD25). This prevents the
device from switching back to the main partition via CMD6, blocking further
I/O operations.

The root cause is suspected to be a firmware/hardware issue in specific
eMMC models. A workaround is to perform a hardware reset via mmc_hw_reset()
when the partition switch fails, followed by a retry.

Add a workaround that:
1. Detects a failed switch back to the main partition after RPMB access
2. Resets the card using mmc_hw_reset()
3. Retries the switch to the main partition
This helps recover devices that become unresponsive after
RPMB operations.

Signed-off-by: Guan Wang <guan.wang.jy@renesas.com>
---
 drivers/mmc/core/block.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Comments

Avri Altman Feb. 27, 2025, 7:50 a.m. UTC | #1
Hi,
> Some eMMC devices (e.g., BGSD4R and AIM20F) may enter an unresponsive state
> after encountering CRC errors during RPMB writes (CMD25). This prevents the
> device from switching back to the main partition via CMD6, blocking further
> I/O operations.
Different cards on the same platform?
Can you share which platform, and a few lines from the log supporting your analysis?

> 
> The root cause is suspected to be a firmware/hardware issue in specific
> eMMC models. A workaround is to perform a hardware reset via mmc_hw_reset()
> when the partition switch fails, followed by a retry.
Same fw bug in 2 different products?

Why do we need to fix it here?
The ioctl will eventually return an error, and reset is needed anyway.
If the eMMC is the primary storage, the platform is rebooting without being aware of what went wrong.

Thanks,
Avri

> 
> Add a workaround that:
> 1. If initial partition switch fails after rpmb access
> 2. Performs mmc card reset using mmc_hw_reset()
> 3. Retries switching to main partition
> This helps resolve cases where the device becomes unresponsive after
> RPMB operations.
> 
> Signed-off-by: Guan Wang <guan.wang.jy@renesas.com>
> ---
>  drivers/mmc/core/block.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index 4830628510e6..29388786624c 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -1174,8 +1174,24 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)
>  				break;
>  		}
>  		/* Always switch back to main area after RPMB access */
> -		if (rpmb_ioctl)
> -			mmc_blk_part_switch(card, 0);
> +		if (rpmb_ioctl) {
> +			if (mmc_blk_part_switch(card, 0)) {
> +				pr_warn("%s: failed to switch back to main area, will reset and switch again\n",
> +						md->disk->disk_name);
> +
> +				/*
> +				 * Reset eMMC device if partition switch fails.
> +				 * Some eMMC devices may get stuck by write CRC error in RPMB,
> +				 * preventing switch back to main partition. This workaround
> +				 * helps recover from this error state.
> +				 */
> +				mmc_hw_reset(card);
> +
> +				if (mmc_blk_part_switch(card, 0))
> +					pr_err("%s: failed to switch back to main area even after reset\n",
> +						   md->disk->disk_name);
> +			}
> +		}
>  		else if (card->reenable_cmdq && !card->ext_csd.cmdq_en)
>  			mmc_cmdq_enable(card);
>  		break;
> --
> 2.25.1
Guan Wang March 3, 2025, 2:13 a.m. UTC | #2
Hello,
>> Some eMMC devices (e.g., BGSD4R and AIM20F) may enter an unresponsive 
>> state after encountering CRC errors during RPMB writes (CMD25). This 
>> prevents the device from switching back to the main partition via 
>> CMD6, blocking further I/O operations.
>Different cards on the same platform?
>Can you share which platform, and few lines from the log supporting your analysis?

I tested on R-Car Gen3/4 platforms, which use the same host controller IP and the tmio_mmc host driver.
The tests were conducted on different board and eMMC combinations:
- Gen3 Board with Samsung eMMC (BGSD4R) → Issue observed
- Gen3 Board with Micron eMMC (AIM20F, new version) → Issue observed
- Gen3 Board with Micron eMMC (AIM20F, old version) → No issue
- Gen4 Board with Micron eMMC (G1M15L) → No issue

The issue only occurs during write operations to the RPMB partition when a CRC error is triggered.
To investigate further, I hacked the host driver to inject a dummy CRC error during the CMD25 data phase.
The reproduced log is as follows:
$ ./mmc rpmb read-counter /dev/mmcblk0rpmb
[   75.557848] w_t: -->START_CMD6 (arg: 3b30301)
[   75.557863] w_t:    resp[0]=900
[   75.557875] w_t: -->START_CMD13 (arg: 10000)
[   75.557884] w_t:    resp[0]=900
[   75.557894] w_t: -->START_CMD23 (arg: 1)
[   75.557903] w_t:    resp[0]=900
[   75.557915] w_t: -->START_CMD25 (arg: 0)
[   75.557924] w_t:    resp[0]=900
[   75.557931] !!!!!!!!!!!!!!!!, make a dummy write CRC on DAT
[   75.563631] w_t: (data_err) -84 stat=20820604 error=5800 (which means the eMMC device returned a negative CRC status)
[   75.563672] renesas_sdhi_internal_dmac ee140000.sd: __mmc_blk_ioctl_cmd: data error -84
[   75.573112] w_t: -->START_CMD6 (arg: 3b30001)
[   75.573132] w_t: (cmd_err -110) stat=20c00401 error=12000
[   75.573154] w_t: -->START_CMD6 (arg: 3b30001)
[   75.573169] w_t: (cmd_err -110) stat=20c00401 error=12000
[   75.573183] w_t: -->START_CMD6 (arg: 3b30001)
[   75.573197] w_t: (cmd_err -110) stat=20c00401 error=12000
[   75.573211] w_t: -->START_CMD6 (arg: 3b30001)
[   75.573225] w_t: (cmd_err -110) stat=20c00401 error=12000
After this issue occurs, the eMMC device no longer responds to CMD6, and even subsequent accesses to the main partition behave abnormally.
However, if we reset the eMMC card at this point, the CMD6 retry works as expected.

BTW,
I now believe that sending CMD12 is a better solution in this case than performing a reset.
According to information from the eMMC vendor, even in a closed-ended write operation (CMD23 + CMD25), CMD12 is required if any communication error occurs.
The JESD84 specification also mentions a similar requirement: "A stop command is not required at the end of this type of multiple block write unless terminated with an error."
I did a quick test of this approach on the affected board, and it works.

>> 
>> The root cause is suspected to be a firmware/hardware issue in 
>> specific eMMC models. A workaround is to perform a hardware reset via
>> mmc_hw_reset()
>> when the partition switch fails, followed by a retry.
>Same fw bug in 2 different products?
>
>Why do we need to fix it here?
>The ioctl will eventually return an error, and reset is needed anyway.
>If the eMMC is the primary storage,  the platform is rebooting without being aware what went wrong.

In the main partition, a similar reset operation is already implemented in mmc_blk_issue_rw_rq(),
so I believe applying the same approach for RPMB should be acceptable.
		case MMC_BLK_ABORT:
			if (!mmc_blk_reset(md, card->host, type))
				break;
			mmc_blk_rw_cmd_abort(mq, card, old_req, mq_rq);
			mmc_blk_rw_try_restart(mq, new_req, mqrq_cur);
			return;


Best Regards,
Guan Wang
Avri Altman March 3, 2025, 8:51 a.m. UTC | #3
> Hello,
> >> Some eMMC devices (e.g., BGSD4R and AIM20F) may enter an
> unresponsive
> >> state after encountering CRC errors during RPMB writes (CMD25). This
> >> prevents the device from switching back to the main partition via
> >> CMD6, blocking further I/O operations.
> >Different cards on the same platform?
> >Can you share which platform, and few lines from the log supporting your
> analysis?
> 
> I tested on R-Car Gen3/4 platforms, which use the same host controller IP and
> the tmio_mmc host driver.
> The tests were conducted on different board and eMMC combinations:
> - Gen3 Board with Samsung eMMC (BGSD4R) → Issue observed
> - Gen3 Board with Micron eMMC (AIM20F, new version) → Issue observed
> - Gen3 Board with Micron eMMC (AIM20F, old version) → No issue
> - Gen4 Board with Micron eMMC (G1M15L) → No issue
> 
> The issue only occurs in the RPMB partition during write operations, where a
> CRC error is triggered.
> To investigate further, I hacked the host driver to generate a dummy CRC
> during the CMD25 data phase.
> The reproduced log is as follows:
> $ ./mmc rpmb read-counter /dev/mmcblk0rpmb
> [   75.557848] w_t: -->START_CMD6 (arg: 3b30301)
> [   75.557863] w_t:    resp[0]=900
> [   75.557875] w_t: -->START_CMD13 (arg: 10000)
> [   75.557884] w_t:    resp[0]=900
> [   75.557894] w_t: -->START_CMD23 (arg: 1)
> [   75.557903] w_t:    resp[0]=900
> [   75.557915] w_t: -->START_CMD25 (arg: 0)
> [   75.557924] w_t:    resp[0]=900
> [   75.557931] !!!!!!!!!!!!!!!!, make a dummy write CRC on DAT
> [   75.563631] w_t: (data_err) -84 stat=20820604 error=5800 (which means the eMMC device returned a negative CRC status)
> [   75.563672] renesas_sdhi_internal_dmac ee140000.sd: __mmc_blk_ioctl_cmd: data error -84
> [   75.573112] w_t: -->START_CMD6 (arg: 3b30001)
> [   75.573132] w_t: (cmd_err -110) stat=20c00401 error=12000
> [   75.573154] w_t: -->START_CMD6 (arg: 3b30001)
> [   75.573169] w_t: (cmd_err -110) stat=20c00401 error=12000
> [   75.573183] w_t: -->START_CMD6 (arg: 3b30001)
> [   75.573197] w_t: (cmd_err -110) stat=20c00401 error=12000
> [   75.573211] w_t: -->START_CMD6 (arg: 3b30001)
> [   75.573225] w_t: (cmd_err -110) stat=20c00401 error=12000
> After this issue occurs, the eMMC device no longer responds to CMD6, even
> subsequent accesses to the main partition proceed abnormally.
> However, if we perform an eMMC card reset at this point, the retry of CMD6
> works as expected.
Thank you for sharing it.

> 
> BTW,
> I now believe that sending CMD12 is a better solution in this case rather than
> performing a reset.
> According to information from the eMMC vendor, even in a closed-end write
> operation (CMD23 + CMD25), CMD12 is required if any communication error
> occurs.
> The JESD84 specification also mentions a similar requirement: "A stop
> command is not required at the end of this type of multiple block write unless
> terminated with an error."
> I just simply tested this approach on the affected board, and it can work
> successfully.
OK.
Please note that some host controllers do that as auto-cmd.

> 
> >>
> >> The root cause is suspected to be a firmware/hardware issue in
> >> specific eMMC models. A workaround is to perform a hardware reset via
> >> mmc_hw_reset()
> >> when the partition switch fails, followed by a retry.
> >Same fw bug in 2 different products?
> >
> >Why do we need to fix it here?
> >The ioctl will eventually return an error, and reset is needed anyway.
> >If the eMMC is the primary storage,  the platform is rebooting without being
> aware what went wrong.
> 
> In the main partition, a similar reset operation is already implemented in
> mmc_blk_issue_rw_rq(), So I believe applying the same approach for RPMB
> should be acceptable.
> 		case MMC_BLK_ABORT:
> 			if (!mmc_blk_reset(md, card->host, type))
> 				break;
> 			mmc_blk_rw_cmd_abort(mq, card, old_req, mq_rq);
> 			mmc_blk_rw_try_restart(mq, new_req, mqrq_cur);
> 			return;
The code that you are citing no longer exists.
It was removed a while ago - see https://lore.kernel.org/linux-block/1511962879-24262-23-git-send-email-adrian.hunter@intel.com/

My point is that you are silently recovering from an ioctl error that the sender should be made aware of, so that it can handle the recovery itself.

Thanks,
Avri

> 
> 
> Best Regards,
> Guan Wang

Patch

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 4830628510e6..29388786624c 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1174,8 +1174,24 @@  static void mmc_blk_issue_drv_op(struct mmc_queue *mq, struct request *req)
 				break;
 		}
 		/* Always switch back to main area after RPMB access */
-		if (rpmb_ioctl)
-			mmc_blk_part_switch(card, 0);
+		if (rpmb_ioctl) {
+			if (mmc_blk_part_switch(card, 0)) {
+				pr_warn("%s: failed to switch back to main area, will reset and switch again\n",
+						md->disk->disk_name);
+
+				/*
+				 * Reset eMMC device if partition switch fails.
+				 * Some eMMC devices may get stuck by write CRC error in RPMB,
+				 * preventing switch back to main partition. This workaround
+				 * helps recover from this error state.
+				 */
+				mmc_hw_reset(card);
+
+				if (mmc_blk_part_switch(card, 0))
+					pr_err("%s: failed to switch back to main area even after reset\n",
+						   md->disk->disk_name);
+			}
+		}
 		else if (card->reenable_cmdq && !card->ext_csd.cmdq_en)
 			mmc_cmdq_enable(card);
 		break;