[v1,0/6] Complementary changes for error handling

Message ID 1620885319-15151-1-git-send-email-cang@codeaurora.org

Message

Can Guo May 13, 2021, 5:55 a.m. UTC
Commit cb7e6f05fce67c965194ac04467e1ba7bc70b069 ("scsi: ufs: core: Enable
power management for wlun") makes the UFS device W-LU the supplier, based
on which we need to make some changes to accommodate error handling.

This series is tested by fault injections (to IRQ handler, UIC cmds and
task abort where error handler can possibly be invoked) in all possible
contexts, e.g., scaling, gating, runtime and system suspend/resume.

The changes below were tested as a whole and are based on 5.14/scsi-queue.

Can Guo (6):
  scsi: ufs: Differentiate status between hba pm ops and wl pm ops
  scsi: ufs: Update the return value of supplier pm ops
  scsi: ufs: Simplify error handling preparation
  scsi: ufs: Update ufshcd_recover_pm_error()
  scsi: ufs: Let host_sem cover the entire system suspend/resume
  scsi: ufs: Update the fast abort path in ufshcd_abort() for PM
    requests

 drivers/scsi/ufs/ufshcd.c | 180 +++++++++++++++++++++++++---------------------
 drivers/scsi/ufs/ufshcd.h |   4 +-
 2 files changed, 103 insertions(+), 81 deletions(-)

Comments

Bart Van Assche May 14, 2021, 4:05 a.m. UTC | #1
On 5/12/21 10:55 PM, Can Guo wrote:
> If PM requests fail during runtime suspend/resume, RPM framework saves the
> error to dev->power.runtime_error. Before the runtime_error gets cleared,
> runtime PM on this specific device won't work again, leaving the device
> in either suspended or active state permanently.
>
> When task abort happens to a PM request sent during runtime suspend/resume,
> even if it can be successfully aborted, RPM framework anyways saves the
> (TIMEOUT) error. But we want more and we can do better - let error handling
> recover and clear the runtime_error. So, let PM requests take the fast
> abort path in ufshcd_abort().

The only RQF_PM requests I know of are START STOP UNIT and SYNCHRONIZE
CACHE. Are there devices for which these commands can time out or do
these commands perhaps only time out as the result of error injection?

> -	if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN) {
> +	if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN ||
> +	    (cmd->request->rq_flags & RQF_PM)) {

Which are the RQF_PM commands that are not sent to a WLUN? Are these
START STOP UNIT and SYNCHRONIZE CACHE only?

Thanks,

Bart.
Can Guo May 14, 2021, 4:17 a.m. UTC | #2
On 2021-05-14 12:05, Bart Van Assche wrote:
> On 5/12/21 10:55 PM, Can Guo wrote:
>> If PM requests fail during runtime suspend/resume, RPM framework saves the
>> error to dev->power.runtime_error. Before the runtime_error gets cleared,
>> runtime PM on this specific device won't work again, leaving the device
>> in either suspended or active state permanently.
>>
>> When task abort happens to a PM request sent during runtime suspend/resume,
>> even if it can be successfully aborted, RPM framework anyways saves the
>> (TIMEOUT) error. But we want more and we can do better - let error handling
>> recover and clear the runtime_error. So, let PM requests take the fast
>> abort path in ufshcd_abort().
>
> The only RQF_PM requests I know of are START STOP UNIT and SYNCHRONIZE
> CACHE. Are there devices for which these commands can time out or do
> these commands perhaps only time out as the result of error injection?

There are also REQUEST SENSE requests sent with the RQF_PM flag set from
pm ops. And they do time out (the device does not respond within 60s) in
real cases; I have seen quite a lot of related issues reported by
customers over the years.

>> -	if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN) {
>> +	if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN ||
>> +	    (cmd->request->rq_flags & RQF_PM)) {
>
> Which are the RQF_PM commands that are not sent to a WLUN? Are these
> START STOP UNIT and SYNCHRONIZE CACHE only?

There are also REQUEST SENSE cmds sent to the RPMB W-LU, in
ufshcd_add_wlus(), ufshcd_err_handler() and ufshcd_rpmb_resume() and/or
ufshcd_wl_resume().

And SYNCHRONIZE CACHE cmd is only sent to general LUs, but not W-LUs.
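
To make the widened check in the quoted hunk concrete, here is a minimal
user-space sketch of the condition only. It is not the driver code: the
struct, the function name and both constant values are placeholders
assumed for illustration, and the real definitions live in the kernel
headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration only (assumptions, not kernel definitions). */
#define SKETCH_DEVICE_WLUN 0xD0u
#define SKETCH_RQF_PM      (1u << 0)

/* Hypothetical stand-in for the fields the check looks at. */
struct sketch_cmd {
	uint8_t  lun;       /* target LUN of the command */
	uint32_t rq_flags;  /* request flags; SKETCH_RQF_PM marks a PM request */
};

/*
 * Returns true if the command would take the fast abort path under the
 * widened condition: either it targets the device W-LU, or it is a PM
 * request sent to any LU.
 */
static bool sketch_takes_fast_abort_path(const struct sketch_cmd *c)
{
	return c->lun == SKETCH_DEVICE_WLUN ||
	       (c->rq_flags & SKETCH_RQF_PM) != 0;
}
```

With the old condition only the first operand applied, so a PM request
sent to a general LU or to the RPMB W-LU would not have matched.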

Thanks,
Can Guo.

> Thanks,
>
> Bart.
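
The latched-error behaviour described in the quoted commit message can be
illustrated with a small user-space model. This is a simplified sketch
under stated assumptions, not the runtime PM core: every name prefixed
with sketch_ is hypothetical, and the model ignores details such as which
error codes the real framework declines to latch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the per-device state discussed in the thread. */
struct sketch_dev {
	int  runtime_error;  /* latched error from a failed PM callback */
	bool suspended;
};

/* Example callbacks: one succeeds, one fails as if a PM request timed out. */
static int sketch_cb_ok(void)      { return 0; }
static int sketch_cb_timeout(void) { return -ETIMEDOUT; }

/*
 * Simplified suspend attempt: refuse while an error is latched, otherwise
 * run the callback and latch any error it returns.
 */
static int sketch_rpm_suspend(struct sketch_dev *d, int (*cb)(void))
{
	int ret;

	if (d->runtime_error)
		return -EINVAL;  /* stuck until the error is cleared */

	ret = cb();
	if (ret) {
		d->runtime_error = ret;
		return ret;
	}
	d->suspended = true;
	return 0;
}

/* Stand-in for the recovery step that clears the latched error. */
static void sketch_clear_error(struct sketch_dev *d)
{
	d->runtime_error = 0;
}
```

In this model a single timed-out callback blocks every later suspend
attempt until sketch_clear_error() runs, which mirrors why the series
wants error handling to recover and clear runtime_error.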