Message ID | 20240910073035.25974-3-peter.wang@mediatek.com |
---|---|
State | New |
Headers | show |
Series | fix abort defect | expand |
On Wed, 2024-09-11 at 12:11 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > On 9/10/24 11:03 PM, Peter Wang (王信友) wrote: > > This statement is not quite accurate becasue in UFSHIC2.1, SDB mode > > specification already have OCS: ABORTED (0x6) define. > > And it is used in below UTRLCLR description: > > 'which means a Transfer Request was "aborted"' > > Therefore, the host controller should follow the > > specification and fill the OCS field with OCS: ABORTED. > > If not so, at what point does your host controller use the > > OCS: ABORTED status? > > Hmm ... I have not been able to find any explanation in the UFSHCI > 2.1 > specification that says when the OCS status is set to aborted. Did I > perhaps overlook something? > > This is what I found in the UTRLCLR description: "The host software > shall use this field only when a UTP Transfer Request is expected to > not be completed, e.g., when the host software receives a “FUNCTION > COMPLETE” Task Management response which means a Transfer Request was > aborted." This does not mean that the host controller is expected to > set the OCS status to "ABORTED". I will send an email to the JC-64 > mailing list to request clarification. > Hi Bart, Yes, you're right, just as I mentioned earlier, I also think the spec does not explicitly state that UTRLC should have corresponding behavior for OCS. This might lead to inconsistencies in how host controllers operate. > >>> +/* > >>> + * When the host software receives a "FUNCTION COMPLETE", set > flag > >>> + * to requeue command after receive response with OCS_ABORTED > >>> + * SDB mode: UTRLCLR Task Management response which means a > >> Transfer > >>> + * Request was aborted. > >>> + * MCQ mode: Host will post to CQ with OCS_ABORTED after SQ > >> cleanup > >>> + * This flag is set because ufshcd_abort_all forcibly aborts all > >>> + * commands, and the host will automatically fill in the OCS > field > >>> + * of the corresponding response with OCS_ABORTED. > >>> + * Therefore, upon receiving this response, it needs to be > >> requeued. > >>> + */ > >>> +if (!err) > >>> +lrbp->abort_initiated_by_err = true; > >>> + > >>> err = ufshcd_clear_cmd(hba, tag); > >>> if (err) > >>> dev_err(hba->dev, "%s: Failed clearing cmd at tag %d, err > %d\n", > >> > >> The above change is misplaced. ufshcd_try_to_abort_task() can be > >> called > >> when the SCSI core decides to abort a command while > >> abort_initiated_by_err must not be set in that case. Please move > the > >> above code block into ufshcd_abort_one(). > > > > But move to ufshcd_abort_one may have race condition, beacause we > > need set this flag before ufshcd_clear_cmd host controller fill > > OCS_ABORTED to response. I will add check ufshcd_eh_in_progress. > > Calling ufshcd_clear_cmd() does not affect the OCS status as far as I > know. Did I perhaps overlook something? > Because after ufshcd_clear_cmd, in MCQ mode: ufshcd_mcq_sq_cleanup, host controller will post CQ response with OCS ABORTED. in SDB mode: ufshcd_utrl_clear set UTRLC, Mediatek host controller (may not all host controller) will post response with OCS ABORTED. In both cases, we have an interrupt sent to the host, and there may be a race condition before we set this flag for requeue. So I need to set this flag before ufshcd_clear_cmd. Thanks. Peter > Thanks, > > Bart.
On 9/12/24 6:31 AM, Peter Wang (王信友) wrote: > in SDB mode: > ufshcd_utrl_clear set UTRLC, Mediatek host controller > (may not all host controller) will post response with OCS ABORTED. > > In both cases, we have an interrupt sent to the host, and there > may be a race condition before we set this flag for requeue. > So I need to set this flag before ufshcd_clear_cmd. If a completion interrupt is sent to the host if a command has been cleared in SDB mode (I doubt this is what happens), I think that's a severe controller bug. A UFSHCI controller is not allowed to send a completion interrupt to the host if a command is cleared by writing into the UTRLCLR register. Thanks, Bart.
On Thu, 2024-09-12 at 14:17 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > On 9/12/24 6:31 AM, Peter Wang (王信友) wrote: > > in SDB mode: > > ufshcd_utrl_clear set UTRLC, Mediatek host controller > > (may not all host controller) will post response with OCS ABORTED. > > > > In both cases, we have an interrupt sent to the host, and there > > may be a race condition before we set this flag for requeue. > > So I need to set this flag before ufshcd_clear_cmd. > > If a completion interrupt is sent to the host if a command has been > cleared in SDB mode (I doubt this is what happens), I think that's a > severe controller bug. A UFSHCI controller is not allowed to send a > completion interrupt to the host if a command is cleared by writing > into > the UTRLCLR register. > > Thanks, > > Bart. Hi Bart, Because the MediaTek UFS controller uses UTRLCLR to clear commands and fills OCS with ABORTED. Regarding the specification of UTRCS: This bit is set to '1' by the host controller upon one of the following: Overall command Status (OCS) of the completed command is not equal to 'SUCCESS' even if its UTRD Interrupt bit set to '0' So, MediaTek host controller will send interrupt in this case. Thanks. Peter
On 9/13/24 12:10 AM, Peter Wang (王信友) wrote: > Because the MediaTek UFS controller uses UTRLCLR to clear > commands and fills OCS with ABORTED. > > Regarding the specification of UTRCS: > This bit is set to '1' by the host controller upon one of the > following: > Overall command Status (OCS) of the completed command is not > equal to 'SUCCESS' even if its UTRD Interrupt bit set to '0' > > So, MediaTek host controller will send interrupt in this case. Hi Peter, Thank you for having shared this information. Please consider introducing a quirk for ignoring completions triggered by clearing a command, e.g. as follows (there may be better approaches): * In ufshcd_clear_cmd(), before a command is cleared, initialize the completion that will be used for waiting for the completion interrupt. After a command has been cleared, call wait_for_completion_timeout(). * In ufshcd_compl_one_cqe(), check whether the completion is the result of a command being cleared. If so, call complete() instead of executing the regular completion code. Thanks, Bart.
On 9/10/24 11:03 PM, Peter Wang (王信友) wrote: > This statement is not quite accurate becasue in UFSHIC2.1, SDB mode > specification already have OCS: ABORTED (0x6) define. I think I found why that status code is defined in the UFSHCI 2.1 standard. From the UFS 2.0 standard: "TASK ABORTED - This status shall be returned when a command is aborted by a command or task management function on another I_T nexus and the Control mode page TAS bit is set to one. Since in UFS there is only one I_T nexus and TAS bit is zero TASK ABORTED status codes will never occur." Bart.
On Fri, 2024-09-13 at 10:41 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > On 9/13/24 12:10 AM, Peter Wang (王信友) wrote: > > Because the MediaTek UFS controller uses UTRLCLR to clear > > commands and fills OCS with ABORTED. > > > > Regarding the specification of UTRCS: > > This bit is set to '1' by the host controller upon one of the > > following: > > Overall command Status (OCS) of the completed command is not > > equal to 'SUCCESS' even if its UTRD Interrupt bit set to '0' > > > > So, MediaTek host controller will send interrupt in this case. > > Hi Peter, > > Thank you for having shared this information. Please consider > introducing a quirk for ignoring completions triggered by clearing > a command, e.g. as follows (there may be better approaches): > * In ufshcd_clear_cmd(), before a command is cleared, initialize > the completion that will be used for waiting for the completion > interrupt. After a command has been cleared, call > wait_for_completion_timeout(). > * In ufshcd_compl_one_cqe(), check whether the completion is the > result of a command being cleared. If so, call complete() instead > of executing the regular completion code. > > Thanks, > > Bart. Hi Bart, Sorry for the lack of response over the past few days due to the Mid-Autumn Festival in Taiwan. I think this would make the abort flow more complicated and difficult to understand. Basically, this patch currently only needs to handle requeueing for the error handler abort. The approach for DBR mode and MCQ mode should be consistent. If receive an interrupt response (OCS:ABORTED or INVALID_OCS_VALUE), then set DID_REQUEUE. If there is no interrupt, it will also set SCSI DID_REQUEUE in ufshcd_err_handler through ufshcd_complete_requests with force_compl = true. As for the SCSI abort (30s) situation, in DBR mode, if there is an interrupt with OCS:ABORTED or INVALID_OCS_VALUE, then set DID_REQUEUE; if not, ufshcd_abort will also return SUCCESS, allowing SCSI to decide whether to retry the command. This part is less problematic, and we can keep the original approach. The more problematic part is with MCQ mode. To imitate the DBR approach, we just need to set DID_REQUEUE upon receiving an interrupt. Everything else remains the same. This would make things simpler. Moving forward, if we want to simplify things and we have also taken stock of the two or three scenarios where OCS: ABORTED occurs, do we even need a flag? Couldn't we just set DID_REQUEUE directly for OCS: ABORTED? What do you think? Thanks. Peter
On Sat, 2024-09-14 at 09:13 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > On 9/10/24 11:03 PM, Peter Wang (王信友) wrote: > > This statement is not quite accurate becasue in UFSHIC2.1, SDB mode > > specification already have OCS: ABORTED (0x6) define. > > I think I found why that status code is defined in the UFSHCI 2.1 > standard. From the UFS 2.0 standard: "TASK ABORTED - This status > shall > be returned when a command is aborted by a command or task management > function on another I_T nexus and the Control mode page TAS bit is > set > to one. Since in UFS there is only one I_T nexus and TAS bit is zero > TASK ABORTED status codes will never occur." > > Bart. > Hi Bart, Okay, I think this is a very good reason that explains why UFSHCI2.1 has defined OCS: ABORTED. Thank you for sharing this information. Thanks. Peter
On 9/18/24 6:29 AM, Peter Wang (王信友) wrote: > Basically, this patch currently only needs to handle requeueing > for the error handler abort. > The approach for DBR mode and MCQ mode should be consistent. > If receive an interrupt response (OCS:ABORTED or INVALID_OCS_VALUE), > then set DID_REQUEUE. If there is no interrupt, it will also set > SCSI DID_REQUEUE in ufshcd_err_handler through > ufshcd_complete_requests > with force_compl = true. Reporting a completion for commands cleared by writing into the legacy UTRLCLR register is not compliant with any version of the UFSHCI standard. Reporting a completion for commands cleared by writing into that register is problematic because it causes ufshcd_release_scsi_cmd() to be called as follows: ufshcd_sl_intr() ufshcd_transfer_req_compl() ufshcd_poll() __ufshcd_transfer_req_compl() ufshcd_compl_one_cqe() cmd->result = ... ufshcd_release_scsi_cmd() scsi_done() Calling ufshcd_release_scsi_cmd() if a command has been cleared is problematic because the SCSI core does not expect this. If ufshcd_try_to_abort_task() clears a SCSI command, ufshcd_release_scsi_cmd() must not be called until the SCSI core decides to release the command. This is why I wrote in a previous mail that I think that a quirk should be introduced to suppress the completions generated by clearing a SCSI command. > The more problematic part is with MCQ mode. To imitate the DBR > approach, we just need to set DID_REQUEUE upon receiving an interrupt. > Everything else remains the same. This would make things simpler. > > Moving forward, if we want to simplify things and we have also > taken stock of the two or three scenarios where OCS: ABORTED occurs, > do we even need a flag? Couldn't we just set DID_REQUEUE directly > for OCS: ABORTED? > What do you think? How about making ufshcd_compl_one_cqe() skip entries with status OCS_ABORTED? That would make ufshcd_compl_one_cqe() behave as the SCSI core expects, namely not freeing any command resources if a SCSI command is aborted successfully. This approach may require further changes to ufshcd_abort_all(). In that function there are separate code paths for legacy and MCQ mode. This is less than ideal. Would it be possible to combine these code paths by removing the ufshcd_complete_requests() call from ufshcd_abort_all() and by handling completions from inside ufshcd_abort_one()? Thanks, Bart.
On Wed, 2024-09-18 at 11:29 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > On 9/18/24 6:29 AM, Peter Wang (王信友) wrote: > > Basically, this patch currently only needs to handle requeueing > > for the error handler abort. > > The approach for DBR mode and MCQ mode should be consistent. > > If receive an interrupt response (OCS:ABORTED or > INVALID_OCS_VALUE), > > then set DID_REQUEUE. If there is no interrupt, it will also set > > SCSI DID_REQUEUE in ufshcd_err_handler through > > ufshcd_complete_requests > > with force_compl = true. > > Reporting a completion for commands cleared by writing into the > legacy > UTRLCLR register is not compliant with any version of the UFSHCI > standard. Reporting a completion for commands cleared by writing into > that register is problematic because it causes > ufshcd_release_scsi_cmd() > to be called as follows: > > ufshcd_sl_intr() > ufshcd_transfer_req_compl() > ufshcd_poll() > __ufshcd_transfer_req_compl() > ufshcd_compl_one_cqe() > cmd->result = ... > ufshcd_release_scsi_cmd() > scsi_done() > > Calling ufshcd_release_scsi_cmd() if a command has been cleared is > problematic because the SCSI core does not expect this. If > ufshcd_try_to_abort_task() clears a SCSI command, > ufshcd_release_scsi_cmd() must not be called until the SCSI core > decides to release the command. This is why I wrote in a previous > mail > that I think that a quirk should be introduced to suppress the > completions generated by clearing a SCSI command. > Hi Bart, I'm not sure if I'm misunderstanding your point, but I feel that ufshcd_release_scsi_cmd should always be called. It's scsi_done that shouldn't be called, as it should be left to the SCSI layer to decide how to handle this command. Because ufshcd_release_scsi_cmd is just about releasing resources related to ufshcd_map_sg and the clock at the UFS driver level. scsi_done is what notifies the SCSI layer that the cmd has finished, asking it to look at the result to decide how to proceed. > > The more problematic part is with MCQ mode. To imitate the DBR > > approach, we just need to set DID_REQUEUE upon receiving an > interrupt. > > Everything else remains the same. This would make things simpler. > > > > Moving forward, if we want to simplify things and we have also > > taken stock of the two or three scenarios where OCS: ABORTED > occurs, > > do we even need a flag? Couldn't we just set DID_REQUEUE directly > > for OCS: ABORTED? > > What do you think? > > How about making ufshcd_compl_one_cqe() skip entries with status > OCS_ABORTED? That would make ufshcd_compl_one_cqe() behave as the > SCSI core expects, namely not freeing any command resources if a > SCSI command is aborted successfully. > > This approach may require further changes to ufshcd_abort_all(). > In that function there are separate code paths for legacy and MCQ > mode. This is less than ideal. Would it be possible to combine > these code paths by removing the ufshcd_complete_requests() call > from ufshcd_abort_all() and by handling completions from inside > ufshcd_abort_one()? > > Thanks, > > Bart. The four case flows for abort are as follows: ---------------------------------------------------------------- Case1: DBR ufshcd_abort In this case, you can see that ufshcd_release_scsi_cmd will definitely be called. ufshcd_abort() ufshcd_try_to_abort_task() // It should trigger an interrupt, but the tensor might not get outstanding_lock clear outstanding_reqs tag ufshcd_release_scsi_cmd() release outstanding_lock ufshcd_intr() ufshcd_sl_intr() ufshcd_transfer_req_compl() ufshcd_poll() get outstanding_lock clear outstanding_reqs tag release outstanding_lock __ufshcd_transfer_req_compl() ufshcd_compl_one_cqe() cmd->result = DID_REQUEUE // mediatek may need quirk change DID_ABORT to DID_REQUEUE ufshcd_release_scsi_cmd() scsi_done(); In most cases, ufshcd_intr will not reach scsi_done because the outstanding_reqs tag is cleared by the original thread. Therefore, whether there is an interrupt or not doesn't affect the result because the ISR will do nothing in most cases. In a very low chance, the ISR will reach scsi_done and notify SCSI to requeue, and the original thread will not call ufshcd_release_scsi_cmd. MediaTek may need to change DID_ABORT to DID_REQUEUE in this situation, or perhaps not handle this ISR at all. ---------------------------------------------------------------- Case2: MCQ ufshcd_abort In the case of MCQ ufshcd_abort, you can also see that ufshcd_release_scsi_cmd will definitely be called too. However, there seems to be a problem here, as ufshcd_release_scsi_cmd might be called twice. This is because cmd is not null in ufshcd_release_scsi_cmd, which the previous version would set cmd to null. Skipping OCS: ABORTED in ufshcd_compl_one_cqe indeed can avoid this problem. This part needs further consideration on how to handle it. ufshcd_abort() ufshcd_mcq_abort() ufshcd_try_to_abort_task() // will trigger ISR ufshcd_release_scsi_cmd() ufs_mtk_mcq_intr() ufshcd_mcq_poll_cqe_lock() ufshcd_mcq_process_cqe() ufshcd_compl_one_cqe() cmd->result = DID_ABORT ufshcd_release_scsi_cmd() // will release twice scsi_done() ---------------------------------------------------------------- Case3: DBR ufshcd_err_handler In the case of the DBR mode error handler, it's the same; ufshcd_release_scsi_cmd will also be executed, and scsi_done will definitely be used to notify SCSI to requeue. ufshcd_err_handler() ufshcd_abort_all() ufshcd_abort_one() ufshcd_try_to_abort_task() // It should trigger an interrupt, but the tensor might not ufshcd_complete_requests() ufshcd_transfer_req_compl() ufshcd_poll() get outstanding_lock clear outstanding_reqs tag release outstanding_lock __ufshcd_transfer_req_compl() ufshcd_compl_one_cqe() cmd->result = DID_REQUEUE // mediatek may need quirk change DID_ABORT to DID_REQUEUE ufshcd_release_scsi_cmd() scsi_done() ufshcd_intr() ufshcd_sl_intr() ufshcd_transfer_req_compl() ufshcd_poll() get outstanding_lock clear outstanding_reqs tag release outstanding_lock __ufshcd_transfer_req_compl() ufshcd_compl_one_cqe() cmd->result = DID_REQUEUE // mediatek may need quirk change DID_ABORT to DID_REQUEUE ufshcd_release_scsi_cmd() scsi_done(); At this time, the same actions are taken regardless of whether there is an ISR, and with the protection of outstanding_lock, only one thread will execute ufshcd_release_scsi_cmd and scsi_done. ---------------------------------------------------------------- Case4: MCQ ufshcd_err_handler It's the same with MCQ mode; there is protection from the cqe lock, so only one thread will execute. What my patch 2 aims to do is to change DID_ABORT to DID_REQUEUE in this situation. ufshcd_err_handler() ufshcd_abort_all() ufshcd_abort_one() ufshcd_try_to_abort_task() // will trigger irq thread ufshcd_complete_requests() ufshcd_mcq_compl_pending_transfer() ufshcd_mcq_poll_cqe_lock() ufshcd_mcq_process_cqe() ufshcd_compl_one_cqe() cmd->result = DID_ABORT // should change to DID_REQUEUE ufshcd_release_scsi_cmd() scsi_done() ufs_mtk_mcq_intr() ufshcd_mcq_poll_cqe_lock() ufshcd_mcq_process_cqe() ufshcd_compl_one_cqe() cmd->result = DID_ABORT // should change to DID_REQUEUE ufshcd_release_scsi_cmd() scsi_done() Thanks Peter
On 9/19/24 5:16 AM, Peter Wang (王信友) wrote: > The four case flows for abort are as follows: > ---------------------------------------------------------------- > > Case1: DBR ufshcd_abort Please follow the terminology from the UFSHCI 4.0 standard and use the word "legacy" instead of "DBR". > In this case, you can see that ufshcd_release_scsi_cmd will > definitely be called. > > ufshcd_abort() > ufshcd_try_to_abort_task() // It should trigger an > interrupt, but the tensor might not > get outstanding_lock > clear outstanding_reqs tag > ufshcd_release_scsi_cmd() > release outstanding_lock > > ufshcd_intr() > ufshcd_sl_intr() > ufshcd_transfer_req_compl() > ufshcd_poll() > get outstanding_lock > clear outstanding_reqs tag > release outstanding_lock > __ufshcd_transfer_req_compl() > ufshcd_compl_one_cqe() > cmd->result = DID_REQUEUE // mediatek may need quirk > change DID_ABORT to DID_REQUEUE > ufshcd_release_scsi_cmd() > scsi_done(); > > In most cases, ufshcd_intr will not reach scsi_done because the > outstanding_reqs tag is cleared by the original thread. > Therefore, whether there is an interrupt or not doesn't affect > the result because the ISR will do nothing in most cases. > > In a very low chance, the ISR will reach scsi_done and notify > SCSI to requeue, and the original thread will not > call ufshcd_release_scsi_cmd. > MediaTek may need to change DID_ABORT to DID_REQUEUE in this > situation, or perhaps not handle this ISR at all. Please modify ufshcd_compl_one_cqe() such that it ignores commands with status OCS_ABORTED. This will make the UFSHCI driver behave in the same way for all UFSHCI controllers, whether or not clearing a command triggers a completion interrupt. > ---------------------------------------------------------------- > > Case2: MCQ ufshcd_abort > > In the case of MCQ ufshcd_abort, you can also see that > ufshcd_release_scsi_cmd will definitely be called too. > However, there seems to be a problem here, as > ufshcd_release_scsi_cmd might be called twice. > This is because cmd is not null in ufshcd_release_scsi_cmd, > which the previous version would set cmd to null. > Skipping OCS: ABORTED in ufshcd_compl_one_cqe indeed > can avoid this problem. This part needs further > consideration on how to handle it. > > ufshcd_abort() > ufshcd_mcq_abort() > ufshcd_try_to_abort_task() // will trigger ISR > ufshcd_release_scsi_cmd() > > ufs_mtk_mcq_intr() > ufshcd_mcq_poll_cqe_lock() > ufshcd_mcq_process_cqe() > ufshcd_compl_one_cqe() > cmd->result = DID_ABORT > ufshcd_release_scsi_cmd() // will release twice > scsi_done() Do you agree that this case can be addressed with the ufshcd_compl_one_cqe() change proposed above? > ---------------------------------------------------------------- > > Case3: DBR ufshcd_err_handler > > In the case of the DBR mode error handler, it's the same; > ufshcd_release_scsi_cmd will also be executed, and scsi_done > will definitely be used to notify SCSI to requeue. > > ufshcd_err_handler() > ufshcd_abort_all() > ufshcd_abort_one() > ufshcd_try_to_abort_task() // It should trigger an > interrupt, but the tensor might not > ufshcd_complete_requests() > ufshcd_transfer_req_compl() > ufshcd_poll() > get outstanding_lock > clear outstanding_reqs tag > release outstanding_lock > __ufshcd_transfer_req_compl() > ufshcd_compl_one_cqe() > cmd->result = DID_REQUEUE // mediatek may need quirk > change DID_ABORT to DID_REQUEUE > ufshcd_release_scsi_cmd() > scsi_done() > > ufshcd_intr() > ufshcd_sl_intr() > ufshcd_transfer_req_compl() > ufshcd_poll() > get outstanding_lock > clear outstanding_reqs tag > release outstanding_lock > __ufshcd_transfer_req_compl() > ufshcd_compl_one_cqe() > cmd->result = DID_REQUEUE // mediatek may need quirk change > DID_ABORT to DID_REQUEUE > ufshcd_release_scsi_cmd() > scsi_done(); > > At this time, the same actions are taken regardless of whether > there is an ISR, and with the protection of outstanding_lock, > only one thread will execute ufshcd_release_scsi_cmd and scsi_done. > ---------------------------------------------------------------- > > Case4: MCQ ufshcd_err_handler > > It's the same with MCQ mode; there is protection from the cqe lock, > so only one thread will execute. What my patch 2 aims to do is to > change DID_ABORT to DID_REQUEUE in this situation. > > ufshcd_err_handler() > ufshcd_abort_all() > ufshcd_abort_one() > ufshcd_try_to_abort_task() // will trigger irq thread > ufshcd_complete_requests() > ufshcd_mcq_compl_pending_transfer() > ufshcd_mcq_poll_cqe_lock() > ufshcd_mcq_process_cqe() > ufshcd_compl_one_cqe() > cmd->result = DID_ABORT // should change to DID_REQUEUE > ufshcd_release_scsi_cmd() > scsi_done() > > ufs_mtk_mcq_intr() > ufshcd_mcq_poll_cqe_lock() > ufshcd_mcq_process_cqe() > ufshcd_compl_one_cqe() > cmd->result = DID_ABORT // should change to DID_REQUEUE > ufshcd_release_scsi_cmd() > scsi_done() For legacy and MCQ mode, I prefer the following behavior for ufshcd_abort_all(): * ufshcd_compl_one_cqe() ignores commands with status OCS_ABORTED. * ufshcd_release_scsi_cmd() is called either by ufshcd_abort_one() or by ufshcd_abort_all(). Do you agree with making the changes proposed above? Thank you, Bart.
On Thu, 2024-09-19 at 11:49 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > On 9/19/24 5:16 AM, Peter Wang (王信友) wrote: > > The four case flows for abort are as follows: > > ---------------------------------------------------------------- > > > > Case1: DBR ufshcd_abort > > Please follow the terminology from the UFSHCI 4.0 standard and use > the > word "legacy" instead of "DBR". > Hi Bart, Okay, but the current code comments all use 'SDB mode'. Should we just stick with that term? > > In this case, you can see that ufshcd_release_scsi_cmd will > > definitely be called. > > > > ufshcd_abort() > > ufshcd_try_to_abort_task()// It should trigger an > > interrupt, but the tensor might not > > get outstanding_lock > > clear outstanding_reqs tag > > ufshcd_release_scsi_cmd() > > release outstanding_lock > > > > ufshcd_intr() > > ufshcd_sl_intr() > > ufshcd_transfer_req_compl() > > ufshcd_poll() > > get outstanding_lock > > clear outstanding_reqs tag > > release outstanding_lock > > __ufshcd_transfer_req_compl() > > ufshcd_compl_one_cqe() > > cmd->result = DID_REQUEUE// mediatek may need quirk > > change DID_ABORT to DID_REQUEUE > > ufshcd_release_scsi_cmd() > > scsi_done(); > > > > In most cases, ufshcd_intr will not reach scsi_done because the > > outstanding_reqs tag is cleared by the original thread. > > Therefore, whether there is an interrupt or not doesn't affect > > the result because the ISR will do nothing in most cases. > > > > In a very low chance, the ISR will reach scsi_done and notify > > SCSI to requeue, and the original thread will not > > call ufshcd_release_scsi_cmd. > > MediaTek may need to change DID_ABORT to DID_REQUEUE in this > > situation, or perhaps not handle this ISR at all. > > Please modify ufshcd_compl_one_cqe() such that it ignores commands > with status OCS_ABORTED. This will make the UFSHCI driver behave in > the same way for all UFSHCI controllers, whether or not clearing a > command triggers a completion interrupt. > Yes, I am considering how to modify the code here. > > ---------------------------------------------------------------- > > > > Case2: MCQ ufshcd_abort > > > > In the case of MCQ ufshcd_abort, you can also see that > > ufshcd_release_scsi_cmd will definitely be called too. > > However, there seems to be a problem here, as > > ufshcd_release_scsi_cmd might be called twice. > > This is because cmd is not null in ufshcd_release_scsi_cmd, > > which the previous version would set cmd to null. > > Skipping OCS: ABORTED in ufshcd_compl_one_cqe indeed > > can avoid this problem. This part needs further > > consideration on how to handle it. > > > > ufshcd_abort() > > ufshcd_mcq_abort() > > ufshcd_try_to_abort_task()// will trigger ISR > > ufshcd_release_scsi_cmd() > > > > ufs_mtk_mcq_intr() > > ufshcd_mcq_poll_cqe_lock() > > ufshcd_mcq_process_cqe() > > ufshcd_compl_one_cqe() > > cmd->result = DID_ABORT > > ufshcd_release_scsi_cmd() // will release twice > > scsi_done() > > Do you agree that this case can be addressed with the > ufshcd_compl_one_cqe() change proposed above? > Agree. > > ---------------------------------------------------------------- > > > > Case3: DBR ufshcd_err_handler > > > > In the case of the DBR mode error handler, it's the same; > > ufshcd_release_scsi_cmd will also be executed, and scsi_done > > will definitely be used to notify SCSI to requeue. > > > > ufshcd_err_handler() > > ufshcd_abort_all() > > ufshcd_abort_one() > > ufshcd_try_to_abort_task()// It should trigger an > > interrupt, but the tensor might not > > ufshcd_complete_requests() > > ufshcd_transfer_req_compl() > > ufshcd_poll() > > get outstanding_lock > > clear outstanding_reqs tag > > release outstanding_lock > > __ufshcd_transfer_req_compl() > > ufshcd_compl_one_cqe() > > cmd->result = DID_REQUEUE // mediatek may need quirk > > change DID_ABORT to DID_REQUEUE > > ufshcd_release_scsi_cmd() > > scsi_done() > > > > ufshcd_intr() > > ufshcd_sl_intr() > > ufshcd_transfer_req_compl() > > ufshcd_poll() > > get outstanding_lock > > clear outstanding_reqs tag > > release outstanding_lock > > __ufshcd_transfer_req_compl() > > ufshcd_compl_one_cqe() > > cmd->result = DID_REQUEUE // mediatek may need quirk > change > > DID_ABORT to DID_REQUEUE > > ufshcd_release_scsi_cmd() > > scsi_done(); > > > > At this time, the same actions are taken regardless of whether > > there is an ISR, and with the protection of outstanding_lock, > > only one thread will execute ufshcd_release_scsi_cmd and scsi_done. > > ---------------------------------------------------------------- > > > > Case4: MCQ ufshcd_err_handler > > > > It's the same with MCQ mode; there is protection from the cqe lock, > > so only one thread will execute. What my patch 2 aims to do is to > > change DID_ABORT to DID_REQUEUE in this situation. > > > > ufshcd_err_handler() > > ufshcd_abort_all() > > ufshcd_abort_one() > > ufshcd_try_to_abort_task()// will trigger irq thread > > ufshcd_complete_requests() > > ufshcd_mcq_compl_pending_transfer() > > ufshcd_mcq_poll_cqe_lock() > > ufshcd_mcq_process_cqe() > > ufshcd_compl_one_cqe() > > cmd->result = DID_ABORT // should change to > DID_REQUEUE > > ufshcd_release_scsi_cmd() > > scsi_done() > > > > ufs_mtk_mcq_intr() > > ufshcd_mcq_poll_cqe_lock() > > ufshcd_mcq_process_cqe() > > ufshcd_compl_one_cqe() > > cmd->result = DID_ABORT // should change to DID_REQUEUE > > ufshcd_release_scsi_cmd() > > scsi_done() > > For legacy and MCQ mode, I prefer the following behavior for > ufshcd_abort_all(): > * ufshcd_compl_one_cqe() ignores commands with status OCS_ABORTED. > * ufshcd_release_scsi_cmd() is called either by ufshcd_abort_one() or > by ufshcd_abort_all(). > > Do you agree with making the changes proposed above? > > Thank you, This might not work, as SDB mode doesn't ignore OCS: INVALID_OCS_VALUE but rather notifies SCSI to requeue. So what we need to correct is to notify SCSI to requeue when MCQ mode receives OCS: ABORTED as well. Furthermore, ufshcd_compl_one_cqe, whether it comes from ufshcd_abort_all or ISR, does the same thing and is protected by a lock. Therefore, there is no need for special handling specifically for ufshcd_abort_all. After discussing with you, I realized that there are indeed many deficiencies and inconsistencies here that need to be addressed. I will submit a new patch for the content discussed above. Thanks. Peter > > Bar
On 9/19/24 7:02 PM, Peter Wang (王信友) wrote: > On Thu, 2024-09-19 at 11:49 -0700, Bart Van Assche wrote: >> For legacy and MCQ mode, I prefer the following behavior for >> ufshcd_abort_all(): >> * ufshcd_compl_one_cqe() ignores commands with status OCS_ABORTED. >> * ufshcd_release_scsi_cmd() is called either by ufshcd_abort_one() or >> by ufshcd_abort_all(). >> >> Do you agree with making the changes proposed above? > > This might not work, as SDB mode doesn't ignore > OCS: INVALID_OCS_VALUE but rather notifies SCSI to requeue. cmd->result should be ignored for aborted commands. Hence, how OCS_INVALID_COMMAND_STATUS is translated by ufshcd_transfer_rsp_status() is not relevant for aborted commands. > So what we need to correct is to notify SCSI to requeue > when MCQ mode receives OCS: ABORTED as well. Unless the host controller violates the UFSHCI specification, the command status is not set for aborted commands in legacy mode. Let's keep the code uniform for legacy mode, MCQ mode, compliant and non- ompliant controllers and not rely on the command status for aborted commands. Thanks, Bart.
On Fri, 2024-09-20 at 11:39 -0700, Bart Van Assche wrote: > > External email : Please do not click links or open attachments until > you have verified the sender or the content. > On 9/19/24 7:02 PM, Peter Wang (王信友) wrote: > > On Thu, 2024-09-19 at 11:49 -0700, Bart Van Assche wrote: > >> For legacy and MCQ mode, I prefer the following behavior for > >> ufshcd_abort_all(): > >> * ufshcd_compl_one_cqe() ignores commands with status OCS_ABORTED. > >> * ufshcd_release_scsi_cmd() is called either by ufshcd_abort_one() > or > >> by ufshcd_abort_all(). > >> > >> Do you agree with making the changes proposed above? > > > > This might not work, as SDB mode doesn't ignore > > OCS: INVALID_OCS_VALUE but rather notifies SCSI to requeue. > > cmd->result should be ignored for aborted commands. Hence, > how OCS_INVALID_COMMAND_STATUS is translated by > ufshcd_transfer_rsp_status() is not relevant for aborted commands. > Hi Bart, Okay, I will not handle it and let it remain as it is. > > So what we need to correct is to notify SCSI to requeue > > when MCQ mode receives OCS: ABORTED as well. > > Unless the host controller violates the UFSHCI specification, the > command status is not set for aborted commands in legacy mode. Let's > keep the code uniform for legacy mode, MCQ mode, compliant and non- > ompliant controllers and not rely on the command status for aborted > commands. > > Thanks, > > Bart. > Okay, but under SDB mode, MediaTek might still need a quirk to keep the behavior of OCS_ABORTED consistent with OCS_INVALID_COMMAND_STATUS. Thanks. Peter
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c index a6f818cdef0e..615da47c1727 100644 --- a/drivers/ufs/core/ufshcd.c +++ b/drivers/ufs/core/ufshcd.c @@ -3006,6 +3006,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd) ufshcd_prepare_lrbp_crypto(scsi_cmd_to_rq(cmd), lrbp); lrbp->req_abort_skip = false; + lrbp->abort_initiated_by_err = false; ufshcd_comp_scsi_upiu(hba, lrbp); @@ -5404,7 +5405,10 @@ ufshcd_transfer_rsp_status(struct ufs_hba *hba, struct ufshcd_lrb *lrbp, } break; case OCS_ABORTED: - result |= DID_ABORT << 16; + if (lrbp->abort_initiated_by_err) + result |= DID_REQUEUE << 16; + else + result |= DID_ABORT << 16; break; case OCS_INVALID_COMMAND_STATUS: result |= DID_REQUEUE << 16; @@ -6471,26 +6475,12 @@ static bool ufshcd_abort_one(struct request *rq, void *priv) struct scsi_device *sdev = cmd->device; struct Scsi_Host *shost = sdev->host; struct ufs_hba *hba = shost_priv(shost); - struct ufshcd_lrb *lrbp = &hba->lrb[tag]; - struct ufs_hw_queue *hwq; - unsigned long flags; *ret = ufshcd_try_to_abort_task(hba, tag); dev_err(hba->dev, "Aborting tag %d / CDB %#02x %s\n", tag, hba->lrb[tag].cmd ? hba->lrb[tag].cmd->cmnd[0] : -1, *ret ? "failed" : "succeeded"); - /* Release cmd in MCQ mode if abort succeeds */ - if (hba->mcq_enabled && (*ret == 0)) { - hwq = ufshcd_mcq_req_to_hwq(hba, scsi_cmd_to_rq(lrbp->cmd)); - if (!hwq) - return 0; - spin_lock_irqsave(&hwq->cq_lock, flags); - if (ufshcd_cmd_inflight(lrbp->cmd)) - ufshcd_release_scsi_cmd(hba, lrbp); - spin_unlock_irqrestore(&hwq->cq_lock, flags); - } - return *ret == 0; } @@ -7561,6 +7551,20 @@ int ufshcd_try_to_abort_task(struct ufs_hba *hba, int tag) goto out; } + /* + * When the host software receives a "FUNCTION COMPLETE", set flag + * to requeue command after receive response with OCS_ABORTED + * SDB mode: UTRLCLR Task Management response which means a Transfer + * Request was aborted. + * MCQ mode: Host will post to CQ with OCS_ABORTED after SQ cleanup + * This flag is set because ufshcd_abort_all forcibly aborts all + * commands, and the host will automatically fill in the OCS field + * of the corresponding response with OCS_ABORTED. + * Therefore, upon receiving this response, it needs to be requeued. + */ + if (!err) + lrbp->abort_initiated_by_err = true; + err = ufshcd_clear_cmd(hba, tag); if (err) dev_err(hba->dev, "%s: Failed clearing cmd at tag %d, err %d\n", diff --git a/include/ufs/ufshcd.h b/include/ufs/ufshcd.h index 0fd2aebac728..15b357672ca5 100644 --- a/include/ufs/ufshcd.h +++ b/include/ufs/ufshcd.h @@ -173,6 +173,8 @@ struct ufs_pm_lvl_states { * @crypto_key_slot: the key slot to use for inline crypto (-1 if none) * @data_unit_num: the data unit number for the first block for inline crypto * @req_abort_skip: skip request abort task flag + * @abort_initiated_by_err: The flag is specifically used to handle aborts + * caused by errors due to host/device communication */ struct ufshcd_lrb { struct utp_transfer_req_desc *utr_descriptor_ptr; @@ -202,6 +204,7 @@ struct ufshcd_lrb { #endif bool req_abort_skip; + bool abort_initiated_by_err; }; /**