[SCSI] mpt3sas: Fix secure erase premature termination.

Message ID 2179ecb8-183f-a500-4d65-10f64f0f43cc@suse.de
State New
Headers show

Commit Message

Hannes Reinecke Oct. 30, 2016, 6:43 p.m.
On 10/30/2016 01:43 PM, Andrey Grodzovsky wrote:
> Problem:

> This is a work around for a bug with LSI Fusion MPT SAS2 when

> pefroming secure erase. Due to the very long time the operation

> takes commands issued during the erase will time out and will trigger

> execution of abort hook. Even though the abort hook is called for

> the specifc command which timed out this leads to entire device halt

> (scsi_state terminated) and premature termination of the secured erase.

>

Actually, it is _not_ the erase command which times out, it's the 
successive commands which time out, as the controller is unable to 
process them while erase is running.
I suspect a bug in the SAT-layer from the mpt3sas firmware, which simply 
does not return 'busy' for additional commands when erase is in progress.
That being said, this issue was obscured prior to implementing 
asynchronous aborts, as originally a timeout would be invoking SCSI EH, 
which would wait for all outstanding commands to complete.
So by the time SCSI EH was invoked the erase command was already 
completed, allowing for a successful retry of the failing command.
With asynchronous aborts we don't have this option, as the abort will 
succeed, but the command cannot be retried as the original erase command 
is still running.

In the light of the above I guess we need something like the attached 
patch. I'm not utterly proud of if, but I guess it's the best we can do 
for the moment.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

Comments

Hannes Reinecke Nov. 2, 2016, 2:07 a.m. | #1
On 11/02/2016 01:09 AM, Andrey Grodzovsky wrote:
> Problem:

> This is a work around for a bug with LSI Fusion MPT SAS2 when

> pefroming secure erase. Due to the very long time the operation

> takes commands issued during the erase will time out and will trigger

> execution of abort hook. Even though the abort hook is called for

> the specifc command which timed out this leads to entire device halt

> (scsi_state terminated) and premature termination of the secured erase.

>

> Fix:

> Set device state to busy while erase in progress to reject any incoming

> commands until the erase is done. The device is blocked any way during

> this time and cannot execute any other command.

> More data and logs can be found here -

> https://drive.google.com/file/d/0B9ocOHYHbbS1Q3VMdkkzeWFkTjg/view

>

> v2: Update according to example patch by Hannes Reinecke to apply

> the blocking logic to any ATA 12/16 command.

>

> Signed-off-by: Andrey Grodzovsky <andrey2805@gmail.com>

> Cc: <linux-scsi@vger.kernel.org>

> Cc: Sathya Prakash <sathya.prakash@broadcom.com>

> Cc: Chaitra P B <chaitra.basappa@broadcom.com>

> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>

> Cc: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com>

> Cc: Hannes Reinecke <hare@suse.de>

> Cc: <stable@vger.kernel.org>

> ---

>  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 26 ++++++++++++++++++++++++++

>  1 file changed, 26 insertions(+)

>

> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c

> index 5a97e32..43ab0cc 100644

> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c

> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c

> @@ -3500,6 +3500,20 @@ _scsih_eedp_error_handling(struct scsi_cmnd *scmd, u16 ioc_status)

>  	    SAM_STAT_CHECK_CONDITION;

>  }

>

> +/**

> + * This is a work around for a bug with LSI Fusion MPT SAS2 when

> + * pefroming secure erase. Due to the verly long time the operation

> + * takes commands issued during the erase will time out and will trigger

> + * execution of abort hook. This leads to device reset and premature

> + * termination of the secured erase.

> + *

> + */

> +static inline bool ata_12_16_cmd(struct scsi_cmnd *scmd)

> +{

> +   return (scmd->cmnd[0] == 0xa1 || scmd->cmnd[0] == 0x85);

> +}

> +

> +

>

>  /**

>   * _scsih_qcmd - main scsi request entry point

> @@ -3528,6 +3542,14 @@ _scsih_qcmd(struct Scsi_Host *shost, struct scsi_cmnd *scmd)

>  		scsi_print_command(scmd);

>  #endif

>

> +   /**

> +	* Lock the device for any subsequent command until

> +	* command is done.

> +	*/

> +	if (ata_12_16_cmd(scmd))

> +		scsi_internal_device_block(scmd->device);

> +

> +

>  	sas_device_priv_data = scmd->device->hostdata;

>  	if (!sas_device_priv_data || !sas_device_priv_data->sas_target) {

>  		scmd->result = DID_NO_CONNECT << 16;

> @@ -4062,6 +4084,10 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)

>  	if (scmd == NULL)

>  		return 1;

>

> +	if (ata_12_16_cmd(scmd))

> +		scsi_internal_device_unblock(scmd->device, SDEV_RUNNING);

> +

> +

>  	mpi_request = mpt3sas_base_get_msg_frame(ioc, smid);

>

>  	if (mpi_reply == NULL) {

>

Yeah, it's ugly, but I can't think of a better solution for the moment.
Thanks for debugging this.

Reviewed-by: Hannes Reinecke <hare@suse.com>


Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin K. Petersen Nov. 4, 2016, 4:35 p.m. | #2
>>>>> "Hannes" == Hannes Reinecke <hare@suse.de> writes:


Hannes> Checking with SAT-3 (section 6.2.4: Commands the SATL queues
Hannes> internally) the implemented behaviour is standards conformant,
Hannes> although the standard also allows for returning 'TASK SET FULL'
Hannes> or 'BUSY' in these cases.  Doing so would nicely solve this
Hannes> issue.

I agree with Hannes that it would be appropriate for the SATL to report
busy when it makes an non-queued command queueable.

-- 
Martin K. Petersen	Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Reinecke April 24, 2018, 12:33 p.m. | #3
On 04/24/2018 11:09 AM, Steffen Maier wrote:
> 

> On 11/04/2016 05:35 PM, Martin K. Petersen wrote:

>>>>>>> "Hannes" == Hannes Reinecke <hare@suse.de> writes:

>>

>> Hannes> Checking with SAT-3 (section 6.2.4: Commands the SATL queues

>> Hannes> internally) the implemented behaviour is standards conformant,

>> Hannes> although the standard also allows for returning 'TASK SET FULL'

>> Hannes> or 'BUSY' in these cases.  Doing so would nicely solve this

>> Hannes> issue.

>>

>> I agree with Hannes that it would be appropriate for the SATL to report

>> busy when it makes an non-queued command queueable.

> 

> Wouldn't this potentially still cause problems if the secure erase takes 

> longer than max_retries * scmd_tmo. I.e. the command timing out by 

> default after 180 seconds as in 

> https://www.spinics.net/lists/linux-block/msg24837.html ?

> 

> The fix approach here seems to also handle this gracefully.

> 

Well, yes, of course the command will be terminated after it timed out.
But typically secure erase is invoked from userspace via sg ioctls, and 
it's in the responsibility of the application to set the correct timeout.

Cheers,

Hannes

Patch

From 1556746987c3b4c1a1a4705625280b1136554f89 Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Sun, 30 Oct 2016 14:24:44 +0100
Subject: [PATCH] mpt3sas: hack: disable concurrent commands for ATA_16/ATA_12

There's a bug in the mpt3sas driver/firmware which would not return
BUSY if it's busy processing requests (eg 'erase') and cannot
respond to other commands. Hence these commands will timeout
and eventually start the error handler.
This patch disallows request processing whenever an ATA_12 or
ATA_16 command is received, thereby avoiding this problem.

Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 97987e7..18b9f09 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -4096,6 +4096,13 @@  scsih_qcmd(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
 	    sas_device_priv_data->block)
 		return SCSI_MLQUEUE_DEVICE_BUSY;
 
+	/*
+	 * Hack: block the device for any ATA_12/ATA_16 command
+	 */
+	if (scmd->cmnd[0] == 0xa1 || scmd->cmnd[0] == 0x85) {
+		sas_device_priv_data = scmd->device->hostdata;
+		_scsih_internal_device_block(scmd->device, sas_device_priv_data);
+	}
 	if (scmd->sc_data_direction == DMA_FROM_DEVICE)
 		mpi_control = MPI2_SCSIIO_CONTROL_READ;
 	else if (scmd->sc_data_direction == DMA_TO_DEVICE)
@@ -4835,6 +4842,10 @@  _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
 
  out:
 
+	if (scmd->cmnd[0] == 0xa1 || scmd->cmnd[0] == 0x85) {
+		sas_device_priv_data = scmd->device->hostdata;
+		_scsih_internal_device_unblock(scmd->device, sas_device_priv_data);
+	}
 	scsi_dma_unmap(scmd);
 
 	scmd->scsi_done(scmd);
-- 
2.6.6