[0/4] scsi: remove devices in ALUA transitioning status

Message ID 20200930080256.90964-1-hare@suse.de

Message

Hannes Reinecke Sept. 30, 2020, 8:02 a.m. UTC
Hi all,

During testing we found an issue with dev_loss_tmo and devices in ALUA
transitioning state.
I/O to these devices gets requeued via BLK_STS_RESOURCE, so when
dev_loss_tmo triggers, the SCSI core cannot flush the request list
because the I/O is simply requeued.

When the driver then tries to re-establish the device, it waits for the
last reference to drop before re-attaching the device. But as I/O is
still outstanding on the (old) device, it waits forever.

Fix this by returning 'BLK_STS_AGAIN' from scsi_dh_alua when the device
is in ALUA transitioning, and also set the 'transitioning' state when
scsi_dh_alua is receiving a sense code, and not only after scsi_dh_alua
successfully received the response to a REPORT TARGET PORT GROUPS
command.
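
The difference between the two status codes can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code from the patches: every name below (`sketch_dev`, `sketch_check_sense`, `sketch_prep`, `sketch_flush`, the `STS_*` values) is hypothetical, and the model assumes that a request answered with the RESOURCE status goes back on the queue while one answered with AGAIN is completed with an error and releases its reference. The sense triple used (NOT READY, ASC 0x04, ASCQ 0x0a) is the standard "asymmetric access state transition" condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the block-layer status codes named above. */
enum sketch_status { STS_OK, STS_RESOURCE, STS_AGAIN };
enum sketch_alua_state { ALUA_ACTIVE, ALUA_TRANSITIONING };

struct sketch_dev {
    enum sketch_alua_state state;
    int outstanding;            /* requests still holding a device reference */
};

#define SENSE_NOT_READY 0x02    /* SCSI sense key NOT READY */

/* Mark the device as transitioning as soon as the matching sense data
 * arrives, without waiting for a REPORT TARGET PORT GROUPS response. */
static void sketch_check_sense(struct sketch_dev *dev, unsigned char key,
                               unsigned char asc, unsigned char ascq)
{
    if (key == SENSE_NOT_READY && asc == 0x04 && ascq == 0x0a)
        dev->state = ALUA_TRANSITIONING;
}

/* Decide what to do with one request. fail_fast selects the behaviour
 * proposed in this series (AGAIN) over the old one (RESOURCE). */
static enum sketch_status sketch_prep(const struct sketch_dev *dev,
                                      bool fail_fast)
{
    if (dev->state != ALUA_TRANSITIONING)
        return STS_OK;
    return fail_fast ? STS_AGAIN : STS_RESOURCE;
}

/* Try to drain the queue. Returns the number of passes needed, or -1 if
 * requests are still outstanding after max_passes. */
static int sketch_flush(struct sketch_dev *dev, bool fail_fast, int max_passes)
{
    for (int pass = 1; pass <= max_passes; pass++) {
        int n = dev->outstanding;

        for (int i = 0; i < n; i++) {
            /* RESOURCE: the request is requeued and keeps its reference.
             * Anything else: the request completes (successfully or with
             * an error) and drops its reference. */
            if (sketch_prep(dev, fail_fast) != STS_RESOURCE)
                dev->outstanding--;
        }
        assert(dev->outstanding >= 0);
        if (dev->outstanding == 0)
            return pass;
    }
    return -1;
}
```

In this model, a device with three outstanding requests that has seen the transitioning sense data never drains with `fail_fast` set to false (`sketch_flush` returns -1 for any pass limit), which corresponds to the hang described above. With `fail_fast` set to true the queue drains in a single pass, so the last reference can drop.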

Hannes Reinecke (4):
  block: return status code in blk_mq_end_request()
  scsi_dh_alua: return BLK_STS_AGAIN for ALUA transitioning state
  scsi_dh_alua: set 'transitioning' state on unit attention
  scsi: return BLK_STS_AGAIN for ALUA transitioning

 block/blk-mq.c                             |  2 +-
 drivers/scsi/device_handler/scsi_dh_alua.c | 10 +++++++++-
 drivers/scsi/scsi_lib.c                    |  8 ++++++++
 3 files changed, 18 insertions(+), 2 deletions(-)

Comments

Martin K. Petersen Oct. 27, 2020, 12:21 a.m. UTC | #1
Hannes,

> during testing we found that there is an issue with dev_loss_tmo and
> devices in ALUA transitioning state.  What happens is that I/O gets
> requeued via BLK_STS_RESOURCE for these devices, so when dev_loss_tmo
> triggers the SCSI core cannot flush the request list as I/O is simply
> requeued.
>
> So when the driver is trying to re-establish the device it'll wait for
> that last reference to drop in order to re-attach the device, but as
> I/O is still outstanding on the (old) device it'll wait for ever.
>
> Fix this by returning 'BLK_STS_AGAIN' from scsi_dh_alua when the
> device is in ALUA transitioning, and also set the 'transitioning'
> state when scsi_dh_alua is receiving a sense code, and not only after
> scsi_dh_alua successfully received the response to a REPORT TARGET
> PORT GROUPS command.

It would be good to get this revived/reviewed.

Thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

Ewan Milne Nov. 7, 2020, 12:02 a.m. UTC | #2
On Wed, 2020-09-30 at 10:02 +0200, Hannes Reinecke wrote:
> Hi all,
>
> during testing we found that there is an issue with dev_loss_tmo and
> devices in ALUA transitioning state.
> What happens is that I/O gets requeued via BLK_STS_RESOURCE for these
> devices, so when dev_loss_tmo triggers the SCSI core cannot flush the
> request list as I/O is simply requeued.
>
> So when the driver is trying to re-establish the device it'll wait for
> that last reference to drop in order to re-attach the device, but as I/O
> is still outstanding on the (old) device it'll wait for ever.
>
> Fix this by returning 'BLK_STS_AGAIN' from scsi_dh_alua when the device
> is in ALUA transitioning, and also set the 'transitioning' state when
> scsi_dh_alua is receiving a sense code, and not only after scsi_dh_alua
> successfully received the response to a REPORT TARGET PORT GROUPS
> command.
>
> Hannes Reinecke (4):
>   block: return status code in blk_mq_end_request()
>   scsi_dh_alua: return BLK_STS_AGAIN for ALUA transitioning state
>   scsi_dh_alua: set 'transitioning' state on unit attention
>   scsi: return BLK_STS_AGAIN for ALUA transitioning
>
>  block/blk-mq.c                             |  2 +-
>  drivers/scsi/device_handler/scsi_dh_alua.c | 10 +++++++++-
>  drivers/scsi/scsi_lib.c                    |  8 ++++++++
>  3 files changed, 18 insertions(+), 2 deletions(-)

We had a report of I/O hangs during storage controller resets,
and analysis of the kernel state showed the sdev in ALUA transitioning
state.

The patch set fixes the ALUA transitioning issue; it looks good.
There was a reproducible test case.

Reviewed-by: Ewan D. Milne <emilne@redhat.com>


-Ewan

Martin K. Petersen Nov. 11, 2020, 3:58 a.m. UTC | #3
Ewan, Hannes,

>> during testing we found that there is an issue with dev_loss_tmo and
>> devices in ALUA transitioning state.  What happens is that I/O gets
>> requeued via BLK_STS_RESOURCE for these devices, so when dev_loss_tmo
>> triggers the SCSI core cannot flush the request list as I/O is simply
>> requeued.

Applied to 5.11/scsi-staging, thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering