mbox series

[v3,0/6] libsas and drivers: NCQ error handling

Message ID 1662476890-15467-1-git-send-email-john.garry@huawei.com
Headers show
Series libsas and drivers: NCQ error handling | expand

Message

John Garry Sept. 6, 2022, 3:08 p.m. UTC
As reported in [0], the pm8001 driver NCQ error handling more or less
duplicates what libata does in link error handling, as follows:
- abort all commands
- do autopsy with read log ext 10 command
- reset the target to recover, if necessary

Indeed for the hisi_sas driver we want to add similar handling for NCQ
errors.

This series add a new libsas API - sas_ata_device_link_abort() - to handle
host NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it.

A difference in the pm8001 driver NCQ error handling is that we send
SATA_ABORT per-task prior to read log ext10, but I feel that this should
not make a difference to the error handling.

Damien kindly tested previous the series for pm8001, but any further pm8001
testing would be appreciated as I have since tweaked pm8001 handling again.
This is because the pm8001 driver hangs on my arm64 machine read log ext10
command.

Finally with these changes we can make the libsas task alloc/free APIs
private, which they should always have been.

Based on v6.0-rc4

[0] https://lore.kernel.org/linux-scsi/8fb3b093-55f0-1fab-81f4-e8519810a978@huawei.com/

Changes since v2:
- Stop sending SATA_ABORT all for pm8001 handling
- Make "reset" optional in sas_ata_device_link_abort()

Changes since v1:
- Rename sas_ata_link_abort() -> sas_ata_device_link_abort()
- Set EH RESET flag in sas_ata_device_link_abort()
- Add Jack's Ack tags
- Rebase

John Garry (5):
  scsi: pm8001: Modify task abort handling for SATA task
  scsi: libsas: Add sas_ata_device_link_abort()
  scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors
  scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
  scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private

Xingui Yang (1):
  scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw

 drivers/scsi/hisi_sas/hisi_sas_main.c  |   5 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c |  22 ++-
 drivers/scsi/libsas/sas_ata.c          |  12 ++
 drivers/scsi/libsas/sas_init.c         |   3 -
 drivers/scsi/libsas/sas_internal.h     |   4 +
 drivers/scsi/pm8001/pm8001_hwi.c       | 186 ++++---------------------
 drivers/scsi/pm8001/pm8001_sas.c       |   8 ++
 drivers/scsi/pm8001/pm8001_sas.h       |   4 -
 drivers/scsi/pm8001/pm80xx_hwi.c       | 177 +++--------------------
 include/scsi/libsas.h                  |   4 -
 include/scsi/sas_ata.h                 |   6 +
 11 files changed, 96 insertions(+), 335 deletions(-)

Comments

Damien Le Moal Sept. 21, 2022, 3:59 a.m. UTC | #1
On 9/7/22 00:08, John Garry wrote:
> As reported in [0], the pm8001 driver NCQ error handling more or less
> duplicates what libata does in link error handling, as follows:
> - abort all commands
> - do autopsy with read log ext 10 command
> - reset the target to recover, if necessary
> 
> Indeed for the hisi_sas driver we want to add similar handling for NCQ
> errors.
> 
> This series add a new libsas API - sas_ata_device_link_abort() - to handle
> host NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it.
> 
> A difference in the pm8001 driver NCQ error handling is that we send
> SATA_ABORT per-task prior to read log ext10, but I feel that this should
> not make a difference to the error handling.
> 
> Damien kindly tested previous the series for pm8001, but any further pm8001
> testing would be appreciated as I have since tweaked pm8001 handling again.
> This is because the pm8001 driver hangs on my arm64 machine read log ext10
> command.

I tested this series on top of the current Linus tree. All look good. As 
ususal, I hammered an SMR drive connected to a pm80xx adapter and 
generated lots of invalid commands to check EH. No issues that I can see.
E.g., an unaligned write error look like this:

[ 5384.194853] pm80xx0:: mpi_sata_event 2685: SATA EVENT 0x23
[ 5384.238720] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[ 5384.246171] sas: sas_scsi_find_task: aborting task 0x0000000050be2a4b
[ 5384.254654] pm80xx0:: mpi_sata_completion 2292: task null, freeing 
CCB tag 2
[ 5384.254676] sas: sas_scsi_find_task: task 0x0000000050be2a4b is aborted
[ 5384.294659] sas: sas_eh_handle_sas_errors: task 0x0000000050be2a4b is 
aborted
[ 5384.318691] ata22.00: exception Emask 0x0 SAct 0x200000 SErr 0x0 
action 0x0
[ 5384.328425] ata22.00: failed command: WRITE FPDMA QUEUED
[ 5384.336277] ata22.00: cmd 61/01:00:01:00:ea/00:00:02:00:00/40 tag 21 
ncq dma 4096 out
[ 5384.336277]          res 43/04:01:01:00:ea/00:00:02:00:00/00 Emask 
0x400 (NCQ error) <F>
[ 5384.357299] ata22.00: status: { DRDY SENSE ERR }
[ 5384.364459] ata22.00: error: { ABRT }
[ 5384.553177] ata22.00: configured for UDMA/133
[ 5384.560343] sd 19:0:3:0: [sdl] tag#80 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_OK cmd_age=0s
[ 5384.572175] sd 19:0:3:0: [sdl] tag#80 Sense Key : Illegal Request 
[current]
[ 5384.581765] sd 19:0:3:0: [sdl] tag#80 Add. Sense: Unaligned write command
[ 5384.591126] sd 19:0:3:0: [sdl] tag#80 CDB: Write(16) 8a 00 00 00 00 
00 02 ea 00 01 00 00 00 01 00 00
[ 5384.602938] I/O error, dev sdl, sector 391118856 op 0x1:(WRITE) flags 
0x8800 phys_seg 1 prio class 2
[ 5384.613854] ata22: EH complete
[ 5384.618570] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 
tries: 1

So looks good to me.

Feel free to add:

Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

> 
> Finally with these changes we can make the libsas task alloc/free APIs
> private, which they should always have been.
> 
> Based on v6.0-rc4
> 
> [0] https://lore.kernel.org/linux-scsi/8fb3b093-55f0-1fab-81f4-e8519810a978@huawei.com/
> 
> Changes since v2:
> - Stop sending SATA_ABORT all for pm8001 handling
> - Make "reset" optional in sas_ata_device_link_abort()
> 
> Changes since v1:
> - Rename sas_ata_link_abort() -> sas_ata_device_link_abort()
> - Set EH RESET flag in sas_ata_device_link_abort()
> - Add Jack's Ack tags
> - Rebase
> 
> John Garry (5):
>    scsi: pm8001: Modify task abort handling for SATA task
>    scsi: libsas: Add sas_ata_device_link_abort()
>    scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors
>    scsi: hisi_sas: Don't issue ATA softreset in hisi_sas_abort_task()
>    scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
> 
> Xingui Yang (1):
>    scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
> 
>   drivers/scsi/hisi_sas/hisi_sas_main.c  |   5 +-
>   drivers/scsi/hisi_sas/hisi_sas_v3_hw.c |  22 ++-
>   drivers/scsi/libsas/sas_ata.c          |  12 ++
>   drivers/scsi/libsas/sas_init.c         |   3 -
>   drivers/scsi/libsas/sas_internal.h     |   4 +
>   drivers/scsi/pm8001/pm8001_hwi.c       | 186 ++++---------------------
>   drivers/scsi/pm8001/pm8001_sas.c       |   8 ++
>   drivers/scsi/pm8001/pm8001_sas.h       |   4 -
>   drivers/scsi/pm8001/pm80xx_hwi.c       | 177 +++--------------------
>   include/scsi/libsas.h                  |   4 -
>   include/scsi/sas_ata.h                 |   6 +
>   11 files changed, 96 insertions(+), 335 deletions(-)
>
John Garry Sept. 21, 2022, 8:09 a.m. UTC | #2
On 21/09/2022 04:59, Damien Le Moal wrote:
> I tested this series on top of the current Linus tree. All look good. As 
> ususal, I hammered an SMR drive connected to a pm80xx adapter and 
> generated lots of invalid commands to check EH. No issues that I can see.
> E.g., an unaligned write error look like this:
> 
> [ 5384.194853] pm80xx0:: mpi_sata_event 2685: SATA EVENT 0x23
> [ 5384.238720] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
> [ 5384.246171] sas: sas_scsi_find_task: aborting task 0x0000000050be2a4b
> [ 5384.254654] pm80xx0:: mpi_sata_completion 2292: task null, freeing 
> CCB tag 2
> [ 5384.254676] sas: sas_scsi_find_task: task 0x0000000050be2a4b is aborted
> [ 5384.294659] sas: sas_eh_handle_sas_errors: task 0x0000000050be2a4b is 
> aborted
> [ 5384.318691] ata22.00: exception Emask 0x0 SAct 0x200000 SErr 0x0 
> action 0x0
> [ 5384.328425] ata22.00: failed command: WRITE FPDMA QUEUED
> [ 5384.336277] ata22.00: cmd 61/01:00:01:00:ea/00:00:02:00:00/40 tag 21 
> ncq dma 4096 out
> [ 5384.336277]          res 43/04:01:01:00:ea/00:00:02:00:00/00 Emask 
> 0x400 (NCQ error) <F>
> [ 5384.357299] ata22.00: status: { DRDY SENSE ERR }
> [ 5384.364459] ata22.00: error: { ABRT }
> [ 5384.553177] ata22.00: configured for UDMA/133
> [ 5384.560343] sd 19:0:3:0: [sdl] tag#80 FAILED Result: hostbyte=DID_OK 
> driverbyte=DRIVER_OK cmd_age=0s
> [ 5384.572175] sd 19:0:3:0: [sdl] tag#80 Sense Key : Illegal Request 
> [current]
> [ 5384.581765] sd 19:0:3:0: [sdl] tag#80 Add. Sense: Unaligned write 
> command
> [ 5384.591126] sd 19:0:3:0: [sdl] tag#80 CDB: Write(16) 8a 00 00 00 00 
> 00 02 ea 00 01 00 00 00 01 00 00
> [ 5384.602938] I/O error, dev sdl, sector 391118856 op 0x1:(WRITE) flags 
> 0x8800 phys_seg 1 prio class 2
> [ 5384.613854] ata22: EH complete
> [ 5384.618570] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 
> tries: 1
> 
> So looks good to me.
> 
> Feel free to add:
> 
> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Great, thanks very much. I'll send an updated version based on mkp-scsi 
6.1 queue today.

John