mbox series

[0/2] Improve ATA NCQ command error in mpt3sas and mpi3mr

Message ID 20250606052747.742998-1-dlemoal@kernel.org
Headers show
Series Improve ATA NCQ command error in mpt3sas and mpi3mr | expand

Message

Damien Le Moal June 6, 2025, 5:27 a.m. UTC
Martin,

Two similar patches for the mpt3sas and mpi3mr drivers to improve the
handling of NCQ command terminated due to an NCQ command failure. These
so-called collateral aborts must be retried immediately but that must be
done without incrementing the command retry counter. Otherwise, these
collateral abort commands may endup being failed due to other NCQ
command errors.

This issue is especially easy to trigger with the mpi3mr driver with a
drive subject to a mixed workload of commands with a short CDL limit and
commands without limits. The failures due to the limit being exceeded,
which are normal, endup also failing commands without a limit, which is
incorrect.

Broadcom people,

I am working in the dark here, with zero information on how your HBA
handle ATA NCQ collateral aborts. I am patching against what I am
seeing, which may be only a partial picture of the problem. So please
check this !

Damien Le Moal (2):
  scsi: mpi3mr: Correctly handle ATA device errors
  scsi: mpt3sas: Correctly handle ATA device errors

 drivers/scsi/mpi3mr/mpi3mr_os.c      | 20 +++++++++++++++++++-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 19 +++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig June 9, 2025, 5:50 a.m. UTC | #1
Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
I think promted this.

Yafang, can you check if this makes the writeback errors you're seeing
go away?
Yafang Shao June 9, 2025, 7:09 a.m. UTC | #2
On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> I think promted this.

Thank you for the information and for addressing this so quickly!

>
> Yafang, can you check if this makes the writeback errors you're seeing
> go away?

I’m happy to test the fix and will share the results as soon as I have them.
Damien Le Moal June 9, 2025, 7:17 a.m. UTC | #3
On 6/9/25 16:09, Yafang Shao wrote:
> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
>> I think promted this.

Note that cdl-tools test suite has many test cases that do not pass without the
mpi3mr patch. CDL makes it easy to trigger the issue.

> 
> Thank you for the information and for addressing this so quickly!
> 
>>
>> Yafang, can you check if this makes the writeback errors you're seeing
>> go away?
> 
> I’m happy to test the fix and will share the results as soon as I have them.

Thanks. And my apologies for forgetting to CC you on these patches.
Yafang Shao June 11, 2025, 3:27 a.m. UTC | #4
On Mon, Jun 9, 2025 at 3:18 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 6/9/25 16:09, Yafang Shao wrote:
> > On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> >> I think promted this.
>
> Note that cdl-tools test suite has many test cases that do not pass without the
> mpi3mr patch. CDL makes it easy to trigger the issue.
>
> >
> > Thank you for the information and for addressing this so quickly!
> >
> >>
> >> Yafang, can you check if this makes the writeback errors you're seeing
> >> go away?
> >
> > I’m happy to test the fix and will share the results as soon as I have them.
>
> Thanks. And my apologies for forgetting to CC you on these patches.

We have developed and deployed a hotfix to hundreds of our production
servers equipped with this drive, and it has been functioning as
expected so far. We are currently rolling it out to the remaining
production servers, though the process may take some time to complete.

Additionally, we encountered a writeback error in the libata driver,
which appears to be related to DID_TIME_OUT. Do you have any insights
into this issue?

[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 CDB: Write(16) 8a
00 00 00 00 08 2c eb 4a 58 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113355864 op
0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 CDB: Write(16) 8a
00 00 00 00 08 2c eb 45 18 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113354520 op
0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 CDB: Write(16) 8a
00 00 00 00 08 2c eb 42 58 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113353816 op
0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 CDB: Read(16) 88
00 00 00 00 03 d8 33 44 18 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512140312 op
0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 CDB: Read(16) 88
00 00 00 00 03 d8 33 3e d8 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138968 op
0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 CDB: Read(16) 88
00 00 00 00 03 d8 33 3c 18 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138264 op
0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 CDB: Write(16) 8a
00 00 00 00 08 2c eb 3d 18 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113352472 op
0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 CDB: Read(16) 88
00 00 00 00 03 2c e9 4e b8 00 00 00 98 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638389432 op
0x0:(READ) flags 0x80700 phys_seg 19 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 CDB: Read(16) 88
00 00 00 00 03 2c e9 4c b8 00 00 02 00 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638388920 op
0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 FAILED Result:
hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 CDB: Read(16) 88
00 00 00 00 03 d8 33 36 d8 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512136920 op
0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] XFS (sdl): metadata I/O error in
"xfs_imap_to_bp+0x4f/0x70 [xfs]" at daddr 0x4804c4398 len 32 error 5
[Tue Jun 10 13:27:48 2025] sdl: writeback error on inode 32213499599,
offset 0, sector 35113332208
Damien Le Moal June 11, 2025, 3:57 a.m. UTC | #5
On 6/11/25 12:27 PM, Yafang Shao wrote:
> On Mon, Jun 9, 2025 at 3:18 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>>
>> On 6/9/25 16:09, Yafang Shao wrote:
>>> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>>
>>>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
>>>> I think promted this.
>>
>> Note that cdl-tools test suite has many test cases that do not pass without the
>> mpi3mr patch. CDL makes it easy to trigger the issue.
>>
>>>
>>> Thank you for the information and for addressing this so quickly!
>>>
>>>>
>>>> Yafang, can you check if this makes the writeback errors you're seeing
>>>> go away?
>>>
>>> I’m happy to test the fix and will share the results as soon as I have them.
>>
>> Thanks. And my apologies for forgetting to CC you on these patches.
> 
> We have developed and deployed a hotfix to hundreds of our production
> servers equipped with this drive, and it has been functioning as
> expected so far. We are currently rolling it out to the remaining
> production servers, though the process may take some time to complete.
> 
> Additionally, we encountered a writeback error in the libata driver,
> which appears to be related to DID_TIME_OUT. Do you have any insights
> into this issue?
> 
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s

No idea. Timeout is hard to debug... A bus trace would help. It could be that
the drives are really bad and taking too long per command, thus causing time
out for commands that are at the end of the queue. E.g., if you hit many bad
sectors with read commands, the drive internal retry mechanism may cause such
read command to take several seconds. So if you are running the drives at high
queue depth, this can easily cause timeouts for the commands that are at the
end of the queue.

> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 CDB: Write(16) 8a
> 00 00 00 00 08 2c eb 4a 58 00 00 02 c0 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113355864 op
> 0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 CDB: Write(16) 8a
> 00 00 00 00 08 2c eb 45 18 00 00 05 40 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113354520 op
> 0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 CDB: Write(16) 8a
> 00 00 00 00 08 2c eb 42 58 00 00 02 c0 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113353816 op
> 0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 CDB: Read(16) 88
> 00 00 00 00 03 d8 33 44 18 00 00 02 c0 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512140312 op
> 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 CDB: Read(16) 88
> 00 00 00 00 03 d8 33 3e d8 00 00 05 40 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138968 op
> 0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 CDB: Read(16) 88
> 00 00 00 00 03 d8 33 3c 18 00 00 02 c0 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138264 op
> 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 CDB: Write(16) 8a
> 00 00 00 00 08 2c eb 3d 18 00 00 05 40 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113352472 op
> 0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 CDB: Read(16) 88
> 00 00 00 00 03 2c e9 4e b8 00 00 00 98 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638389432 op
> 0x0:(READ) flags 0x80700 phys_seg 19 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 CDB: Read(16) 88
> 00 00 00 00 03 2c e9 4c b8 00 00 02 00 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638388920 op
> 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 CDB: Read(16) 88
> 00 00 00 00 03 d8 33 36 d8 00 00 05 40 00 00
> [Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512136920 op
> 0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
> [Tue Jun 10 13:27:47 2025] XFS (sdl): metadata I/O error in
> "xfs_imap_to_bp+0x4f/0x70 [xfs]" at daddr 0x4804c4398 len 32 error 5
> [Tue Jun 10 13:27:48 2025] sdl: writeback error on inode 32213499599,
> offset 0, sector 35113332208
>
Yafang Shao June 11, 2025, 5:42 a.m. UTC | #6
On Wed, Jun 11, 2025 at 11:59 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 6/11/25 12:27 PM, Yafang Shao wrote:
> > On Mon, Jun 9, 2025 at 3:18 PM Damien Le Moal <dlemoal@kernel.org> wrote:
> >>
> >> On 6/9/25 16:09, Yafang Shao wrote:
> >>> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>>>
> >>>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> >>>> I think promted this.
> >>
> >> Note that cdl-tools test suite has many test cases that do not pass without the
> >> mpi3mr patch. CDL makes it easy to trigger the issue.
> >>
> >>>
> >>> Thank you for the information and for addressing this so quickly!
> >>>
> >>>>
> >>>> Yafang, can you check if this makes the writeback errors you're seeing
> >>>> go away?
> >>>
> >>> I’m happy to test the fix and will share the results as soon as I have them.
> >>
> >> Thanks. And my apologies for forgetting to CC you on these patches.
> >
> > We have developed and deployed a hotfix to hundreds of our production
> > servers equipped with this drive, and it has been functioning as
> > expected so far. We are currently rolling it out to the remaining
> > production servers, though the process may take some time to complete.
> >
> > Additionally, we encountered a writeback error in the libata driver,
> > which appears to be related to DID_TIME_OUT. Do you have any insights
> > into this issue?
> >
> > [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 FAILED Result:
> > hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
>
> No idea. Timeout is hard to debug... A bus trace would help. It could be that
> the drives are really bad and taking too long per command, thus causing time
> out for commands that are at the end of the queue. E.g., if you hit many bad
> sectors with read commands, the drive internal retry mechanism may cause such
> read command to take several seconds. So if you are running the drives at high
> queue depth, this can easily cause timeouts for the commands that are at the
> end of the queue.

Thank you for the detailed explanation. We will investigate this further.
Yafang Shao June 16, 2025, 2:13 a.m. UTC | #7
On Mon, Jun 9, 2025 at 3:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> > I think promted this.
>
> Thank you for the information and for addressing this so quickly!
>
> >
> > Yafang, can you check if this makes the writeback errors you're seeing
> > go away?
>
> I’m happy to test the fix and will share the results as soon as I have them.

We’ve confirmed that the DID_SOFT_ERROR issue no longer occurs after
applying this patch series to our production servers. Since our
production servers only use mpt3sas drives, we can verify the fix
specifically for this driver:

Tested-by: Yafang Shao <laoar.shao@gmail.com>


Hello Damien,

We’ve encountered another instance of DID_SOFT_ERROR triggered by a
reset on an mpt3sas drive. Do you have any insights into what might be
causing this failure? Below are the error details:

[Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
originator(PL), code(0x11), sub_code(0x0610)
[Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
originator(PL), code(0x11), sub_code(0x0610)
[Thu Jun 12 17:57:35 2025] sd 8:0:9:0: Power-on or device reset occurred
[Thu Jun 12 20:07:53 2025] mpt3sas_cm0: In func: _ctl_do_mpt_command
[Thu Jun 12 20:07:53 2025] mpt3sas_cm0: Command Timeout
[Thu Jun 12 20:07:53 2025] mf:

[Thu Jun 12 20:07:53 2025] 00000013
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 9fcda615
[Thu Jun 12 20:07:53 2025] 00fc0000
[Thu Jun 12 20:07:53 2025] 00000018
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 000000ff
[Thu Jun 12 20:07:53 2025]

[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000006
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 02000000
[Thu Jun 12 20:07:53 2025]

[Thu Jun 12 20:07:53 2025] 00000012
[Thu Jun 12 20:07:53 2025] 000000ff
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000

[Thu Jun 12 20:07:53 2025] mpt3sas_cm0: issue target reset: handle = (0x0013)
[Thu Jun 12 20:07:56 2025] scsi_io_completion_action: 22 callbacks suppressed
[Thu Jun 12 20:07:56 2025] blk_print_req_error: 22 callbacks suppressed
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 190811280 op
0x0:(READ) flags 0x80700 phys_seg 28 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=16s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 CDB: Read(16) 88
00 00 00 00 05 26 e3 68 40 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127274048 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 CDB: Read(16) 88
00 00 00 00 03 0d 08 dc 38 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103586360 op
0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 CDB: Read(16) 88
00 00 00 00 01 50 ee 50 48 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652762696 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 CDB: Read(16) 88
00 00 00 00 05 26 e3 64 10 00 00 00 20 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127272976 op
0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 CDB: Read(16) 88
00 00 00 00 05 26 e3 64 30 00 00 04 10 00 00
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 CDB: Read(16) 88
00 00 00 00 03 0d 08 d8 38 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127273008 op
0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103585336 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 CDB: Read(16) 88
00 00 00 00 01 50 ee 58 48 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652764744 op
0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 CDB: Read(16) 88
00 00 00 00 05 26 e3 78 10 00 00 02 c0 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127278096 op
0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 FAILED Result:
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 CDB: Read(16) 88
00 00 00 00 03 80 11 3e c8 00 00 00 20 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 15033515720 op
0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
[Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
originator(PL), code(0x14), sub_code(0x0000)
[Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
originator(PL), code(0x14), sub_code(0x0000)
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 CDB: Read(16) 88
00 00 00 00 01 50 ee 96 50 00 00 05 18 00 00
[Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
originator(PL), code(0x14), sub_code(0x0000)
[Thu Jun 12 20:07:57 2025] XFS (sdi): metadata I/O error in
"xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x484022c68 len 8 error 5
[Thu Jun 12 20:07:59 2025] sd 8:0:9:0: Power-on or device reset occurred
[Thu Jun 12 20:07:59 2025] sdi: writeback error on inode 12885147175,
offset 1285156864, sector 13979483112
Damien Le Moal June 16, 2025, 2:28 a.m. UTC | #8
On 6/16/25 11:13, Yafang Shao wrote:
> On Mon, Jun 9, 2025 at 3:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>
>>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
>>> I think promted this.
>>
>> Thank you for the information and for addressing this so quickly!
>>
>>>
>>> Yafang, can you check if this makes the writeback errors you're seeing
>>> go away?
>>
>> I’m happy to test the fix and will share the results as soon as I have them.
> 
> We’ve confirmed that the DID_SOFT_ERROR issue no longer occurs after
> applying this patch series to our production servers. Since our
> production servers only use mpt3sas drives, we can verify the fix
> specifically for this driver:
> 
> Tested-by: Yafang Shao <laoar.shao@gmail.com>

Thansk. I tested with the mpi3mr driver.


> We’ve encountered another instance of DID_SOFT_ERROR triggered by a
> reset on an mpt3sas drive. Do you have any insights into what might be
> causing this failure? Below are the error details:
> 
> [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> originator(PL), code(0x11), sub_code(0x0610)

This decodes to:

Value     	31110610h
Type:     	30000000h	SAS
Origin:   	01000000h	PL
Code:     	00110000h	PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code: 	00000600h	PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET

So it looks like a non-NCQ command failed. What were you doing when this happened ?

> [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> originator(PL), code(0x11), sub_code(0x0610)
> [Thu Jun 12 17:57:35 2025] sd 8:0:9:0: Power-on or device reset occurred

And this command failure is trigerring a device reset (the HBA may be doing
that, or the drive did not like what you were doing and reset).

> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: In func: _ctl_do_mpt_command
> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: Command Timeout

This looks like an ioctl() to the adapter diver, and it never got its reply
apparently. This is 3 hours after the above power-on-reset, so these 2 events
are likely not related...

> [Thu Jun 12 20:07:53 2025] mf:

This is dumping the mpi_request bytes. Maybe you can try to decode that to get
hints as to what action triggered this.

I would love to get feedback from the Broadcom folks on this kind of problems,
but apparently, debugging issues with their HBA does not seem to be high on
their to-do list.

Broadcom folks,

Could you please comment on these issues ? Not the first time I ask. A reply
would be welcome so that we all know that you at least care about issues with
your drivers/HBAs. Thank you.


> 
> [Thu Jun 12 20:07:53 2025] 00000013
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 9fcda615
> [Thu Jun 12 20:07:53 2025] 00fc0000
> [Thu Jun 12 20:07:53 2025] 00000018
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 000000ff
> [Thu Jun 12 20:07:53 2025]
> 
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000006
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 02000000
> [Thu Jun 12 20:07:53 2025]
> 
> [Thu Jun 12 20:07:53 2025] 00000012
> [Thu Jun 12 20:07:53 2025] 000000ff
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> [Thu Jun 12 20:07:53 2025] 00000000
> 
> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: issue target reset: handle = (0x0013)
> [Thu Jun 12 20:07:56 2025] scsi_io_completion_action: 22 callbacks suppressed
> [Thu Jun 12 20:07:56 2025] blk_print_req_error: 22 callbacks suppressed
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 190811280 op
> 0x0:(READ) flags 0x80700 phys_seg 28 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=16s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 68 40 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127274048 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 CDB: Read(16) 88
> 00 00 00 00 03 0d 08 dc 38 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103586360 op
> 0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 CDB: Read(16) 88
> 00 00 00 00 01 50 ee 50 48 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652762696 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 64 10 00 00 00 20 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127272976 op
> 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 64 30 00 00 04 10 00 00
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 CDB: Read(16) 88
> 00 00 00 00 03 0d 08 d8 38 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127273008 op
> 0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103585336 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 CDB: Read(16) 88
> 00 00 00 00 01 50 ee 58 48 00 00 04 00 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652764744 op
> 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 CDB: Read(16) 88
> 00 00 00 00 05 26 e3 78 10 00 00 02 c0 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127278096 op
> 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 FAILED Result:
> hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 CDB: Read(16) 88
> 00 00 00 00 03 80 11 3e c8 00 00 00 20 00 00
> [Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 15033515720 op
> 0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
> [Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
> originator(PL), code(0x14), sub_code(0x0000)
> [Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
> originator(PL), code(0x14), sub_code(0x0000)
> [Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 CDB: Read(16) 88
> 00 00 00 00 01 50 ee 96 50 00 00 05 18 00 00
> [Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000):
> originator(PL), code(0x14), sub_code(0x0000)
> [Thu Jun 12 20:07:57 2025] XFS (sdi): metadata I/O error in
> "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x484022c68 len 8 error 5
> [Thu Jun 12 20:07:59 2025] sd 8:0:9:0: Power-on or device reset occurred
> [Thu Jun 12 20:07:59 2025] sdi: writeback error on inode 12885147175,
> offset 1285156864, sector 13979483112
> 
>
Yafang Shao June 16, 2025, 12:40 p.m. UTC | #9
On Mon, Jun 16, 2025 at 10:28 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 6/16/25 11:13, Yafang Shao wrote:
> > On Mon, Jun 9, 2025 at 3:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>>
> >>> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case, which
> >>> I think promted this.
> >>
> >> Thank you for the information and for addressing this so quickly!
> >>
> >>>
> >>> Yafang, can you check if this makes the writeback errors you're seeing
> >>> go away?
> >>
> >> I’m happy to test the fix and will share the results as soon as I have them.
> >
> > We’ve confirmed that the DID_SOFT_ERROR issue no longer occurs after
> > applying this patch series to our production servers. Since our
> > production servers only use mpt3sas drives, we can verify the fix
> > specifically for this driver:
> >
> > Tested-by: Yafang Shao <laoar.shao@gmail.com>
>
> Thansk. I tested with the mpi3mr driver.
>
>
> > We’ve encountered another instance of DID_SOFT_ERROR triggered by a
> > reset on an mpt3sas drive. Do you have any insights into what might be
> > causing this failure? Below are the error details:
> >
> > [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> > originator(PL), code(0x11), sub_code(0x0610)
>
> This decodes to:
>
> Value           31110610h
> Type:           30000000h       SAS
> Origin:         01000000h       PL
> Code:           00110000h       PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
> Sub Code:       00000600h       PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET
>
> So it looks like a non-NCQ command failed. What were you doing when this happened ?

After reviewing the logs, we were unable to identify any relevant
information regarding the issue.
We appreciate your continued support in resolving this matter.

>
> > [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> > originator(PL), code(0x11), sub_code(0x0610)
> > [Thu Jun 12 17:57:35 2025] sd 8:0:9:0: Power-on or device reset occurred
>
> And this command failure is trigerring a device reset (the HBA may be doing
> that, or the drive did not like what you were doing and reset).
>
> > [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: In func: _ctl_do_mpt_command
> > [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: Command Timeout
>
> This looks like an ioctl() to the adapter diver, and it never got its reply
> apparently. This is 3 hours after the above power-on-reset, so these 2 events
> are likely not related...
>
> > [Thu Jun 12 20:07:53 2025] mf:
>
> This is dumping the mpi_request bytes. Maybe you can try to decode that to get
> hints as to what action triggered this.
>
> I would love to get feedback from the Broadcom folks on this kind of problems,
> but apparently, debugging issues with their HBA does not seem to be high on
> their to-do list.
>
> Broadcom folks,
>
> Could you please comment on these issues ? Not the first time I ask. A reply
> would be welcome so that we all know that you at least care about issues with
> your drivers/HBAs. Thank you.

We have reached out to Broadcom's team directly and hope they will
respond both to our inquiry and to this email correspondence.
Martin K. Petersen June 16, 2025, 8:51 p.m. UTC | #10
Damien,

> I am working in the dark here, with zero information on how your HBA
> handle ATA NCQ collateral aborts. I am patching against what I am
> seeing, which may be only a partial picture of the problem. So please
> check this !

Broadcom: Please review!