Message ID: 20250606052747.742998-1-dlemoal@kernel.org
Series: Improve ATA NCQ command error in mpt3sas and mpi3mr
Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case which I think prompted this. Yafang, can you check if this makes the writeback errors you're seeing go away?
On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case which
> I think prompted this.

Thank you for the information and for addressing this so quickly!

> Yafang, can you check if this makes the writeback errors you're seeing
> go away?

I’m happy to test the fix and will share the results as soon as I have them.
On 6/9/25 16:09, Yafang Shao wrote:
> On Mon, Jun 9, 2025 at 1:50 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > Adding Yafang Shao <laoar.shao@gmail.com>, who has a test case which
> > I think prompted this.

Note that the cdl-tools test suite has many test cases that do not pass without the mpi3mr patch. CDL makes it easy to trigger the issue.

> Thank you for the information and for addressing this so quickly!
>
> > Yafang, can you check if this makes the writeback errors you're seeing
> > go away?
>
> I’m happy to test the fix and will share the results as soon as I have them.

Thanks. And my apologies for forgetting to CC you on these patches.
On Mon, Jun 9, 2025 at 3:18 PM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> Note that the cdl-tools test suite has many test cases that do not pass without the
> mpi3mr patch. CDL makes it easy to trigger the issue.
>
> Thanks. And my apologies for forgetting to CC you on these patches.

We have developed and deployed a hotfix to hundreds of our production servers equipped with this drive, and it has been functioning as expected so far. We are currently rolling it out to the remaining production servers, though the process may take some time to complete.

Additionally, we encountered a writeback error in the libata driver, which appears to be related to DID_TIME_OUT. Do you have any insights into this issue?
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 CDB: Write(16) 8a 00 00 00 00 08 2c eb 4a 58 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113355864 op 0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#29 CDB: Write(16) 8a 00 00 00 00 08 2c eb 45 18 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113354520 op 0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#28 CDB: Write(16) 8a 00 00 00 00 08 2c eb 42 58 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113353816 op 0x1:(WRITE) flags 0x100000 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#22 CDB: Read(16) 88 00 00 00 00 03 d8 33 44 18 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512140312 op 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#21 CDB: Read(16) 88 00 00 00 00 03 d8 33 3e d8 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138968 op 0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#20 CDB: Read(16) 88 00 00 00 00 03 d8 33 3c 18 00 00 02 c0 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512138264 op 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#19 CDB: Write(16) 8a 00 00 00 00 08 2c eb 3d 18 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 35113352472 op 0x1:(WRITE) flags 0x104000 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#15 CDB: Read(16) 88 00 00 00 00 03 2c e9 4e b8 00 00 00 98 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638389432 op 0x0:(READ) flags 0x80700 phys_seg 19 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#14 CDB: Read(16) 88 00 00 00 00 03 2c e9 4c b8 00 00 02 00 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 13638388920 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
[Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#13 CDB: Read(16) 88 00 00 00 00 03 d8 33 36 d8 00 00 05 40 00 00
[Tue Jun 10 13:27:47 2025] I/O error, dev sdl, sector 16512136920 op 0x0:(READ) flags 0x84700 phys_seg 168 prio class 2
[Tue Jun 10 13:27:47 2025] XFS (sdl): metadata I/O error in "xfs_imap_to_bp+0x4f/0x70 [xfs]" at daddr 0x4804c4398 len 32 error 5
[Tue Jun 10 13:27:48 2025] sdl: writeback error on inode 32213499599, offset 0, sector 35113332208
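[Editor's note: for anyone cross-checking a log like the one above, the LBA and transfer length in each Read(16)/Write(16) CDB can be decoded and compared against the sector the block layer reports. A minimal sketch, using the standard SBC CDB layout; the example bytes are taken from the tag#30 line above:

```python
# Decode the LBA and transfer length from a SCSI Read(16)/Write(16) CDB.
# Per SBC: byte 0 is the opcode (0x88 = Read(16), 0x8a = Write(16)),
# bytes 2-9 are the 64-bit big-endian LBA, and bytes 10-13 are the
# 32-bit big-endian transfer length in logical blocks.
def decode_rw16(cdb: bytes):
    assert cdb[0] in (0x88, 0x8A), "not a Read(16)/Write(16) CDB"
    lba = int.from_bytes(cdb[2:10], "big")
    length = int.from_bytes(cdb[10:14], "big")
    return lba, length

# tag#30 above: Write(16) 8a 00 00 00 00 08 2c eb 4a 58 00 00 02 c0 00 00
cdb = bytes.fromhex("8a00000000082ceb4a58000002c00000")
lba, length = decode_rw16(cdb)
print(lba, length)  # 35113355864 704 -- matches "sector 35113355864" above
```

The decoded LBA matching the "I/O error, dev sdl, sector ..." line confirms the drive uses 512-byte logical blocks here, since the block layer reports sectors in the same units as the CDB LBA.]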
On 6/11/25 12:27 PM, Yafang Shao wrote:
> We have developed and deployed a hotfix to hundreds of our production
> servers equipped with this drive, and it has been functioning as
> expected so far. We are currently rolling it out to the remaining
> production servers, though the process may take some time to complete.
>
> Additionally, we encountered a writeback error in the libata driver,
> which appears to be related to DID_TIME_OUT. Do you have any insights
> into this issue?
>
> [Tue Jun 10 13:27:47 2025] sd 14:0:0:0: [sdl] tag#30 FAILED Result:
> hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s

No idea. Timeouts are hard to debug... A bus trace would help. It could be that the drives are really bad and taking too long per command, thus causing timeouts for commands that are at the end of the queue. E.g., if you hit many bad sectors with read commands, the drive's internal retry mechanism may cause such a read command to take several seconds. So if you are running the drives at high queue depth, this can easily cause timeouts for the commands that are at the end of the queue.
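[Editor's note: Damien's point about queue depth can be illustrated with back-of-envelope arithmetic. The queue depth and per-command service time below are made-up numbers for the sketch, not measurements from these drives:

```python
# Back-of-envelope: how long the last command in a deep queue can wait if
# the drive services commands one at a time and internal error recovery
# (e.g. retries on bad sectors) inflates per-command service time.
# All numbers are hypothetical, chosen only to illustrate the effect.
scsi_timeout_s = 30.0   # typical default, see /sys/block/<dev>/device/timeout
queue_depth = 32        # NCQ queue depth (hypothetical)
service_time_s = 1.5    # per-command time with internal retries (hypothetical)

worst_case_wait_s = queue_depth * service_time_s
print(f"worst-case wait: {worst_case_wait_s:.0f}s "
      f"({'exceeds' if worst_case_wait_s > scsi_timeout_s else 'within'} "
      f"the {scsi_timeout_s:.0f}s timeout)")
# worst-case wait: 48s (exceeds the 30s timeout)
```

With these assumed numbers, every command completes successfully at the drive, yet the commands queued last still blow past the 30s SCSI timeout, which is consistent with the cmd_age=31s values in the log above.]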
On Wed, Jun 11, 2025 at 11:59 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> No idea. Timeouts are hard to debug... A bus trace would help. It could be
> that the drives are really bad and taking too long per command, thus causing
> timeouts for commands that are at the end of the queue. E.g., if you hit many
> bad sectors with read commands, the drive's internal retry mechanism may cause
> such a read command to take several seconds. So if you are running the drives
> at high queue depth, this can easily cause timeouts for the commands that are
> at the end of the queue.

Thank you for the detailed explanation. We will investigate this further.
On Mon, Jun 9, 2025 at 3:09 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> I’m happy to test the fix and will share the results as soon as I have them.

We’ve confirmed that the DID_SOFT_ERROR issue no longer occurs after applying this patch series to our production servers. Since our production servers only use mpt3sas drives, we can verify the fix specifically for this driver:

Tested-by: Yafang Shao <laoar.shao@gmail.com>

Hello Damien,

We’ve encountered another instance of DID_SOFT_ERROR triggered by a reset on an mpt3sas drive. Do you have any insights into what might be causing this failure? Below are the error details:

[Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
[Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
[Thu Jun 12 17:57:35 2025] sd 8:0:9:0: Power-on or device reset occurred
[Thu Jun 12 20:07:53 2025] mpt3sas_cm0: In func: _ctl_do_mpt_command
[Thu Jun 12 20:07:53 2025] mpt3sas_cm0: Command Timeout
[Thu Jun 12 20:07:53 2025] mf:
[Thu Jun 12 20:07:53 2025] 00000013
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 9fcda615
[Thu Jun 12 20:07:53 2025] 00fc0000
[Thu Jun 12 20:07:53 2025] 00000018
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 000000ff
[Thu Jun 12 20:07:53 2025]
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000006
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 02000000
[Thu Jun 12 20:07:53 2025]
[Thu Jun 12 20:07:53 2025] 00000012
[Thu Jun 12 20:07:53 2025] 000000ff
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] 00000000
[Thu Jun 12 20:07:53 2025] mpt3sas_cm0: issue target reset: handle = (0x0013)
[Thu Jun 12 20:07:56 2025] scsi_io_completion_action: 22 callbacks suppressed
[Thu Jun 12 20:07:56 2025] blk_print_req_error: 22 callbacks suppressed
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 190811280 op 0x0:(READ) flags 0x80700 phys_seg 28 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=16s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2305 CDB: Read(16) 88 00 00 00 00 05 26 e3 68 40 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127274048 op 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2158 CDB: Read(16) 88 00 00 00 00 03 0d 08 dc 38 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103586360 op 0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2369 CDB: Read(16) 88 00 00 00 00 01 50 ee 50 48 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652762696 op 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2368 CDB: Read(16) 88 00 00 00 00 05 26 e3 64 10 00 00 00 20 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127272976 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2304 CDB: Read(16) 88 00 00 00 00 05 26 e3 64 30 00 00 04 10 00 00
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2157 CDB: Read(16) 88 00 00 00 00 03 0d 08 d8 38 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127273008 op 0x0:(READ) flags 0x84700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 13103585336 op 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2400 CDB: Read(16) 88 00 00 00 00 01 50 ee 58 48 00 00 04 00 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 5652764744 op 0x0:(READ) flags 0x80700 phys_seg 128 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2309 CDB: Read(16) 88 00 00 00 00 05 26 e3 78 10 00 00 02 c0 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 22127278096 op 0x0:(READ) flags 0x80700 phys_seg 88 prio class 2
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=17s
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2376 CDB: Read(16) 88 00 00 00 00 03 80 11 3e c8 00 00 00 20 00 00
[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 15033515720 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
[Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000): originator(PL), code(0x14), sub_code(0x0000)
[Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000): originator(PL), code(0x14), sub_code(0x0000)
[Thu Jun 12 20:07:56 2025] sd 8:0:9:0: [sdi] tag#2336 CDB: Read(16) 88 00 00 00 00 01 50 ee 96 50 00 00 05 18 00 00
[Thu Jun 12 20:07:56 2025] mpt3sas_cm0: log_info(0x31140000): originator(PL), code(0x14), sub_code(0x0000)
[Thu Jun 12 20:07:57 2025] XFS (sdi): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x484022c68 len 8 error 5
[Thu Jun 12 20:07:59 2025] sd 8:0:9:0: Power-on or device reset occurred
[Thu Jun 12 20:07:59 2025] sdi: writeback error on inode 12885147175, offset 1285156864, sector 13979483112
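[Editor's note: when triaging a burst of failures like this, it can help to pull the device, sector, and op fields out of the dmesg excerpt so the failed LBAs can be grouped or sorted. A minimal sketch; the regex matches the "I/O error" line format shown above and is an illustration, not a tool from the thread:

```python
import re

# Extract device, sector, and operation from block-layer "I/O error" lines
# as printed by blk_print_req_error(), e.g.:
#   I/O error, dev sdi, sector 190811280 op 0x0:(READ) flags 0x80700 ...
LINE = re.compile(
    r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+) "
    r"op 0x(?P<op>[0-9a-f]+):\((?P<opname>\w+)\)"
)

def summarize(dmesg: str):
    """Return one dict per I/O error line found in the dmesg text."""
    return [m.groupdict() for m in LINE.finditer(dmesg)]

sample = (
    "[Thu Jun 12 20:07:56 2025] I/O error, dev sdi, sector 190811280 "
    "op 0x0:(READ) flags 0x80700 phys_seg 28 prio class 2"
)
print(summarize(sample))
# [{'dev': 'sdi', 'sector': '190811280', 'op': '0', 'opname': 'READ'}]
```

Feeding the whole excerpt through this quickly shows whether the failed sectors cluster in one LBA range (suggesting media trouble) or are scattered (more consistent with a link or HBA-level abort, as in this thread).]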
On 6/16/25 11:13, Yafang Shao wrote:
> We’ve confirmed that the DID_SOFT_ERROR issue no longer occurs after
> applying this patch series to our production servers. Since our
> production servers only use mpt3sas drives, we can verify the fix
> specifically for this driver:
>
> Tested-by: Yafang Shao <laoar.shao@gmail.com>

Thanks. I tested with the mpi3mr driver.

> We’ve encountered another instance of DID_SOFT_ERROR triggered by a
> reset on an mpt3sas drive. Do you have any insights into what might be
> causing this failure? Below are the error details:
>
> [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> originator(PL), code(0x11), sub_code(0x0610)

This decodes to:

Value 31110610h
Type: 30000000h SAS
Origin: 01000000h PL
Code: 00110000h PL_LOGINFO_CODE_RESET (see PL_LOGINFO_SUB_CODE sub-codes below)
Sub Code: 00000600h PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET

So it looks like a non-NCQ command failed. What were you doing when this happened?

> [Thu Jun 12 17:57:35 2025] mpt3sas_cm0: log_info(0x31110610):
> originator(PL), code(0x11), sub_code(0x0610)
> [Thu Jun 12 17:57:35 2025] sd 8:0:9:0: Power-on or device reset occurred

And this command failure is triggering a device reset (the HBA may be doing that, or the drive did not like what you were doing and reset).
> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: In func: _ctl_do_mpt_command
> [Thu Jun 12 20:07:53 2025] mpt3sas_cm0: Command Timeout

This looks like an ioctl() to the adapter driver, and it never got its reply apparently. This is 3 hours after the above power-on-reset, so these 2 events are likely not related...

> [Thu Jun 12 20:07:53 2025] mf:

This is dumping the mpi_request bytes. Maybe you can try to decode that to get hints as to what action triggered this.

I would love to get feedback from the Broadcom folks on this kind of problem, but apparently, debugging issues with their HBAs does not seem to be high on their to-do list.

Broadcom folks,

Could you please comment on these issues? This is not the first time I have asked. A reply would be welcome so that we all know that you at least care about issues with your drivers/HBAs. Thank you.

[...]
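[Editor's note: the loginfo breakdown Damien gives above is a straightforward bit-field split, which can be sketched as follows. The masks mirror the Type/Origin/Code/Sub-Code values in his decode, and the constant name in the comment is the one quoted in the thread:

```python
# Split an mpt3sas log_info value into the MPI fields quoted in the thread:
# top nibble = type, next byte = originator, then code and sub-code.
def decode_loginfo(value: int):
    return {
        "type":     value & 0xF0000000,  # 0x30000000 = SAS
        "origin":   value & 0x0F000000,  # 0x01000000 = PL
        "code":     value & 0x00FF0000,  # 0x00110000 = PL_LOGINFO_CODE_RESET
        "sub_code": value & 0x0000FFFF,
    }

info = decode_loginfo(0x31110610)
print({k: hex(v) for k, v in info.items()})
# {'type': '0x30000000', 'origin': '0x1000000', 'code': '0x110000', 'sub_code': '0x610'}
```

The 0x0600 portion of the sub-code is what matches PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET in Damien's decode; the authoritative field definitions live in Broadcom's MPI headers.]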
On Mon, Jun 16, 2025 at 10:28 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> So it looks like a non-NCQ command failed. What were you doing when this happened?

After reviewing the logs, we were unable to identify any relevant information regarding the issue. We appreciate your continued support in resolving this matter.

> Broadcom folks,
>
> Could you please comment on these issues? A reply would be welcome so that we
> all know that you at least care about issues with your drivers/HBAs. Thank you.

We have reached out to Broadcom's team directly and hope they will respond both to our inquiry and to this email correspondence.
Damien,

> I am working in the dark here, with zero information on how your HBA
> handles ATA NCQ collateral aborts. I am patching against what I am
> seeing, which may be only a partial picture of the problem. So please
> check this!

Broadcom: Please review!