diff mbox series

[v8,04/23] scsi: sd: Differentiate system and runtime start/stop management

Message ID 20230927141828.90288-5-dlemoal@kernel.org
State New
Headers show
Series Fix libata suspend/resume handling and code cleanup | expand

Commit Message

Damien Le Moal Sept. 27, 2023, 2:18 p.m. UTC
The underlying device and driver of a SCSI disk may have different
system and runtime power mode control requirements. This is because
runtime power management affects only the SCSI disk, while sustem level
power management affects all devices, including the controller for the
SCSI disk.

For instance, issuing a START STOP UNIT command when a SCSI disk is
runtime suspended and resumed is fine: the command is translated to a
STANDBY IMMEDIATE command to spin down the ATA disk and to a VERIFY
command to wake it up. The SCSI disk runtime operations have no effect
on the ata port device used to connect the ATA disk. However, for
system suspend/resume operations, the ATA port used to connect the
device will also be suspended and resumed, with the resume operation
requiring re-validating the device link and the device itself. In this
case, issuing a VERIFY command to spinup the disk must be done before
starting to revalidate the device, when the ata port is being resumed.
In such case, we must not allow the SCSI disk driver to issue START STOP
UNIT commands.

Allow a low level driver to refine the SCSI disk start/stop management
by differentiating system and runtime cases with two new SCSI device
flags: manage_system_start_stop and manage_runtime_start_stop. These new
flags replace the current manage_start_stop flag. Drivers setting the
manage_start_stop are modifed to set both new flags, thus preserving the
existing start/stop management behavior. For backward compatibility, the
old manage_start_stop sysfs device attribute is kept as a read-only
attribute showing a value of 1 for devices enabling both new flags and 0
otherwise.

Fixes: 0a8589055936 ("ata,scsi: do not issue START STOP UNIT on resume")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/ata/libata-scsi.c  |  3 +-
 drivers/firewire/sbp2.c    |  9 ++--
 drivers/scsi/sd.c          | 90 ++++++++++++++++++++++++++++++--------
 include/scsi/scsi_device.h |  5 ++-
 4 files changed, 84 insertions(+), 23 deletions(-)

Comments

Phillip Susi Oct. 10, 2023, 1:09 p.m. UTC | #1
Damien Le Moal <dlemoal@kernel.org> writes:

> system suspend/resume operations, the ATA port used to connect the
> device will also be suspended and resumed, with the resume operation
> requiring re-validating the device link and the device itself. In this
> case, issuing a VERIFY command to spinup the disk must be done before
> starting to revalidate the device, when the ata port is being resumed.
> In such case, we must not allow the SCSI disk driver to issue START STOP
> UNIT commands.

Why must a VERIFY be issued to spinup the disk before revalidating?
Before these patches, by default, manage_start_stop was on, and so sd
would cause a VERIFY in the system resume path.  That resume however (
sd and its issuing START UNIT ), would have happened AFTER the link was
resumed and the ATA device was revalidated, woudldn't it?  So at that
point, the drive would already be spinning.  And if manage_start_stop
was disabled, then there would be no VERIFY at all, and this did not
seem to cause a problem before.

If this VERIFY were skipped, the next thing that would happen is for
ata_dev_revalidate() to issue IDENTIFY, which would wait for the drive
to spin up before returning wouldn't it? ( unless the drive has PuiS
enabled ).
Damien Le Moal Oct. 10, 2023, 2:04 p.m. UTC | #2
On 10/10/23 22:09, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>> system suspend/resume operations, the ATA port used to connect the
>> device will also be suspended and resumed, with the resume operation
>> requiring re-validating the device link and the device itself. In this
>> case, issuing a VERIFY command to spinup the disk must be done before
>> starting to revalidate the device, when the ata port is being resumed.
>> In such case, we must not allow the SCSI disk driver to issue START STOP
>> UNIT commands.
> 
> Why must a VERIFY be issued to spinup the disk before revalidating?

We can most likely move that VERIFY to after revalidation. That should shorten
delays on first access after resume as many drive do actually spinup with the
reset done before revalidating (but note that the specs say that even a reset
shall not take a drive out of standby mode, without specifying the reset type,
so this is rather loosely defined).

> Before these patches, by default, manage_start_stop was on, and so sd
> would cause a VERIFY in the system resume path.  That resume however (
> sd and its issuing START UNIT ), would have happened AFTER the link was
> resumed and the ATA device was revalidated, woudldn't it?  So at that

In theory, yes, that was the intent. In practice, the verify was issued from
scsi PM resume context while the actual drive port reset + revalidation is done
in libata EH context, triggered from ATA port resume context which itself was
not synchronized/ordered with the scsi disk resume. So we ended up with the
verify command execution sometimes being attempted with the drive not even
revalidated yet, or with the port/link not even active sometimes (depending on
timing). So problems all over and deadlocks due to scsi revalidate using the
device lock, which PM use too.

> point, the drive would already be spinning.  And if manage_start_stop
> was disabled, then there would be no VERIFY at all, and this did not
> seem to cause a problem before.

See above. With the switch to async PM ops in scsi in kernel 5.16, things broke
badly due to the lack of synchronization that sync PM provided before that.

> 
> If this VERIFY were skipped, the next thing that would happen is for
> ata_dev_revalidate() to issue IDENTIFY, which would wait for the drive
> to spin up before returning wouldn't it? ( unless the drive has PuiS
> enabled ).

ACS defines that only media access commands can get a drive out of standby mode
back into active mode. So an IDENTIFY command would not (normally) spinup a
drive. Nor would READ LOG. Normally, IDENTIFY, READ LOG etc done in revalidate
can be processed with the drive spun down (*but* I have seen drives that react
badly to some of these commands when spun down). Hence why I put it first,
before revalidate. Now that all the synchronization is in place and libata EH
does its own manage_start_stop for system suspend/resume, I will see if moving
the verify to the end of revalidate works.
Phillip Susi Oct. 15, 2023, 4:14 p.m. UTC | #3
For SCSI disks that are runtime suspended, it looks like they skip
waking the disk on system resume, leaving them in runtime suspend.
After these patches, it looks like libata always wakes up the disk, but
I don't see any calls to pm_runtime_disable/set_active/enable to mark
the scsi disk as active after the system resume.  That should result in
a disk that is spinning, but runtime pm thinks is not, and so will not
put it into suspend after the inactivity timeout.
Damien Le Moal Oct. 15, 2023, 10:44 p.m. UTC | #4
On 10/16/23 01:14, Phillip Susi wrote:
> For SCSI disks that are runtime suspended, it looks like they skip
> waking the disk on system resume, leaving them in runtime suspend.
> After these patches, it looks like libata always wakes up the disk, but
> I don't see any calls to pm_runtime_disable/set_active/enable to mark
> the scsi disk as active after the system resume.  That should result in
> a disk that is spinning, but runtime pm thinks is not, and so will not
> put it into suspend after the inactivity timeout.

Yes, correct, but this does not create any issues in practice beside the
undesired disk spinup.

Fixing that is not trivial because using runtime suspend/resume on the SCSI disk
is just that, it will affect *only* the SCSI disk and not the ATA device and its
port. In other words, a runtime suspend of the SCSI disk will spin down the
drive but it will not runtime suspend the ATA port. So if you suspend the
system, on resume, the ATA port will not be runtime suspended and so it will be
resumed. The SCSI disk will not be resumed, but the ATA port resume will have
spun up the disk, which we do not really want in that case.

I am looking into this. Again, that is not a trivial fix. The other thing to
notice here is that ATA port runtime suspend/resume is in fact broken: it does
not track accesses to the device(s) connected to the port. And given that more
than one device may be connected to a port, we need PM runtime reference
counting to be done for this to work correctly. That is missing. Solutions are:
fix everything or simply do not support ATA port runtime suspend/resume (i.e.
remove code doing it). I am leaning toward the latter as it seems that no one
actually noticed these issues because no one is actually using ATA port runtime
suspend/resume...
Phillip Susi Oct. 16, 2023, 12:39 p.m. UTC | #5
Damien Le Moal <dlemoal@kernel.org> writes:

> Yes, correct, but this does not create any issues in practice beside the
> undesired disk spinup.

The issue it creates is the opposite of that: it breaks the desired spin
down.  After some period of inactivity, the disk should be suspended,
but after a system resume, the kernel thinks that it already is, and so
won't suspend it again.

> Fixing that is not trivial because using runtime suspend/resume on the SCSI disk
> is just that, it will affect *only* the SCSI disk and not the ATA device and its
> port. In other words, a runtime suspend of the SCSI disk will spin down the
> drive but it will not runtime suspend the ATA port. So if you suspend
> the

I tested this last week and it appeared to work.  I enabled runtime pm
on the disk, as well as the ata port, and as soon as the disk suspended,
the port did as well.

> system, on resume, the ATA port will not be runtime suspended and so it will be
> resumed. The SCSI disk will not be resumed, but the ATA port resume will have
> spun up the disk, which we do not really want in that case.

Right, I would rather the disk stay asleep if it has PuiS enabled, and
I'm working on a patch for that.  In the process of doing that though, I
noticed that despite waking the disk up, it does not inform runtime pm
about that.

> I am looking into this. Again, that is not a trivial fix. The other thing to
> notice here is that ATA port runtime suspend/resume is in fact broken: it does
> not track accesses to the device(s) connected to the port. And given that more
> than one device may be connected to a port, we need PM runtime reference
> counting to be done for this to work correctly. That is
> missing. Solutions are:

Again, it seems to me that the child reference counting IS working.

> fix everything or simply do not support ATA port runtime suspend/resume (i.e.
> remove code doing it). I am leaning toward the latter as it seems that no one
> actually noticed these issues because no one is actually using ATA port runtime
> suspend/resume...

Probably nobody is using it yes, but that doesn't mean we shouldn't try
to get it working.  It would be nice to have the drive go into deep
SLEEP instead of standby, as well as suspend the ata port, and possibly
even the whole AHCI controller rather than relying on the old APM drive
internal auto standby mode.
Damien Le Moal Oct. 16, 2023, 12:55 p.m. UTC | #6
On 10/16/23 21:39, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>> Yes, correct, but this does not create any issues in practice beside the
>> undesired disk spinup.
> 
> The issue it creates is the opposite of that: it breaks the desired spin
> down.  After some period of inactivity, the disk should be suspended,
> but after a system resume, the kernel thinks that it already is, and so
> won't suspend it again.

That one should be fixable, though it I do not see an elegant method to do it.
It would be easy with ugly code, e.g. tweaking the scsi device runtime pm state
from libata... Not great.

> 
>> Fixing that is not trivial because using runtime suspend/resume on the SCSI disk
>> is just that, it will affect *only* the SCSI disk and not the ATA device and its
>> port. In other words, a runtime suspend of the SCSI disk will spin down the
>> drive but it will not runtime suspend the ATA port. So if you suspend
>> the
> 
> I tested this last week and it appeared to work.  I enabled runtime pm
> on the disk, as well as the ata port, and as soon as the disk suspended,
> the port did as well.

Never saw that in my tests when enabling runtime pm on the scsi disk only. Which
is the important point here: there is no propagation of the suspend state down
to the device parent it seems.

> 
>> system, on resume, the ATA port will not be runtime suspended and so it will be
>> resumed. The SCSI disk will not be resumed, but the ATA port resume will have
>> spun up the disk, which we do not really want in that case.
> 
> Right, I would rather the disk stay asleep if it has PuiS enabled, and
> I'm working on a patch for that.  In the process of doing that though, I
> noticed that despite waking the disk up, it does not inform runtime pm
> about that.

But there are no runtime PM operations defined for ATA devices, only for ports.
So not sure that matters... I am probably still missing something about runtime
PM and devices ancestry. I focused a lot on system suspend/resume to fix the
issues. runtime suspend/resume is next.

> 
>> I am looking into this. Again, that is not a trivial fix. The other thing to
>> notice here is that ATA port runtime suspend/resume is in fact broken: it does
>> not track accesses to the device(s) connected to the port. And given that more
>> than one device may be connected to a port, we need PM runtime reference
>> counting to be done for this to work correctly. That is
>> missing. Solutions are:
> 
> Again, it seems to me that the child reference counting IS working.

I am not sure of that, especially with cases of ATA ports with multiple disks
(e.g. pmp or IDE).

> 
>> fix everything or simply do not support ATA port runtime suspend/resume (i.e.
>> remove code doing it). I am leaning toward the latter as it seems that no one
>> actually noticed these issues because no one is actually using ATA port runtime
>> suspend/resume...
> 
> Probably nobody is using it yes, but that doesn't mean we shouldn't try
> to get it working.  It would be nice to have the drive go into deep
> SLEEP instead of standby, as well as suspend the ata port, and possibly
> even the whole AHCI controller rather than relying on the old APM drive
> internal auto standby mode.
>
Phillip Susi Oct. 17, 2023, 6:03 p.m. UTC | #7
Damien Le Moal <dlemoal@kernel.org> writes:

> That one should be fixable, though it I do not see an elegant method to do it.
> It would be easy with ugly code, e.g. tweaking the scsi device runtime pm state
> from libata... Not great.

What would be not great about it?  libata already takes over the system
suspend/resume from sd.  I'm currently testing having libata do just
this right now.  I just got ahold of some jumpers today to put the
drives back into PuiS and do some further testing tonight.

> Never saw that in my tests when enabling runtime pm on the scsi disk only. Which
> is the important point here: there is no propagation of the suspend state down
> to the device parent it seems.

Last night I again saw the port auto suspend when the scsi disk was
runtime suspended.  Tonight I'll test with PuiS, as well as with system
resume while runtime suspended.  Maybe I'll even try to get the whole
AHCI controller to auto suspend.  It seems like it should once all of
the ports do.

> I am not sure of that, especially with cases of ATA ports with multiple disks
> (e.g. pmp or IDE).

Good point.  I have an eSATA dock with PMP.  I'll check tonight if the
children are counted properly.
Damien Le Moal Oct. 17, 2023, 11:32 p.m. UTC | #8
On 10/18/23 03:03, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>> That one should be fixable, though it I do not see an elegant method to do it.
>> It would be easy with ugly code, e.g. tweaking the scsi device runtime pm state
>> from libata... Not great.
> 
> What would be not great about it?  libata already takes over the system

To have to add code in libata that touches the scsi device PM status is what's
not going to be great. Such code should stay in sd.c and scsi_pm.c. But not sure
we actually need anything.

> suspend/resume from sd.  I'm currently testing having libata do just
> this right now.  I just got ahold of some jumpers today to put the
> drives back into PuiS and do some further testing tonight.
> 
>> Never saw that in my tests when enabling runtime pm on the scsi disk only. Which
>> is the important point here: there is no propagation of the suspend state down
>> to the device parent it seems.
> 
> Last night I again saw the port auto suspend when the scsi disk was
> runtime suspended.  Tonight I'll test with PuiS, as well as with system
> resume while runtime suspended.  Maybe I'll even try to get the whole
> AHCI controller to auto suspend.  It seems like it should once all of
> the ports do.

With my tests, I never set the ata port power/control to "auto", which may be
why I did not see the port being runtime suspended when the scsi disk was
runtime suspended. Will try again.

>> I am not sure of that, especially with cases of ATA ports with multiple disks
>> (e.g. pmp or IDE).
> 
> Good point.  I have an eSATA dock with PMP.  I'll check tonight if the
> children are counted properly.

With the device links in place between port and scsi devices, we should be OK.
But still need to check that we do not need runtime_get/put calls added.
Ideally, we should have the chain:

scsi disk -> scsi target -> scsi host -> ata port

for runtime suspend, and the reverse for runtime resume. If there is a system
suspend/resume between runtime suspend/resume, the port should not be resumed if
it is runtime suspended.
Damien Le Moal Oct. 18, 2023, 6:16 a.m. UTC | #9
On 10/18/23 03:03, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>> That one should be fixable, though it I do not see an elegant method to do it.
>> It would be easy with ugly code, e.g. tweaking the scsi device runtime pm state
>> from libata... Not great.
> 
> What would be not great about it?  libata already takes over the system
> suspend/resume from sd.  I'm currently testing having libata do just
> this right now.  I just got ahold of some jumpers today to put the
> drives back into PuiS and do some further testing tonight.
> 
>> Never saw that in my tests when enabling runtime pm on the scsi disk only. Which
>> is the important point here: there is no propagation of the suspend state down
>> to the device parent it seems.
> 
> Last night I again saw the port auto suspend when the scsi disk was
> runtime suspended.  Tonight I'll test with PuiS, as well as with system
> resume while runtime suspended.  Maybe I'll even try to get the whole
> AHCI controller to auto suspend.  It seems like it should once all of
> the ports do.
> 
>> I am not sure of that, especially with cases of ATA ports with multiple disks
>> (e.g. pmp or IDE).
> 
> Good point.  I have an eSATA dock with PMP.  I'll check tonight if the
> children are counted properly.

On my system, I see:

cat /sys/class/ata_port/ata1/power/runtime_active_kids
0

and same for port 10 which is a PMP box with 3 drives. So it means that the
children will be ignored, which is wrong. Note that the corresponding scsi_host
device also shows 0. So to be safe with port runtime PM, we need to fix that
first. Otherwise, the port may end up being runtime suspended with running
drives still attached to it.

/sys/class/ata_port/ata1/power/control is set to "auto" by default.
Phillip Susi Oct. 20, 2023, 7 p.m. UTC | #10
Damien Le Moal <dlemoal@kernel.org> writes:

> With the device links in place between port and scsi devices, we should be OK.
> But still need to check that we do not need runtime_get/put calls added.
> Ideally, we should have the chain:
>
> scsi disk -> scsi target -> scsi host -> ata port

It looks to me like there is an additional generic block device that
sits on top and that is what actually has the idle timeout.  Or maybe
that's the scsi disk, since it's name incldues the SCSI LUN, but in the
structure, its called sdev_gendev.  But then there's also sdev_dev, and
sdev_target.

> for runtime suspend, and the reverse for runtime resume. If there is a system
> suspend/resume between runtime suspend/resume, the port should not be resumed if
> it is runtime suspended.

I'm not sure about it.  The port has to be resumed so that we can
attempt to revalidate the devices on it.  For disks that have spun up on
their own, we should not leave then marked as runtime suspended, but
really they are spinning.  I suppose we could put them to sleep, though
I was leaning to just marking them as active, and leaving the runtime pm
timer to put them to sleep later, which then could allow the port to
suspend again.
Phillip Susi Oct. 20, 2023, 9:23 p.m. UTC | #11
Damien Le Moal <dlemoal@kernel.org> writes:

> On my system, I see:
>
> cat /sys/class/ata_port/ata1/power/runtime_active_kids
> 0

I see a 1 there, which is the single scsi_host.  The scsi_host has 2
active kids; the two disks.  When I enabled runtime pm, only when the
second disk was suspended did that allow the scsi_host to suspend, which
then allowed the port to suspend.  Everything looked fine there so far.
Then I tried:

echo 1 > /sys/block/sdf/device/delete

And the SCSI EH appears to have tried to wake up the disk, and hung in
the process.

[  314.246282] sd 7:0:0:0: [sde] Synchronizing SCSI cache
[  314.246445] sd 7:0:0:0: [sde] Stopping disk

First disk suspends.

[  388.518295] sd 7:1:0:0: [sdf] Synchronizing SCSI cache
[  388.518519] sd 7:1:0:0: [sdf] Stopping disk

Second disk suspends some time later.

[  388.930428] ata8.00: Entering standby power mode
[  389.330651] ata8.01: Entering standby power mode

That allowed the port to suspend.  This is when I tried to detach the
disk driver, which I think tried to resume the disk before detaching,
which resumed the port.

[  467.511878] ata8.15: SATA link down (SStatus 0 SControl 310)
[  468.142726] ata8.15: failed to read PMP GSCR[0] (Emask=0x100)
[  468.142741] ata8.15: PMP revalidation failed (errno=-5)

I ran hdparm -C on the other disk at this point.  I just noticed that
the ata8.15 that represents the PMP itself was NOT suspended along with
the two drive links, and then maybe was not resumed before trying to
revalidate the PMP?  And that's why it failed?

[  473.172792] ata8.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  473.486860] ata8.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  473.802139] ata8.01: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

It seems like it ended up recovering here though?  And yet the scsi_eh
remained hung, as did the hdparm -C:

[  605.566814] INFO: task scsi_eh_7:173 blocked for more than 120 seconds.
[  605.566829]       Not tainted 6.6.0-rc5+ #5
[  605.566834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.566838] task:scsi_eh_7       state:D stack:0     pid:173   ppid:2      flags:0x00004000
[  605.566850] Call Trace:
[  605.566853]  <TASK>
[  605.566860]  __schedule+0x37c/0xb70
[  605.566878]  schedule+0x61/0xd0
[  605.566888]  rpm_resume+0x156/0x760
[  605.566896]  ? sched_energy_aware_handler+0xb0/0xb0
[  605.566907]  rpm_resume+0x255/0x760
[  605.566915]  rpm_resume+0x255/0x760
[  605.566923]  rpm_resume+0x255/0x760
[  605.566931]  __pm_runtime_resume+0x4e/0x80
[  605.566941]  ata_eh_recover+0x695/0x1060 [libata]
[  605.567001]  ? ata_port_pm_suspend+0x50/0x50 [libata]
[  605.567048]  ? ahci_do_softreset+0x2d0/0x2d0 [libahci]
[  605.567067]  ? ata_host_release+0x80/0x80 [libata]
[  605.567108]  ? ata_port_runtime_idle+0x110/0x110 [libata]
[  605.567151]  ? sata_pmp_configure+0x72/0x210 [libata]
[  605.567204]  sata_pmp_error_handler+0x357/0xac0 [libata]
[  605.567249]  ? ata_port_pm_suspend+0x50/0x50 [libata]
[  605.567291]  ? ahci_stop_engine+0xe0/0xe0 [libahci]
[  605.567309]  ? ahci_do_hardreset+0x140/0x140 [libahci]
[  605.567325]  ? ahci_do_softreset+0x2d0/0x2d0 [libahci]
[  605.567344]  ? _raw_spin_unlock_irqrestore+0x27/0x40
[  605.567355]  ahci_error_handler+0x36/0x60 [libahci]
[  605.567373]  ata_scsi_port_error_handler+0x3de/0x8a0 [libata]
[  605.567424]  ? scsi_eh_get_sense+0x250/0x250 [scsi_mod]
[  605.567464]  ata_scsi_error+0x95/0xc0 [libata]
[  605.567511]  scsi_error_handler+0xb9/0x580 [scsi_mod]
[  605.567547]  ? preempt_count_add+0x6c/0xa0
[  605.567556]  ? scsi_eh_get_sense+0x250/0x250 [scsi_mod]
[  605.567587]  kthread+0xf2/0x120
[  605.567594]  ? kthread_complete_and_exit+0x20/0x20
[  605.567602]  ret_from_fork+0x31/0x50
[  605.567611]  ? kthread_complete_and_exit+0x20/0x20
[  605.567617]  ret_from_fork_asm+0x11/0x20
[  605.567630]  </TASK>
[  605.567663] INFO: task bash:1305 blocked for more than 120 seconds.
[  605.567670]       Not tainted 6.6.0-rc5+ #5
[  605.567675] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.567678] task:bash            state:D stack:0     pid:1305  ppid:1300   flags:0x00004004
[  605.567687] Call Trace:
[  605.567689]  <TASK>
[  605.567693]  __schedule+0x37c/0xb70
[  605.567703]  ? try_to_wake_up+0xb2/0x5e0
[  605.567715]  schedule+0x61/0xd0
[  605.567725]  ata_port_wait_eh+0x7c/0xf0 [libata]
[  605.567776]  ? sched_energy_aware_handler+0xb0/0xb0
[  605.567784]  ? ata_sas_port_resume+0x30/0x30 [libata]
[  605.567829]  ata_port_runtime_resume+0x27/0x30 [libata]
[  605.567870]  __rpm_callback+0x41/0x110
[  605.567879]  ? ata_sas_port_resume+0x30/0x30 [libata]
[  605.567917]  rpm_callback+0x35/0x70
[  605.567925]  rpm_resume+0x513/0x760
[  605.567931]  ? _raw_read_lock_irqsave+0x28/0x50
[  605.567938]  ? _raw_read_unlock_irqrestore+0x2a/0x40
[  605.567944]  ? ep_poll_callback+0x269/0x2d0
[  605.567955]  rpm_resume+0x255/0x760
[  605.567962]  ? __slab_free+0xc7/0x320
[  605.567972]  rpm_resume+0x255/0x760
[  605.567978]  ? kernfs_should_drain_open_files+0x38/0x50
[  605.567989]  rpm_resume+0x255/0x760
[  605.567994]  ? kernfs_should_drain_open_files+0x38/0x50
[  605.568002]  ? kernfs_drain+0xec/0x120
[  605.568010]  __pm_runtime_resume+0x4e/0x80
[  605.568018]  device_release_driver_internal+0xa8/0x200
[  605.568028]  bus_remove_device+0xc0/0x120
[  605.568035]  device_del+0x158/0x3d0
[  605.568045]  ? mutex_lock+0x12/0x30
[  605.568051]  __scsi_remove_device+0x12b/0x180 [scsi_mod]
[  605.568095]  sdev_store_delete+0x6a/0xd0 [scsi_mod]
[  605.568132]  kernfs_fop_write_iter+0x129/0x1c0
[  605.568141]  vfs_write+0x2d3/0x3f0
[  605.568155]  ksys_write+0x63/0xe0
[  605.568165]  do_syscall_64+0x5a/0xb0
[  605.568173]  ? syscall_exit_to_user_mode+0x2b/0x40
[  605.568178]  ? do_syscall_64+0x67/0xb0
[  605.568184]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  605.568193] RIP: 0033:0x7f0c1e9b6473
[  605.568200] RSP: 002b:00007ffe01b0bd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  605.568208] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f0c1e9b6473
[  605.568213] RDX: 0000000000000002 RSI: 000056115e4b86f0 RDI: 0000000000000001
[  605.568216] RBP: 000056115e4b86f0 R08: 000000000000000a R09: 00007f0c1ea5a0c0
[  605.568220] R10: 00007f0c1ea59fc0 R11: 0000000000000246 R12: 0000000000000002
[  605.568224] R13: 00007f0c1ea9a6a0 R14: 0000000000000002 R15: 00007f0c1ea95880
[  605.568232]  </TASK>
Damien Le Moal Oct. 23, 2023, 5:51 a.m. UTC | #12
On 10/21/23 06:23, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>> On my system, I see:
>>
>> cat /sys/class/ata_port/ata1/power/runtime_active_kids
>> 0
> 
> I see a 1 there, which is the single scsi_host.  The scsi_host has 2
> active kids; the two disks.  When I enabled runtime pm, only when the
> second disk was suspended did that allow the scsi_host to suspend, which
> then allowed the port to suspend.  Everything looked fine there so far.
> Then I tried:
> 
> echo 1 > /sys/block/sdf/device/delete
> 
> And the SCSI EH appears to have tried to wake up the disk, and hung in
> the process.
> 
> [  314.246282] sd 7:0:0:0: [sde] Synchronizing SCSI cache
> [  314.246445] sd 7:0:0:0: [sde] Stopping disk
> 
> First disk suspends.
> 
> [  388.518295] sd 7:1:0:0: [sdf] Synchronizing SCSI cache
> [  388.518519] sd 7:1:0:0: [sdf] Stopping disk
> 
> Second disk suspends some time later.
> 
> [  388.930428] ata8.00: Entering standby power mode
> [  389.330651] ata8.01: Entering standby power mode
> 
> That allowed the port to suspend.  This is when I tried to detach the
> disk driver, which I think tried to resume the disk before detaching,
> which resumed the port.
> 
> [  467.511878] ata8.15: SATA link down (SStatus 0 SControl 310)
> [  468.142726] ata8.15: failed to read PMP GSCR[0] (Emask=0x100)
> [  468.142741] ata8.15: PMP revalidation failed (errno=-5)
> 
> I ran hdparm -C on the other disk at this point.  I just noticed that
> the ata8.15 that represents the PMP itself was NOT suspended along with
> the two drive links, and then maybe was not resumed before trying to
> revalidate the PMP?  And that's why it failed?
> 
> [  473.172792] ata8.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [  473.486860] ata8.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [  473.802139] ata8.01: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> 
> It seems like it ended up recovering here though?  And yet the scsi_eh
> remained hung, as did the hdparm -C:
> 
> [  605.566814] INFO: task scsi_eh_7:173 blocked for more than 120 seconds.
> [  605.566829]       Not tainted 6.6.0-rc5+ #5
> [  605.566834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  605.566838] task:scsi_eh_7       state:D stack:0     pid:173   ppid:2      flags:0x00004000
> [  605.566850] Call Trace:
> [  605.566853]  <TASK>
> [  605.566860]  __schedule+0x37c/0xb70
> [  605.566878]  schedule+0x61/0xd0
> [  605.566888]  rpm_resume+0x156/0x760

Looks like a deadlock somewhere, likely with the device remove that you
triggered with the "echo 1 > /sys/block/sdf/device/delete".

Can you send the exact list of commands & events you executed to get to that
point ? Also please share your kernel config.
Phillip Susi Oct. 26, 2023, 9:21 p.m. UTC | #13
Damien Le Moal <dlemoal@kernel.org> writes:

> On 10/21/23 06:23, Phillip Susi wrote:
> Looks like a deadlock somewhere, likely with the device remove that you
> triggered with the "echo 1 > /sys/block/sdf/device/delete".
>
> Can you send the exact list of commands & events you executed to get to that
> point ? Also please share your kernel config.

I wrote auto to /sys/block/sd[ef]/device/power/config and 10000 to
/sys/block/sd[ef]/device/power/auto_suspend_delay_ms, and auto to the
control file of their corresponding ata8 port ( both drives are behind
an PMP in an eSATA dock on ata8 ).  As I said before, it the dmesg
showed that the ata port only suspended after BOTH drives had suspended,
which is as it should be.  The problem seems to be after I echo 1 >
/sys/block/sdf/device/delete, when this happens:

Oct 26 16:43:00 faldara kernel: ata8.15: failed to read PMP GSCR[0] (Emask=0x100)
Oct 26 16:43:00 faldara kernel: ata8.15: PMP revalidation failed (errno=-5)
Oct 26 16:43:05 faldara kernel: ata8.15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 26 16:43:05 faldara kernel: ata8.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct 26 16:43:05 faldara kernel: ata8.01: SATA link up 3.0 Gbps (SStatus
123 SControl 320)

Then I get the hung task.  I reverted the PuiS patch that I have been
working on, and the hang no longer happens, however, the above errors do
still happen.  I think that is a problem in itself, and may or may not
be related to the hang.  I see no reason why the PMP revalidation should
fail after the two disks and the port are runtime pm suspended, but for
some reason, with my patch applied, that then leads to the hang.

Here is my git log showing your two patches applied on the upstream
kernel, plus my patch:
commit bb5db8bb505171fd2b8b67c3f22b16d8355d2a8b
Author: Phillip Susi <phill@thesusis.net>
Date:   Mon Oct 16 17:03:51 2023 -0400

    Olibata: don't start disks on resume
    
    Disks with Power Up In Standby enabled that required the
    SET FEATURES command to start up were being issued the
    command during resume.  Suppress this until the disk
    is actually accessed.

commit 4f1a1a9da4ff833417fa520d097b3f07c20e3c5d
Author: Damien Le Moal <dlemoal@kernel.org>
Date:   Thu Oct 12 16:18:00 2023 +0900

    [PATCH 2/2] ata: libata-core: Improve ata_dev_power_set_active()
    
    Improve the function ata_dev_power_set_active() by having it do nothing
    for a disk that is already in the active power state. To do that,
    introduce the function ata_dev_power_is_active() to test the current
    power state of the disk and return true if the disk is in the PM0:
    active or PM1: idle state (0xff value for the count field of the CHECK
    POWER MODE command output).
    
    To preserve the existing behavior, if the CHECK POWER MODE command
    issued in ata_dev_power_is_active() fails, the drive is assumed to be in
    standby mode and false is returned.
    
    With this change, issuing the VERIFY command to access the disk media to
    spin it up becomes unnecessary most of the time during system resume as
    the port reset done by libata-eh on resume often result in the drive to
    spin-up (this behavior is not clearly defined by the ACS specifications
    and may thus vary between disk models).
    
    Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

commit bb7e1fcd9e3a207820e4b828e9f4f02986942d8d
Author: Damien Le Moal <dlemoal@kernel.org>
Date:   Thu Oct 12 16:17:59 2023 +0900

    ata: libata-eh: Spinup disk on resume after revalidation
    
    Move the call to ata_dev_power_set_active() to transition a disk in
    standby power mode to the active power mode from
    ata_eh_revalidate_and_attach() before doing revalidation to the end of
    ata_eh_recover(), after the link speed for the device is reconfigured
    (if that was necessary). This is safer as this ensure that the VERIFY
    command executed to spinup the disk is executed with the drive properly
    reconfigured first.
    
    Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

commit 727fb83765049981e342db4c5a8b51aca72201d8
Merge: 8cb1f10d8c4b 5c15c60e7be6
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Oct 13 23:19:16 2023 -0700

    Merge tag 'input-for-v6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
    
    Pull input fixes from Dmitry Torokhov:
    
     - a reworked way for handling reset delay on SMBus-connected Synaptics
       touchpads (the original one, while being correct, uncovered an old
       bug in fallback to PS/2 code that was fixed separately; the new one
       however avoids having delay in serio port "fast" resume, and instead
       has the wait in the RMI4 code)
    
     - a fix for potential crashes when devices with Elan controllers (and
       Synaptics) fall back to PS/2 code. Can't be hit without the original
       patch above, but still good to have it fixed
    
     - a couple new device IDs in xpad Xbox driver
    
     - another quirk for Goodix driver to deal with stuff vendors put in
       ACPI tables
    
     - a fix for use-after-free on disconnect for powermate driver
    
     - a quirk to not initialize PS/2 mouse port on Fujitsu Lifebook E5411
       laptop as it makes keyboard not usable and the device uses
       hid-over-i2c touchpad anyways
    
    * tag 'input-for-v6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
      Input: powermate - fix use-after-free in powermate_config_complete
      Input: xpad - add PXN V900 support
      Input: synaptics-rmi4 - handle reset delay when using SMBus trsnsport
      Input: psmouse - fix fast_reconnect function for PS/2 mode
      Revert "Input: psmouse - add delay when deactivating for SMBus mode"
      Input: goodix - ensure int GPIO is in input for gpio_count == 1 && gpio_int_idx == 0 case
      Input: i8042 - add Fujitsu Lifebook E5411 to i8042 quirk table
      Input: xpad - add HyperX Clutch Gladiate Support
And here is my patch, which basically checks for PuiS during system
resume, and either forces the disk to suspend or resume depending on
whether it is PuiS.  Since I have not even engaged in any system
suspend/resume for this test, this patch should not have any effect.

So far in my testing of this patch on my 3 internal drives that I have
enabled PuiS on, it appears to work in so much as after a
suspend/resume, the runtime status of the disks is suspended ( as long
as I have *enabled* runtime pm on them, otherwise not ).  Another
problem seems to be that while the disks are left suspended after system
resume, they very quickly are resumed due to begnign IO eminating from
either ext4 or udsisks2 that does not cause a drive to spin up when it
is suspended with hdparm -y.  This would be the case of either FLUSH
CASH or CHECK POWER CONDITION not causing the drive to spin itself up
when given the commands, but the runtime pm core does not know that
these commands can be completed without resuming the disk, so it resumes
them first.
diff mbox series

Patch

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 8b43290ca2cd..73428ad0c8d2 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1056,7 +1056,8 @@  int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
 		 * will be woken up by ata_port_pm_resume() with a port reset
 		 * and device revalidation.
 		 */
-		sdev->manage_start_stop = 1;
+		sdev->manage_system_start_stop = true;
+		sdev->manage_runtime_start_stop = true;
 		sdev->no_start_on_resume = 1;
 	}
 
diff --git a/drivers/firewire/sbp2.c b/drivers/firewire/sbp2.c
index 26db5b8dfc1e..749868b9e80d 100644
--- a/drivers/firewire/sbp2.c
+++ b/drivers/firewire/sbp2.c
@@ -81,7 +81,8 @@  MODULE_PARM_DESC(exclusive_login, "Exclusive login to sbp2 device "
  *
  * - power condition
  *   Set the power condition field in the START STOP UNIT commands sent by
- *   sd_mod on suspend, resume, and shutdown (if manage_start_stop is on).
+ *   sd_mod on suspend, resume, and shutdown (if manage_system_start_stop or
+ *   manage_runtime_start_stop is on).
  *   Some disks need this to spin down or to resume properly.
  *
  * - override internal blacklist
@@ -1517,8 +1518,10 @@  static int sbp2_scsi_slave_configure(struct scsi_device *sdev)
 
 	sdev->use_10_for_rw = 1;
 
-	if (sbp2_param_exclusive_login)
-		sdev->manage_start_stop = 1;
+	if (sbp2_param_exclusive_login) {
+		sdev->manage_system_start_stop = true;
+		sdev->manage_runtime_start_stop = true;
+	}
 
 	if (sdev->type == TYPE_ROM)
 		sdev->use_10_for_ms = 1;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c92a317ba547..a1ef4eef904f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -201,18 +201,32 @@  cache_type_store(struct device *dev, struct device_attribute *attr,
 }
 
 static ssize_t
-manage_start_stop_show(struct device *dev, struct device_attribute *attr,
-		       char *buf)
+manage_start_stop_show(struct device *dev,
+		       struct device_attribute *attr, char *buf)
 {
 	struct scsi_disk *sdkp = to_scsi_disk(dev);
 	struct scsi_device *sdp = sdkp->device;
 
-	return sprintf(buf, "%u\n", sdp->manage_start_stop);
+	return sprintf(buf, "%u\n",
+		       sdp->manage_system_start_stop &&
+		       sdp->manage_runtime_start_stop);
 }
+static DEVICE_ATTR_RO(manage_start_stop);
 
 static ssize_t
-manage_start_stop_store(struct device *dev, struct device_attribute *attr,
-			const char *buf, size_t count)
+manage_system_start_stop_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+	struct scsi_device *sdp = sdkp->device;
+
+	return sprintf(buf, "%u\n", sdp->manage_system_start_stop);
+}
+
+static ssize_t
+manage_system_start_stop_store(struct device *dev,
+			       struct device_attribute *attr,
+			       const char *buf, size_t count)
 {
 	struct scsi_disk *sdkp = to_scsi_disk(dev);
 	struct scsi_device *sdp = sdkp->device;
@@ -224,11 +238,42 @@  manage_start_stop_store(struct device *dev, struct device_attribute *attr,
 	if (kstrtobool(buf, &v))
 		return -EINVAL;
 
-	sdp->manage_start_stop = v;
+	sdp->manage_system_start_stop = v;
 
 	return count;
 }
-static DEVICE_ATTR_RW(manage_start_stop);
+static DEVICE_ATTR_RW(manage_system_start_stop);
+
+static ssize_t
+manage_runtime_start_stop_show(struct device *dev,
+			       struct device_attribute *attr, char *buf)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+	struct scsi_device *sdp = sdkp->device;
+
+	return sprintf(buf, "%u\n", sdp->manage_runtime_start_stop);
+}
+
+static ssize_t
+manage_runtime_start_stop_store(struct device *dev,
+				struct device_attribute *attr,
+				const char *buf, size_t count)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+	struct scsi_device *sdp = sdkp->device;
+	bool v;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (kstrtobool(buf, &v))
+		return -EINVAL;
+
+	sdp->manage_runtime_start_stop = v;
+
+	return count;
+}
+static DEVICE_ATTR_RW(manage_runtime_start_stop);
 
 static ssize_t
 allow_restart_show(struct device *dev, struct device_attribute *attr, char *buf)
@@ -560,6 +605,8 @@  static struct attribute *sd_disk_attrs[] = {
 	&dev_attr_FUA.attr,
 	&dev_attr_allow_restart.attr,
 	&dev_attr_manage_start_stop.attr,
+	&dev_attr_manage_system_start_stop.attr,
+	&dev_attr_manage_runtime_start_stop.attr,
 	&dev_attr_protection_type.attr,
 	&dev_attr_protection_mode.attr,
 	&dev_attr_app_tag_own.attr,
@@ -3771,13 +3818,20 @@  static void sd_shutdown(struct device *dev)
 		sd_sync_cache(sdkp, NULL);
 	}
 
-	if (system_state != SYSTEM_RESTART && sdkp->device->manage_start_stop) {
+	if (system_state != SYSTEM_RESTART &&
+	    sdkp->device->manage_system_start_stop) {
 		sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n");
 		sd_start_stop_device(sdkp, 0);
 	}
 }
 
-static int sd_suspend_common(struct device *dev, bool ignore_stop_errors)
+static inline bool sd_do_start_stop(struct scsi_device *sdev, bool runtime)
+{
+	return (sdev->manage_system_start_stop && !runtime) ||
+		(sdev->manage_runtime_start_stop && runtime);
+}
+
+static int sd_suspend_common(struct device *dev, bool runtime)
 {
 	struct scsi_disk *sdkp = dev_get_drvdata(dev);
 	struct scsi_sense_hdr sshdr;
@@ -3809,12 +3863,12 @@  static int sd_suspend_common(struct device *dev, bool ignore_stop_errors)
 		}
 	}
 
-	if (sdkp->device->manage_start_stop) {
+	if (sd_do_start_stop(sdkp->device, runtime)) {
 		if (!sdkp->device->silence_suspend)
 			sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n");
 		/* an error is not worth aborting a system sleep */
 		ret = sd_start_stop_device(sdkp, 0);
-		if (ignore_stop_errors)
+		if (!runtime)
 			ret = 0;
 	}
 
@@ -3826,23 +3880,23 @@  static int sd_suspend_system(struct device *dev)
 	if (pm_runtime_suspended(dev))
 		return 0;
 
-	return sd_suspend_common(dev, true);
+	return sd_suspend_common(dev, false);
 }
 
 static int sd_suspend_runtime(struct device *dev)
 {
-	return sd_suspend_common(dev, false);
+	return sd_suspend_common(dev, true);
 }
 
-static int sd_resume(struct device *dev)
+static int sd_resume(struct device *dev, bool runtime)
 {
 	struct scsi_disk *sdkp = dev_get_drvdata(dev);
-	int ret = 0;
+	int ret;
 
 	if (!sdkp)	/* E.g.: runtime resume at the start of sd_probe() */
 		return 0;
 
-	if (!sdkp->device->manage_start_stop)
+	if (!sd_do_start_stop(sdkp->device, runtime))
 		return 0;
 
 	if (!sdkp->device->no_start_on_resume) {
@@ -3860,7 +3914,7 @@  static int sd_resume_system(struct device *dev)
 	if (pm_runtime_suspended(dev))
 		return 0;
 
-	return sd_resume(dev);
+	return sd_resume(dev, false);
 }
 
 static int sd_resume_runtime(struct device *dev)
@@ -3887,7 +3941,7 @@  static int sd_resume_runtime(struct device *dev)
 				  "Failed to clear sense data\n");
 	}
 
-	return sd_resume(dev);
+	return sd_resume(dev, true);
 }
 
 static const struct dev_pm_ops sd_pm_ops = {
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index b9230b6add04..fd41fdac0a8e 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -161,6 +161,10 @@  struct scsi_device {
 				 * pass settings from slave_alloc to scsi
 				 * core. */
 	unsigned int eh_timeout; /* Error handling timeout */
+
+	bool manage_system_start_stop; /* Let HLD (sd) manage system start/stop */
+	bool manage_runtime_start_stop; /* Let HLD (sd) manage runtime start/stop */
+
 	unsigned removable:1;
 	unsigned changed:1;	/* Data invalid due to media change */
 	unsigned busy:1;	/* Used to prevent races */
@@ -193,7 +197,6 @@  struct scsi_device {
 	unsigned use_192_bytes_for_3f:1; /* ask for 192 bytes from page 0x3f */
 	unsigned no_start_on_add:1;	/* do not issue start on add */
 	unsigned allow_restart:1; /* issue START_UNIT in error handler */
-	unsigned manage_start_stop:1;	/* Let HLD (sd) manage start/stop */
 	unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */
 	unsigned start_stop_pwr_cond:1;	/* Set power cond. in START_STOP_UNIT */
 	unsigned no_uld_attach:1; /* disable connecting to upper level drivers */