[v8,00/23] Fix libata suspend/resume handling and code cleanup

Message ID 20230927141828.90288-1-dlemoal@kernel.org

Message

Damien Le Moal Sept. 27, 2023, 2:18 p.m. UTC
The first 9 patches of this series fix several issues with suspend/resume
power management operations in scsi and libata. The most significant
changes are in patches 4 and 5, where the manage_start_stop flag of scsi
devices is split into the manage_system_start_stop and
manage_runtime_start_stop flags. This allows keeping scsi runtime power
operations for spinning up/down ATA devices while having libata do its own
system suspend/resume device power state management using EH.
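
For testers, a quick way to see the effect of this split from user space
(a rough illustration only; the attribute names follow patches 4 and 5, and
the 0:0:0:0 scsi_disk address is just an example):

  # Each scsi disk now exposes two flags in place of the single
  # manage_start_stop attribute:
  cat /sys/class/scsi_disk/0:0:0:0/manage_system_start_stop
  cat /sys/class/scsi_disk/0:0:0:0/manage_runtime_start_stop
  # For ATA disks, only the runtime flag is expected to be set, since
  # libata now handles system suspend/resume power state changes via EH.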

The remaining patches are code cleanups that do not introduce any
significant functional changes.

This series was tested on qemu and on various PCs and servers. I am
CC-ing people who recently reported issues with suspend/resume.
Additional testing would be much appreciated.

Changes from v7:
 * Fixed patch 2 commit message incorrect mention of WARN_ON() removal
 * Changed patch 4 to use the bool type for the new manage_xxx_start_stop
   flags
 * Changed patch 12 to keep the early return for the case where start/stop
   management on resume is not necessary.
 * Added review tags

Changes from v6:
 * Changed patch 9 to use a bool for the new suspended flag.

Changes from v5:
 * Typo and style corrections in patch 4 commit message
 * Changed patch 9 to use a new flag to track a disk suspended state
   instead of using the scsi device state
 * Added review tags

Changes from v4:
 * Removed the ata_scsi_dev_alloc() function in patch 3, coding it directly
   in ata_scsi_slave_alloc()
 * Corrected a typo in the patch 19 commit message
 * Added Tested-by and Reviewed-by tags

Changes from v3:
 * Corrected patch 1 (typo in commit message and WARN_ON() removal)
 * Changed patch 3 as suggested by Niklas (moved the definition of
   ->slave_alloc)
 * Rebased on rc2
 * Added review tags

Changes from v2:
 * Added patch 4, as simply disabling manage_start_stop from libata was
   breaking individual disk runtime suspend/autosuspend. Patch 5 was
   reworked according to the changes in patch 4.
 * Fixed patch 3: the link creation was not being applied and the link
   creation itself was also incorrect, preventing sd probe from executing
   correctly. Thanks to Geert for testing and reporting this issue.
 * Split the "Fix delayed scsi_rescan_device() execution" patch into
   patch 6 (scsi part) and patch 7 (ata part).
 * Modified patch 9 to not call sd_shutdown() from sd_remove() for
   devices that are not running.
 * Added Chia-Lin's Tested-by tag to unchanged patches

Changes from v1:
 * Added patch 8 and 9 to fix compilation warnings with W=1
 * Addressed John's comment on patch 19
 * Fixed patch 20 commit message (Sergei)
 * Added Hannes' Reviewed-by tag

Damien Le Moal (23):
  ata: libata-core: Fix ata_port_request_pm() locking
  ata: libata-core: Fix port and device removal
  ata: libata-scsi: link ata port and scsi device
  scsi: sd: Differentiate system and runtime start/stop management
  ata: libata-scsi: Disable scsi device manage_system_start_stop
  scsi: Do not attempt to rescan suspended devices
  ata: libata-scsi: Fix delayed scsi_rescan_device() execution
  ata: libata-core: Do not register PM operations for SAS ports
  scsi: sd: Do not issue commands to suspended disks on shutdown
  ata: libata-core: Fix compilation warning in ata_dev_config_ncq()
  ata: libata-eh: Fix compilation warning in ata_eh_link_report()
  scsi: Remove scsi device no_start_on_resume flag
  ata: libata-scsi: Cleanup ata_scsi_start_stop_xlat()
  ata: libata-core: Synchronize ata_port_detach() with hotplug
  ata: libata-core: Detach a port devices on shutdown
  ata: libata-core: Remove ata_port_suspend_async()
  ata: libata-core: Remove ata_port_resume_async()
  ata: libata-core: Do not poweroff runtime suspended ports
  ata: libata-core: Do not resume runtime suspended ports
  ata: libata-sata: Improve ata_sas_slave_configure()
  ata: libata-eh: Improve reset error messages
  ata: libata-eh: Reduce "disable device" message verbosity
  ata: libata: Cleanup inline DMA helper functions

 drivers/ata/libata-core.c      | 242 +++++++++++++++++++++++++--------
 drivers/ata/libata-eh.c        |  76 +++++++++--
 drivers/ata/libata-sata.c      |   5 +-
 drivers/ata/libata-scsi.c      | 142 ++++++++++---------
 drivers/ata/libata-transport.c |   9 +-
 drivers/ata/libata.h           |   6 +
 drivers/firewire/sbp2.c        |   9 +-
 drivers/scsi/scsi_scan.c       |  18 ++-
 drivers/scsi/sd.c              | 102 +++++++++++---
 drivers/scsi/sd.h              |   1 +
 include/linux/libata.h         |  26 ++--
 include/scsi/scsi_device.h     |   4 +-
 include/scsi/scsi_host.h       |   2 +-
 13 files changed, 457 insertions(+), 185 deletions(-)

Comments

Martin K. Petersen Sept. 27, 2023, 7:51 p.m. UTC | #1
Damien,

> scsi_rescan_device() takes a scsi device lock before executing a
> device handler and device driver rescan methods. Waiting for the
> completion of any command issued to the device by these methods will
> thus be done with the device lock held. As a result, there is a risk
> of deadlocking within the power management code if
> scsi_rescan_device() is called to handle a device resume with the
> associated scsi device not yet resumed.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Phillip Susi Oct. 2, 2023, 11:39 p.m. UTC | #2
I went to test these patches tonight and it looks like Linus already
merged them ( or mostly? ).  I enabled runtime pm on the scsi target and
the ata port, and the disk spins down and the port does too.
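
(Roughly, via the usual sysfs knobs; sda and the ata1 device path below are
only examples, adjust for your system:)

  echo 5000 > /sys/block/sda/device/power/autosuspend_delay_ms
  echo auto > /sys/block/sda/device/power/control
  echo auto > /sys/devices/pci0000:00/0000:00:17.0/ata1/power/control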

I noticed though, that when entering system suspend, a disk that has
already been runtime suspended is resumed only to immediately be
suspended again before the system suspend.  That shouldn't happen should
it?
Phillip Susi Oct. 3, 2023, 12:27 a.m. UTC | #3
Phillip Susi <phill@thesusis.net> writes:

> I noticed though, that when entering system suspend, a disk that has
> already been runtime suspended is resumed only to immediately be
> suspended again before the system suspend.  That shouldn't happen should
> it?

It seems that it is /sys/power/sync_on_suspend.  The problem went away
when I unmounted the disk, or I could make the disk wake up by running
sync.  I thought that it used to be that as long as you mounted an ext4
filesystem with -o relatime, it wouldn't keep dirtying the cache as long
as you weren't actually writing to the filesystem.  Either I'm
misremembering something, or this is no longer the case.  Also if there
are dirty pages in the cache, I thought the default was for them to be
written out after 5 seconds, which would keep the disk awake, rather
than waiting until system suspend to sync.

I guess I could mount the filesystem readonly.  It's probably not a good
idea to disable sync_on_suspend for the whole system.
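
(For reference, the knobs involved can be checked like this; the values
noted in the comments are the usual defaults but depend on kernel
configuration:)

  cat /proc/sys/vm/dirty_writeback_centisecs  # flusher wakeup interval, typically 500 (5 s)
  cat /proc/sys/vm/dirty_expire_centisecs     # age before dirty data is written, typically 3000 (30 s)
  cat /sys/power/sync_on_suspend              # 1 = sync filesystems before system suspend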
Damien Le Moal Oct. 3, 2023, 12:32 a.m. UTC | #4
On 10/3/23 08:39, Phillip Susi wrote:
> 
> I went to test these patches tonight and it looks like Linus already
> merged them ( or mostly? ).  I enabled runtime pm on the scsi target and
> the ata port, and the disk spins down and the port does too.
> 
> I noticed though, that when entering system suspend, a disk that has
> already been runtime suspended is resumed only to immediately be
> suspended again before the system suspend.  That shouldn't happen should
> it?

Indeed. Will look at this. Geert also reported seeing an issue with resume (one
time only, so I suspect there is still a race with libata-eh). So it looks like
something is still missing.
Damien Le Moal Oct. 3, 2023, 12:44 a.m. UTC | #5
On 10/3/23 09:27, Phillip Susi wrote:
> 
> Phillip Susi <phill@thesusis.net> writes:
> 
>> I noticed though, that when entering system suspend, a disk that has
>> already been runtime suspended is resumed only to immediately be
>> suspended again before the system suspend.  That shouldn't happen should
>> it?
> 
> It seems that it is /sys/power/sync_on_suspend.  The problem went away
> when I unmounted the disk, or I could make the disk wake up by running
> sync.  I thought that it used to be that as long as you mounted an ext4
> filesystem with -o relatime, it wouldn't keep dirtying the cache as long
> as you weren't actually writing to the filesystem.  Either I'm
> misremembering something, or this is no longer the case.  Also if there
> are dirty pages in the cache, I thought the default was for them to be
> written out after 5 seconds, which would keep the disk awake, rather
> than waiting until system suspend to sync.
> 
> I guess I could mount the filesystem readonly.  It's probably not a good
> idea to disable sync_on_suspend for the whole system.

Hmmm... So this could be the fs suspend then, which issues a sync but the device
is already suspended and was synced already. In that case, we should turn that
sync into a nop to not wake up the drive unnecessarily. The fix may be needed on
the scsi sd side rather than in libata.
Let me check.
Phillip Susi Oct. 3, 2023, 9:22 p.m. UTC | #6
On Tue, Oct 03, 2023 at 09:44:50AM +0900, Damien Le Moal wrote:
> Hmmm... So this could be the fs suspend then, which issues a sync but the device
> is already suspended and was synced already. In that case, we should turn that
> sync into a nop to not wake up the drive unnecessarily. The fix may be needed on
> the scsi sd side rather than in libata.

I did some tracing today on a test ext4 fs I created on a loopback device, and it
seems that the superblocks are written every time you sync, even if no files on the
filesystem have even been opened for read access.
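
(Roughly, the loopback test was along these lines; the image file and mount
point are arbitrary:)

  truncate -s 1G /tmp/ext4.img
  losetup -f --show /tmp/ext4.img             # prints the loop device, e.g. /dev/loop0
  mkfs.ext4 -q /dev/loop0
  mount /dev/loop0 /mnt
  blktrace -d /dev/loop0 -o - | blkparse -i - &
  sync                                        # superblock writes show up in the trace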
Damien Le Moal Oct. 3, 2023, 11:46 p.m. UTC | #7
On 10/4/23 06:22, Phillip Susi wrote:
> On Tue, Oct 03, 2023 at 09:44:50AM +0900, Damien Le Moal wrote:
>> Hmmm... So this could be the fs suspend then, which issues a sync but the device
>> is already suspended and was synced already. In that case, we should turn that
>> sync into a nop to not wake up the drive unnecessarily. The fix may be needed on
>> the scsi sd side rather than in libata.
> 
> I did some tracing today on a test ext4 fs I created on a loopback device, and it
> seems that the superblocks are written every time you sync, even if no files on the
> filesystem have even been opened for read access.

OK. So a fix would need to be on the FS side then if one wants to avoid that
useless resume. However, this may clash with the FS need to record stuff in its
sb and so we may not be able to avoid that.
Phillip Susi Oct. 4, 2023, 9:01 p.m. UTC | #8
Damien Le Moal <dlemoal@kernel.org> writes:

>> I did some tracing today on a test ext4 fs I created on a loopback device, and it
>> seems that the superblocks are written every time you sync, even if no files on the
>> filesystem have even been opened for read access.
>
> OK. So a fix would need to be on the FS side then if one wants to avoid that
> useless resume. However, this may clash with the FS need to record stuff in its
> sb and so we may not be able to avoid that.

Ok, this is very strange.  I went back to my distro kernel, without
runtime pm, mounted the filesystems rw again, used hdparm -y to suspend
the disk, verified with hdparm -C that they were in suspend, and
suspended the system.  In dmesg I see:

Filesystems sync: 0.007 seconds

Now, if it were writing the superblocks to the disk there, I would
expect that to take more like 3 seconds while it woke the disks back up,
like it did when I was testing the latest kernel with runtime pm.

Another odd thing I noticed with the runtime pm was that sometimes the
drives would randomly start up even though I was not accessing them.
This never happens when I am normally using the debian kernel with no
runtime pm and just running hdparm -y to put the drives to sleep.  I can
check them hours later and they are still in standby.

I just tried running sync and blktrace and it looks like it is writing
the superblock to the drive, and yet, hdparm -C still says it is in
standby.  This makes no sense.  Here is what blktrace said when I ran
sync:

  8,0    0        1     0.000000000 34004  Q FWS [sync]
  8,0    0        2     0.000001335 34004  G FWS [sync]
  8,0    0        3     0.000004327 31088  D  FN [kworker/0:2H]
  8,0    0        4     0.000068945     0  C  FN 0 [0]
  8,0    0        5     0.000069466     0  C  WS 0 [0]

I just noticed that this trace doesn't show the 0+8 that I saw when I
was testing running sync with a fresh, empty ext4 filesystem on a loop
device.  That showed 0+8 indicating the first 4k block of the disk, as
well as 1023+8, and one or two more offsets that I thought were the
backup superblocks.

What the heck is this sync actually writing, and why does it not cause
the disk to take itself out of standby, but with runtime pm, it does?
Could this just be a FLUSH of some sort, which when the disk is in
standby, it ignores, but the kernel runtime pm decides it must wake the
disk up before dispatching the command, even though it is useless?
Damien Le Moal Oct. 4, 2023, 10:33 p.m. UTC | #9
On 10/5/23 06:01, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>>> I did some tracing today on a test ext4 fs I created on a loopback device, and it
>>> seems that the superblocks are written every time you sync, even if no files on the
>>> filesystem have even been opened for read access.
>>
>> OK. So a fix would need to be on the FS side then if one wants to avoid that
>> useless resume. However, this may clash with the FS need to record stuff in its
>> sb and so we may not be able to avoid that.
> 
> Ok, this is very strange.  I went back to my distro kernel, without
> runtime pm, mounted the filesystems rw again, used hdparm -y to suspend
> the disk, verified with hdparm -C that they were in suspend, and
> suspended the system.  In dmesg I see:
> 
> Filesystems sync: 0.007 seconds
> 
> Now, if it were writing the superblocks to the disk there, I would
> expect that to take more like 3 seconds while it woke the disks back up,
> like it did when I was testing the latest kernel with runtime pm.

Hmm... Maybe there was nothing to sync: hdparm -y putting the drive in standby
mode should have synced the write cache already, and the FS-issued sync may have
ended up not causing any media write, as long as the FS did not issue any new
write (which would have spun up the drive). The specs are clear about this:

In STANDBY IMMEDIATE description:

Processing a STANDBY IMMEDIATE command shall cause the device to prepare for a
power cycle (e.g., flush volatile write cache) prior to returning command
completion.

So all of this does not seem that strange to me.

> Another odd thing I noticed with the runtime pm was that sometimes the
> drives would randomly start up even though I was not accessing them.

Some random access from user space, e.g. systemd doing its periodic "something"
with passthrough commands?

> This never happens when I am normally using the debian kernel with no
> runtime pm and just running hdparm -y to put the drives to sleep.  I can
> check them hours later and they are still in standby.

Same user space in that case ?

> 
> I just tried running sync and blktrace and it looks like it is writing
> the superblock to the drive, and yet, hdparm -C still says it is in
> standby.  This makes no sense.  Here is what blktrace said when I ran
> sync:
> 
>   8,0    0        1     0.000000000 34004  Q FWS [sync]
>   8,0    0        2     0.000001335 34004  G FWS [sync]
>   8,0    0        3     0.000004327 31088  D  FN [kworker/0:2H]
>   8,0    0        4     0.000068945     0  C  FN 0 [0]
>   8,0    0        5     0.000069466     0  C  WS 0 [0]
> 
> I just noticed that this trace doesn't show the 0+8 that I saw when I
> was testing running sync with a fresh, empty ext4 filesystem on a loop
> device.  That showed 0+8 indicating the first 4k block of the disk, as
> well as 1023+8, and one or two more offsets that I thought were the
> backup superblocks.

Then as mentioned above, nothing may be written, which results in the drive not
waking up since the write cache is clean already (synced already before spin down).

> What the heck is this sync actually writing, and why does it not cause
> the disk to take itself out of standby, but with runtime pm, it does?

Not sure. But it may be a write FUA vs actual sync. With runtime pm, any command
issued to the disk while it is suspended will cause a call to pm runtime resume
which issues a verify command to spin up the drive, regardless of whether the command
issued by the user needs the drive to spin up. So that is normal. With hdparm
-y, the driver thinks the drive is running and so does not issue that verify
command to the drive. The drive spinning up or not then depends on the command
being issued and the drive state (and also likely the drive model and FW
implementation... Some may be more intelligent than others in this area).

> Could this just be a FLUSH of some sort, which when the disk is in
> standby, it ignores, but the kernel runtime pm decides it must wake the
> disk up before dispatching the command, even though it is useless?

Given your description, that is my thinking exactly. The problem here for the
second part (spinning up the disk for "useless" commands) is that determining if
a command needs the drive to spin up or not is not an easy thing to do, and
potentially dangerous if mishandled. One possible micro-optimization would be to
ignore flush commands to suspended disks. But I am not sure that is a high-win change
besides *maybe* avoiding a spinup on system suspend with a drive already runtime
suspended.
Phillip Susi Oct. 5, 2023, 12:38 p.m. UTC | #10
Damien Le Moal <dlemoal@kernel.org> writes:

>> This never happens when I am normally using the debian kernel with no
>> runtime pm and just running hdparm -y to put the drives to sleep.  I can
>> check them hours later and they are still in standby.
>
> Same user space in that case ?

Yes.  I'll try to leave a blktrace running to see what causes the
spinup.  I suppose it could be another flush or other command that
doesn't require media access, but triggers runtime pm to spin up the disk.

> Given your description, that is my thinking exactly. The problem here for the
> second part (spinning up the disk for "useless" commands) is that determining if
> a command needs the drive to spinup or not is not an easy thing to do, and
> potentially dangerous if mishandled. One possible micro optimization would be to
> ignore flush commands to suspended disks. But not sure that is a high win change
> beside *may be* avoiding a spinup on system suspend witha drive already runtime
> suspended.

One of the things my patch series from a decade ago did was to use the
SLEEP flag in libata to decide to complete certain commands without
sending them to the drive so it could remain asleep.  I'm not sure if
it's even possible for the driver to evaluate the command before the pm
core orders a resume though.

I wonder if libata could leave the EH pending and return success from
the runtime resume, and then actually run the EH and wake up the drive
later, when actual IO is done.

On another note, I've been looking over your patches, and I still do not
understand why you added the VERIFY command.  The only effect it seems
to have is moving the delay while the drive spins up from the first real
IO to the resume path.  Why does that matter?