[v1,4/5] PM: sleep: Do breadth first suspend/resume for async suspend/resume

Message ID 20241114220921.2529905-5-saravanak@google.com
State New
Series Optimize async device suspend/resume

Commit Message

Saravana Kannan Nov. 14, 2024, 10:09 p.m. UTC
The dpm_list used for suspend/resume ensures that all superior devices
(parents and suppliers) precede subordinate devices (children and
consumers).

Current async resume logic:
-------------------------------
* For each resume phase (except the "complete" phase, which is always
  serialized), the resume logic first queues all async devices in the
  dpm_list. It then loops through the dpm_list again to resume the sync
  devices one by one.

* Async devices wait for all their superior devices to resume before
  starting their own resume operation.

* This process results in multiple sleep and wake-up cycles before an
  async device actually resumes. The sleeping also ties up kworker
  threads for long stretches, so the workqueue framework spins up more
  kworker threads to handle the other async devices.

* The end result is excessive thread creation, wake-ups, sleeps, and
  context switches for every async device. This overhead makes a full
  async resume (with all devices marked as async-capable) much slower
  than a synchronous resume.
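
In code terms (condensed from the unmodified dpm_resume() path visible in the
diff below; the noirq and early phases are analogous), the current resume flow
is roughly:

	struct device *dev;

	/* Queue every async-capable device up front... */
	list_for_each_entry(dev, &dpm_suspended_list, power.entry)
		dpm_async_fn(dev, async_resume);

	/*
	 * ...and each async_resume() work item then sleeps in
	 * dpm_wait_for_superior() until all of the device's parents and
	 * suppliers have completed, before running the device's own callbacks.
	 */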

Current async suspend logic:
--------------------------------
* The async suspend logic differs from the async resume logic. The
  suspend logic loops through the dpm_list. When it finds an async
  device, it queues the work and moves on. However, if it encounters a
  sync device, it waits until the sync device (and all its subordinate
  devices) have suspended before proceeding to the next device.
  Therefore, a device marked for async suspend can be left waiting on an
  unrelated device before it is even queued (the existing loop is sketched
  after this list).

* Once queued, an async device experiences the same inefficiencies as
  in the resume logic (thread creation, wake-ups, sleeps, and context
  switches).
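
The existing dpm_suspend() loop (again condensed from the diff below, with
locking, reference counting and error handling omitted) is roughly:

	while (!list_empty(&dpm_prepared_list)) {
		struct device *dev = to_device(dpm_prepared_list.prev);

		list_move(&dev->power.entry, &dpm_suspended_list);

		/* Async-capable device: queue the work and move on. */
		if (dpm_async_fn(dev, async_suspend))
			continue;

		/* Sync device: suspend it (and wait on its subordinates) here. */
		device_suspend(dev, state, false);
	}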

On a Pixel 6, averaging over 100 suspend/resume cycles, the data is as
follows:

+---------------------------+-----------+------------+----------+
| Phase                     | Full sync | Full async | % change |
+---------------------------+-----------+------------+----------+
| Total dpm_suspend*() time |    107 ms |      72 ms |     -33% |
+---------------------------+-----------+------------+----------+
| Total dpm_resume*() time  |     75 ms |      90 ms |     +20% |
+---------------------------+-----------+------------+----------+
| Sum                       |    182 ms |     162 ms |     -11% |
+---------------------------+-----------+------------+----------+

This shows that full async suspend/resume is not a viable option. It
makes the user-visible resume phase slower and only improves the
overall time by 11%.

To fix all this, this patch introduces new async suspend/resume
logic.

New suspend/resume logic:
-------------------------
* For each suspend/resume phase (except "complete" and "prepare,"
  which are always serialized), the logic first queues only the async
  devices that don't have to wait for any subordinates (for suspend)
  or superiors (for resume). It then loops through the dpm_list again
  to suspend/resume the sync devices one by one.

* When a device (sync or async) successfully suspends/resumes, it
  examines its superiors/subordinates and queues only the async
  devices that don't need to wait for any subordinates/superiors.
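
Condensed from the patch below, the suspend direction boils down to the
following (the resume direction mirrors it with superiors and subordinates
swapped; locking, reference counting and error handling are omitted here):

	/* Queue a device only if it's async and all its subordinates are done. */
	static int dpm_async_queue_suspend_ready_fn(struct device *dev, void *data)
	{
		if (!is_async(dev) || dpm_check_subordinate(dev))
			return 0;

		dpm_async_fn(dev, data);
		return 0;
	}

	/* Seed the async queue before walking the dpm_list for sync devices. */
	static void dpm_async_suspend_loop(struct list_head *from, async_func_t func)
	{
		struct device *dev;

		list_for_each_entry_reverse(dev, from, power.entry)
			dpm_async_queue_suspend_ready_fn(dev, func);
	}

	/* After each device suspends, queue any superiors that just became ready. */
	static void async_suspend(void *data, async_cookie_t cookie)
	{
		struct device *dev = data;

		device_suspend(dev, pm_transition, true);

		if (!async_error)
			dpm_for_each_superior(dev, async_suspend,
					      dpm_async_queue_suspend_ready_fn);
	}

Because a device is queued only once all of its dependencies have completed,
a queued work item can run to completion without ever sleeping on another
device.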

With this new logic:

* Queued async devices never have to wait on another device's
  completion and are always ready to perform their suspend/resume
  operation.

* The queue of async devices remains short.

* kworkers never sleep for extended periods, and the workqueue
  framework doesn't spin up many new threads to handle a backlog of
  async devices.

* The result is approximately NCPU kworker threads running in parallel
  without sleeping until all async devices finish.

On a Pixel 6, averaging over 100 suspend/resume cycles, the new logic
yields improved results:

+---------------------------+---------------+----------------+----------+
| Phase                     | Old full sync | New full async | % change |
+---------------------------+---------------+----------------+----------+
| Total dpm_suspend*() time |        107 ms |          60 ms |     -44% |
+---------------------------+---------------+----------------+----------+
| Total dpm_resume*() time  |         75 ms |          74 ms |      -1% |
+---------------------------+---------------+----------------+----------+
| Sum                       |        182 ms |         134 ms |     -26% |
+---------------------------+---------------+----------------+----------+

Signed-off-by: Saravana Kannan <saravanak@google.com>
---
 drivers/base/power/main.c | 242 ++++++++++++++++++++++++++++++++------
 1 file changed, 205 insertions(+), 37 deletions(-)

Comments

Geert Uytterhoeven March 11, 2025, 11:25 a.m. UTC | #1
Hi Saravana,

On Thu, 14 Nov 2024 at 23:09, Saravana Kannan <saravanak@google.com> wrote:
> The dpm_list used for suspend/resume ensures that all superior devices
> (parents and suppliers) precede subordinate devices (children and
> consumers).
>
> Current async resume logic:
> -------------------------------
> * For each resume phase (except the "complete" phase, which is always
>   serialized), the resume logic first queues all async devices in the
>   dpm_list. It then loops through the dpm_list again to resume the sync
>   devices one by one.
>
> * Async devices wait for all their superior devices to resume before
>   starting their own resume operation.
>
> * This process results in multiple sleep and wake-up cycles before an
>   async device actually resumes. This sleeping also causes kworker
>   threads to stall with work for a period. Consequently, the workqueue
>   framework spins up more kworker threads to handle the other async
>   devices.
>
> * The end result is excessive thread creation, wake-ups, sleeps, and
>   context switches for every async device. This overhead makes a full
>   async resume (with all devices marked as async-capable) much slower
>   than a synchronous resume.
>
> Current async suspend logic:
> --------------------------------
> * The async suspend logic differs from the async resume logic. The
>   suspend logic loops through the dpm_list. When it finds an async
>   device, it queues the work and moves on. However, if it encounters a
>   sync device, it waits until the sync device (and all its subordinate
>   devices) have suspended before proceeding to the next device.
>   Therefore, an async suspend device can be left waiting on an
>   unrelated device before even being queued.
>
> * Once queued, an async device experiences the same inefficiencies as
>   in the resume logic (thread creation, wake-ups, sleeps, and context
>   switches).
>
> On a Pixel 6, averaging over 100 suspend/resume cycles, the data is as
> follows:
>
> +---------------------------+-----------+------------+----------+
> | Phase                     | Full sync | Full async | % change |
> +---------------------------+-----------+------------+----------+
> | Total dpm_suspend*() time |    107 ms |      72 ms |     -33% |
> +---------------------------+-----------+------------+----------+
> | Total dpm_resume*() time  |     75 ms |      90 ms |     +20% |
> +---------------------------+-----------+------------+----------+
> | Sum                       |    182 ms |     162 ms |     -11% |
> +---------------------------+-----------+------------+----------+
>
> This shows that full async suspend/resume is not a viable option. It
> makes the user-visible resume phase slower and only improves the
> overall time by 11%.
>
> To fix all this, this patch introduces new async suspend/resume
> logic.
>
> New suspend/resume logic:
> -------------------------
> * For each suspend/resume phase (except "complete" and "prepare,"
>   which are always serialized), the logic first queues only the async
>   devices that don't have to wait for any subordinates (for suspend)
>   or superiors (for resume). It then loops through the dpm_list again
>   to suspend/resume the sync devices one by one.
>
> * When a device (sync or async) successfully suspends/resumes, it
>   examines its superiors/subordinates and queues only the async
>   devices that don't need to wait for any subordinates/superiors.
>
> With this new logic:
>
> * Queued async devices don't have to wait for completion and are
>   always ready to perform their suspend/resume operation.
>
> * The queue of async devices remains short.
>
> * kworkers never sleep for extended periods, and the workqueue
>   framework doesn't spin up many new threads to handle a backlog of
>   async devices.
>
> * The result is approximately NCPU kworker threads running in parallel
>   without sleeping until all async devices finish.
>
> On a Pixel 6, averaging over 100 suspend/resume cycles, the new logic
> yields improved results:
>
> +---------------------------+---------------+----------------+----------+
> | Phase                     | Old full sync | New full async | % change |
> +---------------------------+---------------+----------------+----------+
> | Total dpm_suspend*() time |        107 ms |          60 ms |     -44% |
> +---------------------------+---------------+----------------+----------+
> | Total dpm_resume*() time  |         75 ms |          74 ms |      -1% |
> +---------------------------+---------------+----------------+----------+
> | Sum                       |        182 ms |         134 ms |     -26% |
> +---------------------------+---------------+----------------+----------+
>
> Signed-off-by: Saravana Kannan <saravanak@google.com>

Thanks for your patch!

On Renesas Gray Hawk Single (R-Car V4M) during s2idle:

PM: suspend entry (s2idle)
Filesystems sync: 0.055 seconds
Freezing user space processes
Freezing user space processes completed (elapsed 0.004 seconds)
OOM killer disabled.
Freezing remaining freezable tasks
Freezing remaining freezable tasks completed (elapsed 0.003 seconds)

================================
WARNING: inconsistent lock state
6.14.0-rc5-rcar3-05904-g1ec95427acf9 #261 Tainted: G        W
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
s2idle/764 [HC0[0]:SC0[0]:HE1:SE1] takes:
ffffff844392c190 (&dev->power.lock){?.-.}-{3:3}, at: dpm_async_fn+0x24/0xa8
{IN-HARDIRQ-W} state was registered at:
  lock_acquire+0x26c/0x2c4
  _raw_spin_lock_irqsave+0x54/0x70
  pm_suspend_timer_fn+0x20/0x78
  __hrtimer_run_queues+0x204/0x330
  hrtimer_interrupt+0xa8/0x1b0
  arch_timer_handler_virt+0x28/0x3c
  handle_percpu_devid_irq+0x64/0x110
  handle_irq_desc+0x3c/0x50
  generic_handle_domain_irq+0x18/0x20
  gic_handle_irq+0x50/0xbc
  call_on_irq_stack+0x24/0x34
  do_interrupt_handler+0x60/0x88
  el1_interrupt+0x30/0x48
  el1h_64_irq_handler+0x14/0x1c
  el1h_64_irq+0x70/0x74
  cpuidle_enter_state+0x1a4/0x2d0
  cpuidle_enter+0x34/0x48
  do_idle+0x21c/0x240
  cpu_startup_entry+0x30/0x34
  kernel_init+0x0/0x124
  console_on_rootfs+0x0/0x64
  __primary_switched+0x88/0x90
irq event stamp: 17055
hardirqs last  enabled at (17055): [<ffffffc080a0782c>]
_raw_spin_unlock_irqrestore+0x34/0x54
hardirqs last disabled at (17054): [<ffffffc080a075a4>]
_raw_spin_lock_irqsave+0x28/0x70
softirqs last  enabled at (14360): [<ffffffc080096bcc>]
handle_softirqs+0x1b0/0x3b4
softirqs last disabled at (14355): [<ffffffc080010168>] __do_softirq+0x10/0x18

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&dev->power.lock);
  <Interrupt>
    lock(&dev->power.lock);

 *** DEADLOCK ***

5 locks held by s2idle/764:
 #0: ffffff84400983f0 (sb_writers#5){.+.+}-{0:0}, at:
file_start_write.isra.0+0x24/0x30
 #1: ffffff8446802288 (&of->mutex#2){+.+.}-{4:4}, at:
kernfs_fop_write_iter+0xf8/0x180
 #2: ffffff8440d016e8 (kn->active#39){.+.+}-{0:0}, at:
kernfs_fop_write_iter+0x100/0x180
 #3: ffffffc0812f4780 (system_transition_mutex){+.+.}-{4:4}, at:
pm_suspend+0x84/0x248
 #4: ffffffc084bb7f78 (dpm_list_mtx){+.+.}-{4:4}, at: dpm_suspend+0x84/0x1a8

stack backtrace:
CPU: 0 UID: 0 PID: 764 Comm: s2idle Tainted: G        W
6.14.0-rc5-rcar3-05904-g1ec95427acf9 #261
Tainted: [W]=WARN
Hardware name: Renesas Gray Hawk Single board based on r8a779h0 (DT)
Call trace:
 show_stack+0x14/0x1c (C)
 dump_stack_lvl+0x78/0xa8
 dump_stack+0x14/0x1c
 print_usage_bug+0x1dc/0x1f8
 mark_lock+0x1c4/0x3a4
 __lock_acquire+0x560/0x1038
 lock_acquire+0x26c/0x2c4
 _raw_spin_lock+0x40/0x54
 dpm_async_fn+0x24/0xa8
 dpm_async_queue_suspend_ready_fn+0x40/0x50
 dpm_async_suspend_loop+0x48/0x50
 dpm_suspend+0x94/0x1a8
 dpm_suspend_start+0x68/0x70
 suspend_devices_and_enter+0xd8/0x59c
 pm_suspend+0x214/0x248
 state_store+0xa8/0xe8
 kobj_attr_store+0x14/0x24
 sysfs_kf_write+0x4c/0x64
 kernfs_fop_write_iter+0x138/0x180
 vfs_write+0x148/0x1b4
 ksys_write+0x78/0xe0
 __arm64_sys_write+0x14/0x1c
 invoke_syscall+0x68/0xf0
 el0_svc_common.constprop.0+0xb0/0xcc
 do_el0_svc+0x18/0x20
 el0_svc+0x38/0x90
 el0t_64_sync_handler+0x80/0x130
 el0t_64_sync+0x158/0x15c


Gr{oetje,eeting}s,

                        Geert

Patch

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index aa1470ef6ac0..65c195be48b8 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -121,6 +121,12 @@  void device_pm_unlock(void)
 	mutex_unlock(&dpm_list_mtx);
 }
 
+static bool is_async(struct device *dev)
+{
+	return dev->power.async_suspend && pm_async_enabled
+		&& !pm_trace_is_enabled();
+}
+
 /**
  * device_pm_add - Add a device to the PM core's list of active devices.
  * @dev: Device to add to the list.
@@ -287,6 +293,9 @@  static int dpm_for_each_superior(struct device *dev, void *data,
 
 static bool dpm_wait_for_superior(struct device *dev, bool async)
 {
+	if (is_async(dev))
+		return true;
+
 	dpm_for_each_superior(dev, &async, dpm_wait_fn);
 
 	/*
@@ -334,9 +343,22 @@  static int dpm_for_each_subordinate(struct device *dev, void *data,
 
 static void dpm_wait_for_subordinate(struct device *dev, bool async)
 {
+	if (is_async(dev))
+		return;
+
 	dpm_for_each_subordinate(dev, &async, dpm_wait_fn);
 }
 
+static int dpm_check_fn(struct device *dev, void *data)
+{
+	return completion_done(&dev->power.completion) ? 0 : -EBUSY;
+}
+
+static int dpm_check_subordinate(struct device *dev)
+{
+	return dpm_for_each_subordinate(dev, NULL, dpm_check_fn);
+}
+
 /**
  * pm_op - Return the PM operation appropriate for given PM event.
  * @ops: PM operations to choose from.
@@ -578,33 +600,39 @@  bool dev_pm_skip_resume(struct device *dev)
 	return !dev->power.must_resume;
 }
 
-static bool is_async(struct device *dev)
+static void dpm_async_fn(struct device *dev, async_func_t func)
 {
-	return dev->power.async_suspend && pm_async_enabled
-		&& !pm_trace_is_enabled();
-}
-
-static bool dpm_async_fn(struct device *dev, async_func_t func)
-{
-	reinit_completion(&dev->power.completion);
-
-	if (is_async(dev)) {
-		dev->power.async_in_progress = true;
-
-		get_device(dev);
+	/*
+	 * Multiple async devices could trigger the async queuing of this
+	 * device. Make sure not to double queue it.
+	 */
+	spin_lock(&dev->power.lock);
+	if (dev->power.async_in_progress) {
+		spin_unlock(&dev->power.lock);
+		return;
+	}
+	dev->power.async_in_progress = true;
+	spin_unlock(&dev->power.lock);
 
-		if (async_schedule_dev_nocall(func, dev))
-			return true;
+	get_device(dev);
 
-		put_device(dev);
-	}
 	/*
-	 * Because async_schedule_dev_nocall() above has returned false or it
-	 * has not been called at all, func() is not running and it is safe to
-	 * update the async_in_progress flag without extra synchronization.
+	 * We aren't going to call any callbacks if the device has none.  Also,
+	 * direct_complete means all the resume and suspend callbacks should be
+	 * skipped and the device should go straight to dpm_complete().  In both
+	 * of these cases, calling func() synchronously will be a lot quicker
+	 * than even queuing the async work. So, do that.
 	 */
-	dev->power.async_in_progress = false;
-	return false;
+	if (dev->power.no_pm_callbacks || dev->power.direct_complete) {
+		func(dev, 0);
+		return;
+	}
+
+	if (async_schedule_dev_nocall(func, dev))
+		return;
+
+	WARN(1, "Unable to schedule async suspend/resume calls!\n");
+	put_device(dev);
 }
 
 /**
@@ -692,12 +720,47 @@  static void device_resume_noirq(struct device *dev, pm_message_t state, bool asy
 	}
 }
 
+static void dpm_reinit_dev_state(struct list_head *list)
+{
+	struct device *dev;
+
+	list_for_each_entry(dev, list, power.entry) {
+		reinit_completion(&dev->power.completion);
+		dev->power.async_in_progress = false;
+	}
+}
+
+static int dpm_check_superior(struct device *dev)
+{
+	return dpm_for_each_superior(dev, NULL, dpm_check_fn);
+}
+
+static int dpm_async_queue_resume_ready_fn(struct device *dev, void *data)
+{
+	if (!is_async(dev) || dpm_check_superior(dev))
+		return 0;
+
+	dpm_async_fn(dev, data);
+	return 0;
+}
+
+static void dpm_async_resume_loop(struct list_head *from, async_func_t func)
+{
+	struct device *dev;
+
+	list_for_each_entry(dev, from, power.entry)
+		dpm_async_queue_resume_ready_fn(dev, func);
+}
+
 static void async_resume_noirq(void *data, async_cookie_t cookie)
 {
 	struct device *dev = data;
 
 	device_resume_noirq(dev, pm_transition, true);
 	put_device(dev);
+
+	dpm_for_each_subordinate(dev, async_resume_noirq,
+				 dpm_async_queue_resume_ready_fn);
 }
 
 static void dpm_noirq_resume_devices(pm_message_t state)
@@ -712,23 +775,26 @@  static void dpm_noirq_resume_devices(pm_message_t state)
 
 	mutex_lock(&dpm_list_mtx);
 
+	dpm_reinit_dev_state(&dpm_noirq_list);
+
 	/*
 	 * Trigger the resume of "async" devices upfront so they don't have to
 	 * wait for the "non-async" ones they don't depend on.
 	 */
-	list_for_each_entry(dev, &dpm_noirq_list, power.entry)
-		dpm_async_fn(dev, async_resume_noirq);
+	dpm_async_resume_loop(&dpm_noirq_list, async_resume_noirq);
 
 	while (!list_empty(&dpm_noirq_list)) {
 		dev = to_device(dpm_noirq_list.next);
 		list_move_tail(&dev->power.entry, &dpm_late_early_list);
 
-		if (!dev->power.async_in_progress) {
+		if (!is_async(dev)) {
 			get_device(dev);
 
 			mutex_unlock(&dpm_list_mtx);
 
 			device_resume_noirq(dev, state, false);
+			dpm_for_each_subordinate(dev, async_resume_noirq,
+						 dpm_async_queue_resume_ready_fn);
 
 			put_device(dev);
 
@@ -834,6 +900,9 @@  static void async_resume_early(void *data, async_cookie_t cookie)
 
 	device_resume_early(dev, pm_transition, true);
 	put_device(dev);
+
+	dpm_for_each_subordinate(dev, async_resume_early,
+				 dpm_async_queue_resume_ready_fn);
 }
 
 /**
@@ -852,23 +921,26 @@  void dpm_resume_early(pm_message_t state)
 
 	mutex_lock(&dpm_list_mtx);
 
+	dpm_reinit_dev_state(&dpm_late_early_list);
+
 	/*
 	 * Trigger the resume of "async" devices upfront so they don't have to
 	 * wait for the "non-async" ones they don't depend on.
 	 */
-	list_for_each_entry(dev, &dpm_late_early_list, power.entry)
-		dpm_async_fn(dev, async_resume_early);
+	dpm_async_resume_loop(&dpm_late_early_list, async_resume_early);
 
 	while (!list_empty(&dpm_late_early_list)) {
 		dev = to_device(dpm_late_early_list.next);
 		list_move_tail(&dev->power.entry, &dpm_suspended_list);
 
-		if (!dev->power.async_in_progress) {
+		if (!is_async(dev)) {
 			get_device(dev);
 
 			mutex_unlock(&dpm_list_mtx);
 
 			device_resume_early(dev, state, false);
+			dpm_for_each_subordinate(dev, async_resume_early,
+					dpm_async_queue_resume_ready_fn);
 
 			put_device(dev);
 
@@ -996,6 +1068,9 @@  static void async_resume(void *data, async_cookie_t cookie)
 
 	device_resume(dev, pm_transition, true);
 	put_device(dev);
+
+	dpm_for_each_subordinate(dev, async_resume,
+				 dpm_async_queue_resume_ready_fn);
 }
 
 /**
@@ -1018,23 +1093,26 @@  void dpm_resume(pm_message_t state)
 
 	mutex_lock(&dpm_list_mtx);
 
+	dpm_reinit_dev_state(&dpm_suspended_list);
+
 	/*
 	 * Trigger the resume of "async" devices upfront so they don't have to
 	 * wait for the "non-async" ones they don't depend on.
 	 */
-	list_for_each_entry(dev, &dpm_suspended_list, power.entry)
-		dpm_async_fn(dev, async_resume);
+	dpm_async_resume_loop(&dpm_suspended_list, async_resume);
 
 	while (!list_empty(&dpm_suspended_list)) {
 		dev = to_device(dpm_suspended_list.next);
 		list_move_tail(&dev->power.entry, &dpm_prepared_list);
 
-		if (!dev->power.async_in_progress) {
+		if (!is_async(dev)) {
 			get_device(dev);
 
 			mutex_unlock(&dpm_list_mtx);
 
 			device_resume(dev, state, false);
+			dpm_for_each_subordinate(dev, async_resume,
+					dpm_async_queue_resume_ready_fn);
 
 			put_device(dev);
 
@@ -1274,12 +1352,35 @@  static int device_suspend_noirq(struct device *dev, pm_message_t state, bool asy
 	return error;
 }
 
+static int dpm_async_queue_suspend_ready_fn(struct device *dev, void *data)
+{
+	if (!is_async(dev) || dpm_check_subordinate(dev))
+		return 0;
+
+	dpm_async_fn(dev, data);
+	return 0;
+}
+
+static void dpm_async_suspend_loop(struct list_head *from, async_func_t func)
+{
+	struct device *dev;
+
+	list_for_each_entry_reverse(dev, from, power.entry)
+		dpm_async_queue_suspend_ready_fn(dev, func);
+}
+
 static void async_suspend_noirq(void *data, async_cookie_t cookie)
 {
 	struct device *dev = data;
 
 	device_suspend_noirq(dev, pm_transition, true);
 	put_device(dev);
+
+	if (async_error)
+		return;
+
+	dpm_for_each_superior(dev, async_suspend_noirq,
+			      dpm_async_queue_suspend_ready_fn);
 }
 
 static int dpm_noirq_suspend_devices(pm_message_t state)
@@ -1294,12 +1395,20 @@  static int dpm_noirq_suspend_devices(pm_message_t state)
 
 	mutex_lock(&dpm_list_mtx);
 
+	dpm_reinit_dev_state(&dpm_late_early_list);
+
+	/*
+	 * Trigger the "async" devices upfront so they don't have to wait for
+	 * the "non-async" ones they don't depend on.
+	 */
+	dpm_async_suspend_loop(&dpm_late_early_list, async_suspend_noirq);
+
 	while (!list_empty(&dpm_late_early_list)) {
 		struct device *dev = to_device(dpm_late_early_list.prev);
 
 		list_move(&dev->power.entry, &dpm_noirq_list);
 
-		if (dpm_async_fn(dev, async_suspend_noirq))
+		if (is_async(dev))
 			continue;
 
 		get_device(dev);
@@ -1307,13 +1416,20 @@  static int dpm_noirq_suspend_devices(pm_message_t state)
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_suspend_noirq(dev, state, false);
+		if (!error && !async_error)
+			dpm_for_each_superior(dev, async_suspend_noirq,
+					      dpm_async_queue_suspend_ready_fn);
 
 		put_device(dev);
 
 		mutex_lock(&dpm_list_mtx);
 
-		if (error || async_error)
+		if (error || async_error) {
+			/* See explanation in dpm_suspend() */
+			list_splice_init(&dpm_late_early_list, &dpm_noirq_list);
 			break;
+		}
+
 	}
 
 	mutex_unlock(&dpm_list_mtx);
@@ -1447,6 +1563,12 @@  static void async_suspend_late(void *data, async_cookie_t cookie)
 
 	device_suspend_late(dev, pm_transition, true);
 	put_device(dev);
+
+	if (async_error)
+		return;
+
+	dpm_for_each_superior(dev, async_suspend_late,
+			      dpm_async_queue_suspend_ready_fn);
 }
 
 /**
@@ -1467,12 +1589,20 @@  int dpm_suspend_late(pm_message_t state)
 
 	mutex_lock(&dpm_list_mtx);
 
+	dpm_reinit_dev_state(&dpm_suspended_list);
+
+	/*
+	 * Trigger the "async" devices upfront so they don't have to wait for
+	 * the "non-async" ones they don't depend on.
+	 */
+	dpm_async_suspend_loop(&dpm_suspended_list, async_suspend_late);
+
 	while (!list_empty(&dpm_suspended_list)) {
 		struct device *dev = to_device(dpm_suspended_list.prev);
 
 		list_move(&dev->power.entry, &dpm_late_early_list);
 
-		if (dpm_async_fn(dev, async_suspend_late))
+		if (is_async(dev))
 			continue;
 
 		get_device(dev);
@@ -1480,13 +1610,19 @@  int dpm_suspend_late(pm_message_t state)
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_suspend_late(dev, state, false);
+		if (!error && !async_error)
+			dpm_for_each_superior(dev, async_suspend_late,
+					      dpm_async_queue_suspend_ready_fn);
 
 		put_device(dev);
 
 		mutex_lock(&dpm_list_mtx);
 
-		if (error || async_error)
+		if (error || async_error) {
+			/* See explanation in dpm_suspend() */
+			list_splice_init(&dpm_suspended_list, &dpm_late_early_list);
 			break;
+		}
 	}
 
 	mutex_unlock(&dpm_list_mtx);
@@ -1712,6 +1848,12 @@  static void async_suspend(void *data, async_cookie_t cookie)
 
 	device_suspend(dev, pm_transition, true);
 	put_device(dev);
+
+	if (async_error)
+		return;
+
+	dpm_for_each_superior(dev, async_suspend,
+			      dpm_async_queue_suspend_ready_fn);
 }
 
 /**
@@ -1734,12 +1876,20 @@  int dpm_suspend(pm_message_t state)
 
 	mutex_lock(&dpm_list_mtx);
 
+	dpm_reinit_dev_state(&dpm_prepared_list);
+
+	/*
+	 * Trigger the "async" devices upfront so they don't have to wait for
+	 * the "non-async" ones they don't depend on.
+	 */
+	dpm_async_suspend_loop(&dpm_prepared_list, async_suspend);
+
 	while (!list_empty(&dpm_prepared_list)) {
 		struct device *dev = to_device(dpm_prepared_list.prev);
 
 		list_move(&dev->power.entry, &dpm_suspended_list);
 
-		if (dpm_async_fn(dev, async_suspend))
+		if (is_async(dev))
 			continue;
 
 		get_device(dev);
@@ -1747,13 +1897,31 @@  int dpm_suspend(pm_message_t state)
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_suspend(dev, state, false);
+		if (!error && !async_error)
+			dpm_for_each_superior(dev, async_suspend,
+					      dpm_async_queue_suspend_ready_fn);
 
 		put_device(dev);
 
 		mutex_lock(&dpm_list_mtx);
 
-		if (error || async_error)
+		if (error || async_error) {
+			/*
+			 * async devices that come after the current sync device
+			 * in the list might have already suspended. We need to
+			 * make sure those async devices are resumed again. By
+			 * moving all devices to the next list, we make sure the
+			 * error handling path resumes all suspended devices.
+			 *
+			 * However, we also need to avoid resuming devices that
+			 * haven't been suspended yet. Fortunately, the existing
+			 * is_suspended/is_noirq_suspended/is_late_suspended
+			 * flags take care of skipping the resume of a device if
+			 * it hasn't been suspended yet.
+			 */
+			list_splice_init(&dpm_prepared_list, &dpm_suspended_list);
 			break;
+		}
 	}
 
 	mutex_unlock(&dpm_list_mtx);