diff mbox series

[v2,3/4] PM: hibernate: allow wait_for_device_probe() to timeout when resuming from hibernation

Message ID 2646e8a3-cc9f-c2c5-e4d6-c86de6e1b739@I-love.SAKURA.ne.jp
State New
Headers show
Series [v2,1/4] char: misc: allow calling open() callback without misc_mtx held | expand

Commit Message

Tetsuo Handa July 10, 2022, 2:25 a.m. UTC
syzbot is reporting hung task at misc_open() [1], for there is a race
window of AB-BA deadlock which involves probe_count variable.

Even with "char: misc: allow calling open() callback without misc_mtx
held" and "PM: hibernate: call wait_for_device_probe() without
system_transition_mutex held", wait_for_device_probe() from snapshot_open()
can sleep forever if probe_count cannot become 0.

Since snapshot_open() is a userland-driven hibernation/resume request,
it should be acceptable to fail if something is wrong. Users would not
want to wait for hours if device stopped responding. Therefore, introduce
killable version of wait_for_device_probe() with timeout.

According to Oliver Neukum, there are SCSI commands that can run for more
than 60 seconds. Therefore, this patch choose 5 minutes for timeout.

Link: https://syzkaller.appspot.com/bug?extid=358c9ab4c93da7b7238c [1]
Reported-by: syzbot <syzbot+358c9ab4c93da7b7238c@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Wedson Almeida Filho <wedsonaf@google.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 drivers/base/dd.c             | 14 ++++++++++++++
 include/linux/device/driver.h |  1 +
 kernel/power/user.c           |  9 +++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

Comments

Rafael J. Wysocki July 11, 2022, 6:13 p.m. UTC | #1
On Sun, Jul 10, 2022 at 4:25 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> syzbot is reporting hung task at misc_open() [1], for there is a race
> window of AB-BA deadlock which involves probe_count variable.
>
> Even with "char: misc: allow calling open() callback without misc_mtx
> held" and "PM: hibernate: call wait_for_device_probe() without
> system_transition_mutex held", wait_for_device_probe() from snapshot_open()
> can sleep forever if probe_count cannot become 0.
>
> Since snapshot_open() is a userland-driven hibernation/resume request,
> it should be acceptable to fail if something is wrong.

Not really.

If you are resuming from hibernation and the image cannot be reached
(which is the situation described above), failing and continuing to
boot means discarding the image and possible user data loss.

There is no "graceful failure" in this case.

> Users would not want to wait for hours if device stopped responding.

If the device holding the image is not responding, we should better
wait for it or panic().  Or let the user make the system reboot.
diff mbox series

Patch

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 70f79fc71539..3136b8403bb0 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -724,6 +724,20 @@  void wait_for_device_probe(void)
 }
 EXPORT_SYMBOL_GPL(wait_for_device_probe);
 
+void wait_for_device_probe_killable_timeout(unsigned long timeout)
+{
+	/* wait for probe timeout */
+	wait_event(probe_timeout_waitqueue, !driver_deferred_probe_timeout);
+
+	/* wait for the deferred probe workqueue to finish */
+	flush_work(&deferred_probe_work);
+
+	/* wait for the known devices to complete their probing */
+	wait_event_killable_timeout(probe_waitqueue,
+				    atomic_read(&probe_count) == 0, timeout);
+	async_synchronize_full();
+}
+
 static int __driver_probe_device(struct device_driver *drv, struct device *dev)
 {
 	int ret = 0;
diff --git a/include/linux/device/driver.h b/include/linux/device/driver.h
index 7acaabde5396..4ee909144470 100644
--- a/include/linux/device/driver.h
+++ b/include/linux/device/driver.h
@@ -129,6 +129,7 @@  extern struct device_driver *driver_find(const char *name,
 					 struct bus_type *bus);
 extern int driver_probe_done(void);
 extern void wait_for_device_probe(void);
+extern void wait_for_device_probe_killable_timeout(unsigned long timeout);
 void __init wait_for_init_devices_probe(void);
 
 /* sysfs interface for exporting driver attributes */
diff --git a/kernel/power/user.c b/kernel/power/user.c
index db98a028dfdd..32dd5a855e8c 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -58,8 +58,13 @@  static int snapshot_open(struct inode *inode, struct file *filp)
 		/* The image device should be already ready. */
 		break;
 	default: /* Resuming */
-		/* We may need to wait for the image device to appear. */
-		wait_for_device_probe();
+		/*
+		 * Since the image device might not be ready, try to wait up to
+		 * 5 minutes. We should not wait forever, for we might get stuck
+		 * due to unresponsive devices and/or new probe events which
+		 * are irrelevant to the image device keep coming in.
+		 */
+		wait_for_device_probe_killable_timeout(300 * HZ);
 		break;
 	}