
scsi: core: cleanup request queue before releasing gendisk

Message ID 20210915092547.990285-1-ming.lei@redhat.com
State New
Series scsi: core: cleanup request queue before releasing gendisk

Commit Message

Ming Lei Sept. 15, 2021, 9:25 a.m. UTC
The gendisk instance has to be released after the request queue is
cleaned up, because the bdi is referenced from the gendisk since commit
edb0872f44ec ("block: move the bdi from the request_queue to the
gendisk").

For sd and sr, the gendisk can be removed in the release handler
(sd_remove/sr_remove) of sdev->sdev_gendev, which is triggered by
device_del(sdev->sdev_gendev) in __scsi_remove_device(), while the
request queue hasn't been cleaned up yet.

So a kernel oops can be triggered when the bdi is referenced via the
gendisk.

Fix the issue by moving blk_cleanup_queue() into sd_remove() and
sr_remove().

Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_priv.h   | 6 ++++++
 drivers/scsi/scsi_sysfs.c  | 3 ++-
 drivers/scsi/sd.c          | 2 ++
 drivers/scsi/sr.c          | 4 +++-
 include/scsi/scsi_device.h | 1 +
 5 files changed, 14 insertions(+), 2 deletions(-)
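
For orientation, a minimal sketch of the broken ordering (the helper
below is hypothetical and heavily simplified; the real call path runs
through the driver model's release callbacks):

/*
 * Sketch, not kernel source: the removal order before this patch.
 * device_del() ends up invoking sd_remove()/sr_remove(), whose release
 * path drops the gendisk -- and since edb0872f44ec the bdi lives in the
 * gendisk -- while blk_cleanup_queue() only runs afterwards and may
 * still dereference that bdi.
 */
static void removal_order_before_this_patch(struct scsi_device *sdev)
{
	device_del(&sdev->sdev_gendev);		/* gendisk and its bdi released */
	blk_cleanup_queue(sdev->request_queue);	/* late bdi access -> oops */
}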

Comments

Ming Lei Sept. 17, 2021, 3:39 a.m. UTC | #1
On Thu, Sep 16, 2021 at 04:20:09PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 16, 2021 at 08:38:16PM +0800, Ming Lei wrote:
> > > > and it may cause other trouble at least for scsi disk since sd_shutdown()
> > > > follows del_gendisk() and has to be called before blk_cleanup_queue().
> > > 
> > > Yes.  So we need to move the bits of blk_cleanup_queue that deal with
> > > the file system I/O state to del_gendisk, and keep blk_cleanup_queue
> > > for anything actually needed for the low-level queue.
> > 
> > Can you explain what the bits are in blk_cleanup_queue() for dealing with
> > FS I/O state? blk_cleanup_queue() basically drains and shuts down the
> > queue, none of which should be related to the gendisk, and it is fine to
> > implement a queue without a gendisk involved, such as the nvme admin or
> > connect queues.
> > 
> > Wrt. this reported issue, rq_qos_exit() needs to run before releasing
> > the gendisk, but the queue has to be put into freezing before calling
> > rq_qos_exit(),
> 
> I was curious what you hit, but yes rq_qos_exit is obvious.
> blk_flush_integrity also is very much about fs I/O state.
> 
> > so it looks like you are suggesting to move the following code into
> > del_gendisk()?
> 
> Something like that.  I think we need to split the dying flag into
> one for the gendisk and one for the queue first, and make sure the
> queue freeze in del_gendisk is released again so that passthrough
> still works afterwards.


If we do that, q->disk is really unnecessary, so it looks like the fix in
commit d152c682f03c ("block: add an explicit ->disk backpointer to the
request_queue") isn't good. The original issue introduced by commit
edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
can be fixed simply by moving these two lines of code from
blk_unregister_queue() to blk_cleanup_queue():

        kobject_uevent(&q->kobj, KOBJ_REMOVE);
        kobject_del(&q->kobj);
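
A sketch of that suggestion (assuming the rough shape blk_cleanup_queue()
had at the time; only the two moved lines are the point here):

/*
 * Sketch only: blk_cleanup_queue() with the kobject teardown moved in
 * from blk_unregister_queue(), so the queue's sysfs presence goes away
 * during cleanup, while the gendisk is still alive.
 */
void blk_cleanup_queue(struct request_queue *q)
{
	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
	blk_freeze_queue(q);

	/* moved here from blk_unregister_queue() per the suggestion above */
	kobject_uevent(&q->kobj, KOBJ_REMOVE);
	kobject_del(&q->kobj);

	/* ... the existing teardown continues: rq_qos_exit(q),
	 * blk_sync_queue(q), blk_mq_exit_queue(q), ... */
}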


Thanks,
Ming
Christoph Hellwig Sept. 17, 2021, 6:53 a.m. UTC | #2
On Fri, Sep 17, 2021 at 11:39:48AM +0800, Ming Lei wrote:
> If we do that, q->disk is really unnecessary, so it looks like the fix in
> commit d152c682f03c ("block: add an explicit ->disk backpointer to the
> request_queue") isn't good. The original issue introduced by commit
> edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
> can be fixed simply by moving these two lines of code from
> blk_unregister_queue() to blk_cleanup_queue():
> 
>         kobject_uevent(&q->kobj, KOBJ_REMOVE);
>         kobject_del(&q->kobj);


Well, it is still useful to not do the device model dance, especially as
uses of queue->disk will grow in 5.16.

Anyway, can you give this patch a spin?  I've done some basic testing
including nvme surprise removal locally.  Note that just like
blk_cleanup_queue this doesn't actually wait for the freeze to finish.
Eventually we should do that, but I'm sure this will show tons of issues
with drivers not properly cleaning up.

diff --git a/block/blk-core.c b/block/blk-core.c
index 5454db2fa263b..18287c443ae81 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -49,7 +49,6 @@
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
 #include "blk-pm.h"
-#include "blk-rq-qos.h"
 
 struct dentry *blk_debugfs_root;
 
@@ -385,13 +384,8 @@ void blk_cleanup_queue(struct request_queue *q)
 	 */
 	blk_freeze_queue(q);
 
-	rq_qos_exit(q);
-
 	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
 
-	/* for synchronous bio-based driver finish in-flight integrity i/o */
-	blk_flush_integrity();
-
 	blk_sync_queue(q);
 	if (queue_is_mq(q))
 		blk_mq_exit_queue(q);
@@ -470,19 +464,26 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 
 static inline int bio_queue_enter(struct bio *bio)
 {
-	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
 	bool nowait = bio->bi_opf & REQ_NOWAIT;
 	int ret;
 
-	ret = blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0);
+	ret = blk_queue_enter(disk->queue, nowait ? BLK_MQ_REQ_NOWAIT : 0);
 	if (unlikely(ret)) {
-		if (nowait && !blk_queue_dying(q))
+		if (nowait && !blk_queue_dying(disk->queue))
 			bio_wouldblock_error(bio);
 		else
 			bio_io_error(bio);
+		return ret;
 	}
 
-	return ret;
+	if (unlikely(!disk_live(disk))) {
+		blk_queue_exit(disk->queue);
+		bio_io_error(bio);
+		return -ENODEV;
+	}
+
+	return 0;
 }
 
 void blk_queue_exit(struct request_queue *q)
diff --git a/block/genhd.c b/block/genhd.c
index 7b6e5e1cf9564..8abe12dca76f2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -26,6 +26,7 @@
 #include <linux/badblocks.h>
 
 #include "blk.h"
+#include "blk-rq-qos.h"
 
 static struct kobject *block_depr;
 
@@ -559,6 +560,8 @@ EXPORT_SYMBOL(device_add_disk);
  */
 void del_gendisk(struct gendisk *disk)
 {
+	struct request_queue *q = disk->queue;
+
 	might_sleep();
 
 	if (WARN_ON_ONCE(!disk_live(disk) && !(disk->flags & GENHD_FL_HIDDEN)))
@@ -572,6 +575,21 @@ void del_gendisk(struct gendisk *disk)
 	blk_drop_partitions(disk);
 	mutex_unlock(&disk->open_mutex);
 
+	/*
+	 * Prevent new I/O from crossing bio_queue_enter().
+	 */
+	blk_freeze_queue_start(q);
+	if (queue_is_mq(q))
+		blk_mq_wake_waiters(q);
+	/* Make blk_queue_enter() reexamine the DYING flag. */
+	wake_up_all(&q->mq_freeze_wq);
+
+	rq_qos_exit(q);
+	blk_sync_queue(q);
+	blk_flush_integrity();
+	/* allow using passthrough request again after the queue is torn down */
+	blk_mq_unfreeze_queue(q);
+
 	fsync_bdev(disk->part0);
 	__invalidate_device(disk->part0, true);
Ming Lei Sept. 17, 2021, 7:41 a.m. UTC | #3
On Fri, Sep 17, 2021 at 08:53:05AM +0200, Christoph Hellwig wrote:
> On Fri, Sep 17, 2021 at 11:39:48AM +0800, Ming Lei wrote:
> > If we do that, q->disk is really unnecessary, so it looks like the fix in
> > commit d152c682f03c ("block: add an explicit ->disk backpointer to the
> > request_queue") isn't good. The original issue introduced by commit
> > edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
> > can be fixed simply by moving these two lines of code from
> > blk_unregister_queue() to blk_cleanup_queue():
> > 
> >         kobject_uevent(&q->kobj, KOBJ_REMOVE);
> >         kobject_del(&q->kobj);
> 
> Well, it is still useful to not do the device model dance, especially as
> uses of queue->disk will grow in 5.16.

It isn't a device model dance, since 

> Anyway, can you give this patch a spin?  I've done some basic testing
> including nvme surprise removal locally.  Note that just like
> blk_cleanup_queue this doesn't actually wait for the freeze to finish.
> Eventually we should do that, but I'm sure this will show tons of issues
> with drivers not properly cleaning up.
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5454db2fa263b..18287c443ae81 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -49,7 +49,6 @@
>  #include "blk-mq.h"
>  #include "blk-mq-sched.h"
>  #include "blk-pm.h"
> -#include "blk-rq-qos.h"
>  
>  struct dentry *blk_debugfs_root;
>  
> @@ -385,13 +384,8 @@ void blk_cleanup_queue(struct request_queue *q)
>  	 */
>  	blk_freeze_queue(q);
>  
> -	rq_qos_exit(q);
> -
>  	blk_queue_flag_set(QUEUE_FLAG_DEAD, q);
>  
> -	/* for synchronous bio-based driver finish in-flight integrity i/o */
> -	blk_flush_integrity();
> -
>  	blk_sync_queue(q);
>  	if (queue_is_mq(q))
>  		blk_mq_exit_queue(q);
> @@ -470,19 +464,26 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  
>  static inline int bio_queue_enter(struct bio *bio)
>  {
> -	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +	struct gendisk *disk = bio->bi_bdev->bd_disk;
>  	bool nowait = bio->bi_opf & REQ_NOWAIT;
>  	int ret;
>  
> -	ret = blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0);
> +	ret = blk_queue_enter(disk->queue, nowait ? BLK_MQ_REQ_NOWAIT : 0);
>  	if (unlikely(ret)) {
> -		if (nowait && !blk_queue_dying(q))
> +		if (nowait && !blk_queue_dying(disk->queue))
>  			bio_wouldblock_error(bio);
>  		else
>  			bio_io_error(bio);
> +		return ret;
>  	}
>  
> -	return ret;
> +	if (unlikely(!disk_live(disk))) {
> +		blk_queue_exit(disk->queue);
> +		bio_io_error(bio);
> +		return -ENODEV;
> +	}


Is it safe to fail I/O in this way? There may still be an open bdev, and
usually I/O can be completed successfully even while the disk is being
deleted.

Not to mention it adds one extra check in the real fast path.

> +
> +	return 0;
>  }
>  
>  void blk_queue_exit(struct request_queue *q)
> diff --git a/block/genhd.c b/block/genhd.c
> index 7b6e5e1cf9564..8abe12dca76f2 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -26,6 +26,7 @@
>  #include <linux/badblocks.h>
>  
>  #include "blk.h"
> +#include "blk-rq-qos.h"
>  
>  static struct kobject *block_depr;
>  
> @@ -559,6 +560,8 @@ EXPORT_SYMBOL(device_add_disk);
>   */
>  void del_gendisk(struct gendisk *disk)
>  {
> +	struct request_queue *q = disk->queue;
> +
>  	might_sleep();
>  
>  	if (WARN_ON_ONCE(!disk_live(disk) && !(disk->flags & GENHD_FL_HIDDEN)))
> @@ -572,6 +575,21 @@ void del_gendisk(struct gendisk *disk)
>  	blk_drop_partitions(disk);
>  	mutex_unlock(&disk->open_mutex);
>  
> +	/*
> +	 * Prevent new I/O from crossing bio_queue_enter().
> +	 */
> +	blk_freeze_queue_start(q);
> +	if (queue_is_mq(q))
> +		blk_mq_wake_waiters(q);
> +	/* Make blk_queue_enter() reexamine the DYING flag. */
> +	wake_up_all(&q->mq_freeze_wq);
> +
> +	rq_qos_exit(q);


rq_qos_exit() requires the queue to be frozen, otherwise a kernel oops
can be triggered. There may be previously submitted bios that haven't
completed yet, and rq_qos_done() can still be referring to q->rq_qos.

But if a wait for the queue to become frozen is added, one extra freeze
is added, which will slow down removing the disk/queue.


Thanks,
Ming
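
An aside on the rq_qos_exit() point above, simplified from the rq-qos
helpers of that era (blk-rq-qos.[ch]); the exact upstream code differs
slightly:

/*
 * Completion side: walks q->rq_qos for every finished request.
 */
static inline void rq_qos_done(struct request_queue *q, struct request *rq)
{
	struct rq_qos *rqos;

	for (rqos = q->rq_qos; rqos; rqos = rqos->next)
		if (rqos->ops->done)
			rqos->ops->done(rqos, rq);
}

/*
 * Teardown side: unlinks and frees the same structures.  Without a
 * queue freeze (i.e. no requests in flight), a completion can still be
 * walking an rq_qos that ->exit() has just freed -- the oops Ming
 * describes.
 */
void rq_qos_exit(struct request_queue *q)
{
	while (q->rq_qos) {
		struct rq_qos *rqos = q->rq_qos;

		q->rq_qos = rqos->next;
		rqos->ops->exit(rqos);
	}
}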
Christoph Hellwig Sept. 17, 2021, 7:56 a.m. UTC | #4
On Fri, Sep 17, 2021 at 03:41:05PM +0800, Ming Lei wrote:
> >  
> > -	return ret;
> > +	if (unlikely(!disk_live(disk))) {
> > +		blk_queue_exit(disk->queue);
> > +		bio_io_error(bio);
> > +		return -ENODEV;
> > +	}
> 
> Is it safe to fail I/O in this way? There may still be an open bdev, and
> usually I/O can be completed successfully even while the disk is being
> deleted.


"normal" I/O should not really happen by the time it is deleted.  That
being said we should do this only after the fsync is done. While no
one should rely on that I'm pretty sure some file systems do.
So we'll actually need a deleted flag.

> Not to mention it adds one extra check in the real fast path.


I'm not really worried about the check itself.  That being said, this
inode cache line is not hot right now, so moving it to disk->state will
help, as we need to check the read-only flag in the I/O submission path
anyway.

> > +	/*
> > +	 * Prevent new I/O from crossing bio_queue_enter().
> > +	 */
> > +	blk_freeze_queue_start(q);
> > +	if (queue_is_mq(q))
> > +		blk_mq_wake_waiters(q);
> > +	/* Make blk_queue_enter() reexamine the DYING flag. */
> > +	wake_up_all(&q->mq_freeze_wq);
> > +
> > +	rq_qos_exit(q);
> 
> rq_qos_exit() requires the queue to be frozen, otherwise a kernel oops
> can be triggered. There may be previously submitted bios that haven't
> completed yet, and rq_qos_done() can still be referring to q->rq_qos.


Yes.  I actually misread the old code - it actually does two
blk_freeze_queue_start calls, but it also includes the wait.

> But if a wait for the queue to become frozen is added, one extra freeze
> is added, which will slow down removing the disk/queue.


Is it?  For the typical case the second freeze in blk_cleanup_queue will
be basically free because there is no pending I/O.  The only case where
that matters is if there is pending passthrough I/O again, which can only
happen with SCSI, and even there it is highly unusual.
Ming Lei Sept. 17, 2021, 8:32 a.m. UTC | #5
On Fri, Sep 17, 2021 at 09:56:50AM +0200, Christoph Hellwig wrote:
> On Fri, Sep 17, 2021 at 03:41:05PM +0800, Ming Lei wrote:
> > >  
> > > -	return ret;
> > > +	if (unlikely(!disk_live(disk))) {
> > > +		blk_queue_exit(disk->queue);
> > > +		bio_io_error(bio);
> > > +		return -ENODEV;
> > > +	}
> > 
> > Is it safe to fail I/O in this way? There may still be an open bdev, and
> > usually I/O can be completed successfully even while the disk is being
> > deleted.
> 
> "Normal" I/O should not really happen by the time it is deleted.  That
> being said, we should do this only after the fsync is done.  While no
> one should rely on that, I'm pretty sure some file systems do.
> So we'll actually need a deleted flag.
> 
> > Not to mention it adds one extra check in the real fast path.
> 
> I'm not really worried about the check itself.  That being said, this
> inode cache line is not hot right now, so moving it to disk->state will
> help, as we need to check the read-only flag in the I/O submission path
> anyway.


When the deleted flag is set in del_gendisk(), it may not be observed
in time in submit_bio() because of memory/compiler reordering, so the
check is still a sort of relaxed constraint. I guess the same is true
for the read-only check.
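
The visibility argument in a self-contained form (plain C11 atomics, not
kernel code; the names are made up for illustration):

#include <stdatomic.h>
#include <stdbool.h>

/* A relaxed store/load pair gives no ordering guarantee, so a reader
 * may observe the flag arbitrarily late -- exactly the "relaxed
 * constraint" described above. */
static atomic_bool disk_deleted;

void writer_del_gendisk(void)
{
	/* the deletion path sets the flag ... */
	atomic_store_explicit(&disk_deleted, true, memory_order_relaxed);
}

bool reader_submit_bio(void)
{
	/* ... but a racing submitter can still read the stale 'false'
	 * and enter the queue once more */
	return !atomic_load_explicit(&disk_deleted, memory_order_relaxed);
}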

> 
> > > +	/*
> > > +	 * Prevent new I/O from crossing bio_queue_enter().
> > > +	 */
> > > +	blk_freeze_queue_start(q);
> > > +	if (queue_is_mq(q))
> > > +		blk_mq_wake_waiters(q);
> > > +	/* Make blk_queue_enter() reexamine the DYING flag. */
> > > +	wake_up_all(&q->mq_freeze_wq);
> > > +
> > > +	rq_qos_exit(q);
> > 
> > rq_qos_exit() requires the queue to be frozen, otherwise a kernel oops
> > can be triggered. There may be previously submitted bios that haven't
> > completed yet, and rq_qos_done() can still be referring to q->rq_qos.
> 
> Yes.  I actually misread the old code - it actually does two
> blk_freeze_queue_start calls, but it also includes the wait.


Only blk_freeze_queue() includes the wait, and blk_freeze_queue_start()
doesn't.
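
For reference, the distinction being pointed at (shape of these helpers
around v5.14/v5.15, lightly simplified):

/* Kills the percpu ref and marks the freeze as started, then returns
 * immediately -- no waiting. */
void blk_freeze_queue_start(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	if (++q->mq_freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);
		mutex_unlock(&q->mq_freeze_lock);
		if (queue_is_mq(q))
			blk_mq_run_hw_queues(q, false);
	} else {
		mutex_unlock(&q->mq_freeze_lock);
	}
}

/* Additionally waits until all in-flight users of q->q_usage_counter
 * have drained -- the wait blk_freeze_queue_start() skips. */
void blk_freeze_queue(struct request_queue *q)
{
	blk_freeze_queue_start(q);
	blk_mq_freeze_queue_wait(q);
}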

> 
> > But if a wait for the queue to become frozen is added, one extra freeze
> > is added, which will slow down removing the disk/queue.
> 
> Is it?  For the typical case the second freeze in blk_cleanup_queue will
> be basically free because there is no pending I/O.  The only case where
> that matters is if there is pending passthrough I/O again, which can only
> happen with SCSI, and even there it is highly unusual.


An RCU grace period is involved in blk_freeze_queue(). One way you can
avoid it is to keep the percpu ref in atomic mode when running
blk_mq_unfreeze_queue() in del_gendisk().
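
A sketch of that idea under stated assumptions: the helper below is
hypothetical; percpu_ref_resurrect() and percpu_ref_switch_to_atomic()
are the real percpu_ref API, but upstream may well structure this
differently:

/*
 * Hypothetical: unfreeze in del_gendisk() but pin q_usage_counter in
 * atomic mode, so the second freeze in blk_cleanup_queue() does not pay
 * for another percpu -> atomic switch (the source of the RCU grace
 * period mentioned above).
 */
static void blk_mq_unfreeze_queue_keep_atomic(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	if (!--q->mq_freeze_depth) {
		/* bring the ref back from DEAD ... */
		percpu_ref_resurrect(&q->q_usage_counter);
		/* ... but keep it in atomic mode rather than percpu mode */
		percpu_ref_switch_to_atomic(&q->q_usage_counter, NULL);
		wake_up_all(&q->mq_freeze_wq);
	}
	mutex_unlock(&q->mq_freeze_lock);
}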



Thanks,
Ming
Christoph Hellwig Sept. 17, 2021, 12:37 p.m. UTC | #6
On Fri, Sep 17, 2021 at 04:32:15PM +0800, Ming Lei wrote:
> > Is it?  For the typical case the second freeze in blk_cleanup_queue will
> > be basically free because there is no pending I/O.  The only case where
> > that matters is if there is pending passthrough I/O again, which can only
> > happen with SCSI, and even there it is highly unusual.
> 
> An RCU grace period is involved in blk_freeze_queue().

where?

> One way you can
> avoid it is to keep the percpu ref in atomic mode when running
> blk_mq_unfreeze_queue() in del_gendisk().
> 
> Thanks,
> Ming

---end quoted text---
Ming Lei Sept. 17, 2021, 1:41 p.m. UTC | #7
On Fri, Sep 17, 2021 at 02:37:19PM +0200, Christoph Hellwig wrote:
> On Fri, Sep 17, 2021 at 04:32:15PM +0800, Ming Lei wrote:
> > > Is it?  For the typical case the second freeze in blk_cleanup_queue will
> > > be basically free because there is no pending I/O.  The only case where
> > > that matters is if there is pending passthrough I/O again, which can only
> > > happen with SCSI, and even there it is highly unusual.
> > 
> > An RCU grace period is involved in blk_freeze_queue().
> 
> where?

__percpu_ref_switch_to_atomic().

-- 
Ming

Patch

diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 6d9152031a40..6aef88d772a3 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -145,6 +145,12 @@ extern void __scsi_remove_device(struct scsi_device *);
 extern struct bus_type scsi_bus_type;
 extern const struct attribute_group *scsi_sysfs_shost_attr_groups[];
 
+static inline void scsi_sysfs_cleanup_sdev(struct scsi_device *sdev)
+{
+	sdev->request_queue_cleaned = true;
+	blk_cleanup_queue(sdev->request_queue);
+}
+
 /* scsi_netlink.c */
 #ifdef CONFIG_SCSI_NETLINK
 extern void scsi_netlink_init(void);
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 86793259e541..9d54fc9c89ea 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1463,7 +1463,8 @@ void __scsi_remove_device(struct scsi_device *sdev)
 	scsi_device_set_state(sdev, SDEV_DEL);
 	mutex_unlock(&sdev->state_mutex);
 
-	blk_cleanup_queue(sdev->request_queue);
+	if (!sdev->request_queue_cleaned)
+		blk_cleanup_queue(sdev->request_queue);
 	cancel_work_sync(&sdev->requeue_work);
 
 	if (sdev->host->hostt->slave_destroy)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index cbd9999f93a6..8132f9833488 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3498,6 +3498,8 @@ static int sd_remove(struct device *dev)
 	del_gendisk(sdkp->disk);
 	sd_shutdown(dev);
 
+	scsi_sysfs_cleanup_sdev(to_scsi_device(dev));
+
 	free_opal_dev(sdkp->opal_dev);
 
 	mutex_lock(&sd_ref_mutex);
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 8b17b35283aa..ef85940f3168 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -64,7 +64,7 @@ 
 
 #include "scsi_logging.h"
 #include "sr.h"
-
+#include "scsi_priv.h"
 
 MODULE_DESCRIPTION("SCSI cdrom (sr) driver");
 MODULE_LICENSE("GPL");
@@ -1045,6 +1045,8 @@ static int sr_remove(struct device *dev)
 	del_gendisk(cd->disk);
 	dev_set_drvdata(dev, NULL);
 
+	scsi_sysfs_cleanup_sdev(to_scsi_device(dev));
+
 	mutex_lock(&sr_ref_mutex);
 	kref_put(&cd->kref, sr_kref_release);
 	mutex_unlock(&sr_ref_mutex);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 09a17f6e93a7..d102bc2c423b 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -207,6 +207,7 @@ struct scsi_device {
 	unsigned rpm_autosuspend:1;	/* Enable runtime autosuspend at device
 					 * creation time */
 	unsigned ignore_media_change:1; /* Ignore MEDIA CHANGE on resume */
+	unsigned request_queue_cleaned:1; /* Request queue has been cleaned up */
 
 	bool offline_already;		/* Device offline message logged */