diff mbox

[v2,1/1] block: fix blk_queue_split() resource exhaustion

Message ID CAD9gYJ+U7drRjswfOf310v2aSgUKAnYT4STJcfxwqPhBzXEccg@mail.gmail.com
State New
Headers show

Commit Message

Jack Wang Jan. 5, 2017, 10:54 a.m. UTC
2017-01-04 19:50 GMT+01:00 Mike Snitzer <snitzer@redhat.com>:
> On Wed, Jan 04 2017 at 12:12am -0500,

> NeilBrown <neilb@suse.com> wrote:

>

>> On Tue, Jan 03 2017, Jack Wang wrote:

>>

>> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:

>> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:

>> >>> Dear Maintainers

>> >>>

>> >>> I'd like to ask for the status of this patch since we hit the

>> >>> issue too during our testing on md raid1.

>> >>>

>> >>> Split remainder bio_A was queued ahead, following by bio_B for

>> >>> lower device, at this moment raid start freezing, the loop take

>> >>> out bio_A firstly and deliver it, which will hung since raid is

>> >>> freezing, while the freezing never end since it waiting for

>> >>> bio_B to finish, and bio_B is still on the queue, waiting for

>> >>> bio_A to finish...

>> >>>

>> >>> We're looking for a good solution and we found this patch

>> >>> already progressed a lot, but we can't find it on linux-next,

>> >>> so we'd like to ask are we still planning to have this fix

>> >>> in upstream?

>> >>

>> >> I don't see why not, I'd even like to have it in older kernels,

>> >> but did not have the time and energy to push it.

>> >>

>> >> Thanks for the bump.

>> >>

>> >>         Lars

>> >>

>> > Hi folks,

>> >

>> > As Michael mentioned, we hit a bug this patch is trying to fix.

>> > Neil suggested another way to fix it.  I attached below.

>> > I personal prefer Neil's version as it's less code change, and straight forward.

>> >

>> > Could you share your comments, we can get one fix into mainline.

>> >

>> > Thanks,

>> > Jinpu

>> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001

>> > From: NeilBrown <neilb@suse.com>

>> > Date: Wed, 14 Dec 2016 16:55:52 +0100

>> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

>> >

>> > When we call wait_barrier, we might have some bios waiting

>> > in current->bio_list, which prevents the array_freeze call to

>> > complete. Those can only be internal READs, which have already

>> > passed the wait_barrier call (thus incrementing nr_pending), but

>> > still were not submitted to the lower level, due to generic_make_request

>> > logic to avoid recursive calls. In such case, we have a deadlock:

>> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so

>> > - internal READ bios will not be submitted, thus freeze_array will

>> > never completes.

>> >

>> > To fix this, modify generic_make_request to always sort bio_list_on_stack

>> > first with lowest level, then higher, until same level.

>> >

>> > Sent to linux-raid mail list:

>> > https://marc.info/?l=linux-raid&m=148232453107685&w=2

>> >

>>

>> This should probably also have

>>

>>   Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>

>>

>> or something that, as I was building on Lars' ideas when I wrote this.

>>

>> It would also be worth noting in the description that this addresses

>> issues with dm and drbd as well as md.

>

> I never saw this patch but certainly like the relative simplicity of the

> solution when compared with other approaches taken, e.g. (5 topmost

> commits on this branch):

> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

>

>> In fact, I think that with this patch in place, much of the need for the

>> rescue_workqueue won't exist any more.  I cannot promise it can be

>> removed completely, but it should be to hard to make it optional and

>> only enabled for those few block devices that will still need it.

>> The rescuer should only be needed for a bioset which can be allocated

>> From twice in the one call the ->make_request_fn.  This would include

>> raid0 for example, though raid0_make_reqest could be re-written to not

>> use a loop and to just call generic_make_request(bio) if bio != split.

>

> Mikulas, would you be willing to try the below patch with the

> dm-snapshot deadlock scenario and report back on whether it fixes that?

>

> Patch below looks to be the same as here:

> https://marc.info/?l=linux-raid&m=148232453107685&q=p3

>

> Neil and/or others if that isn't the patch that should be tested please

> provide a pointer to the latest.

>

> Thanks,

> Mike


Thanks Mike,

I've rebased the patch on to Linux-4.10-rc2, and updated the
description as Neil suggested.
If Mikulas get possitive feedback, then we can go with it.

Cheers,
Jinpu
diff mbox

Patch

From 4ffaefb719c129ed51f9fcb235b945caf56de8d1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the array_freeze call to
complete. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
still were not submitted to the lower level, due to generic_make_request
logic to avoid recursive calls. In such case, we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, thus freeze_array will
never completes.

To fix this, modify generic_make_request to always sort bio_list_on_stack
first with lowest level, then higher, until same level.

This would address issuses with dm and drbd as well as md.

Sent to linux-raid mail list:
https://marc.info/?l=linux-raid&m=148232453107685&w=2

Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/blk-core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..2f74129 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2019,9 +2019,30 @@  blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
+
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {
-- 
2.7.4