
[RFC] blk-mq: Don't IPI requests on PREEMPT_RT

Message ID: 20201023110400.bx3uzsb7xy5jtsea@linutronix.de
State: New

Commit Message

Sebastian Andrzej Siewior Oct. 23, 2020, 11:04 a.m. UTC
blk_mq_complete_request_remote() will dispatch request completion to
another CPU via IPI if the CPU belongs to a different cache domain.

This breaks on PREEMPT_RT because the IPI function will complete the
request in IRQ context, which includes acquiring spinlock_t typed locks.
Completing the request in softirq on the remote CPU instead is probably
less efficient because it would require waking ksoftirqd for this task
(which runs at SCHED_OTHER).

Ignoring the IPI request and completing the request locally is probably
the best option. It can be completed either in the IRQ thread or at the
end of the routine in softirq context.

Let blk_mq_complete_need_ipi() return that there is no need for IPI on
PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Christoph Hellwig Oct. 23, 2020, 11:21 a.m. UTC | #1
> -	if (!IS_ENABLED(CONFIG_SMP) ||
> +	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
>  	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))

This needs a big fat comment explaining your rationale.  And probably
a separate if statement to make it obvious as well.
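
For illustration, the separate check with its own comment could look roughly
like this (a sketch only, not part of the posted patch):

static inline bool blk_mq_complete_need_ipi(struct request *rq)
{
	int cpu = raw_smp_processor_id();

	if (!IS_ENABLED(CONFIG_SMP) ||
	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
		return false;

	/*
	 * On PREEMPT_RT the IPI runs in hardirq context, but the completion
	 * callbacks may acquire spinlock_t locks, which are sleeping locks
	 * on RT.  Complete the request locally instead.
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		return false;

	/* same CPU or cache domain?  Complete locally */
	...
}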
Sebastian Andrzej Siewior Oct. 23, 2020, 1:52 p.m. UTC | #2
On 2020-10-23 12:21:30 [+0100], Christoph Hellwig wrote:
> > -	if (!IS_ENABLED(CONFIG_SMP) ||
> > +	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
> >  	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
> 
> This needs a big fat comment explaining your rationale.  And probably
> a separate if statement to make it obvious as well.

Okay.
How much difference does it make between completing in-softirq vs
in-IPI? I'm asking because acquiring a spinlock_t in an IPI shouldn't be
done (as per Documentation/locking/locktypes.rst). We don't have
anything in lockdep that will complain here on !RT, and with the above we
avoid the case on RT.

Sebastian
Christoph Hellwig Oct. 27, 2020, 9:26 a.m. UTC | #3
On Fri, Oct 23, 2020 at 03:52:19PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-10-23 12:21:30 [+0100], Christoph Hellwig wrote:
> > > -	if (!IS_ENABLED(CONFIG_SMP) ||
> > > +	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
> > >  	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
> > 
> > This needs a big fat comment explaining your rationale.  And probably
> > a separate if statement to make it obvious as well.
> 
> Okay.
> How much difference does it make between completing in-softirq vs
> in-IPI?

For normal non-RT builds?  This introduces another context switch, which
for the latencies we are aiming for is noticeable.

> I'm asking because acquiring a spinlock_t in an IPI shouldn't be
> done (as per Documentation/locking/locktypes.rst). We don't have
> anything in lockdep that will complain here on !RT, and with the above we
> avoid the case on RT.

At least for NVMe we aren't taking locks, but with the number of drivers
Sebastian Andrzej Siewior Oct. 27, 2020, 10:11 a.m. UTC | #4
On 2020-10-27 09:26:06 [+0000], Christoph Hellwig wrote:
> On Fri, Oct 23, 2020 at 03:52:19PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2020-10-23 12:21:30 [+0100], Christoph Hellwig wrote:
> > > > -	if (!IS_ENABLED(CONFIG_SMP) ||
> > > > +	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
> > > >  	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
> > > 
> > > This needs a big fat comment explaining your rationale.  And probably
> > > a separate if statement to make it obvious as well.
> > 
> > Okay.
> > How much difference does it make between completing in-softirq vs
> > in-IPI?
> 
> For normal non-RT builds?  This introduces another context switch, which
> for the latencies we are aiming for is noticeable.

There should be no context switch. The pending softirq should be
executed on irq_exit() from that IPI, that is:
  irq_exit()
  -> __irq_exit_rcu()
    -> invoke_softirq()
      -> __do_softirq() || do_softirq_own_stack()

Only with the command line switch `threadirqs' enabled would
invoke_softirq() wake up the `ksoftirqd' thread, which would involve
a context switch.
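
Roughly, invoke_softirq() (kernel/softirq.c, abridged) shows the two paths;
only the force-threaded case defers to ksoftirqd:

static inline void invoke_softirq(void)
{
	if (ksoftirqd_running(local_softirq_pending()))
		return;

	if (!force_irqthreads) {
		/* run the pending softirqs here, as part of irq_exit() */
		__do_softirq();		/* or do_softirq_own_stack() */
	} else {
		/* force-threaded: defer to the ksoftirqd thread instead */
		wakeup_softirqd();
	}
}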

> > I'm asking because acquiring a spinlock_t in an IPI shouldn't be
> > done (as per Documentation/locking/locktypes.rst). We don't have
> > anything in lockdep that will complain here on !RT, and with the above we
> > avoid the case on RT.
> 
> At least for NVMe we aren't taking locks, but with the number of drivers

Right. I found this in David Runge's log:

|BUG: scheduling while atomic: swapper/19/0/0x00010002
|CPU: 19 PID: 0 Comm: swapper/19 Not tainted 5.9.1-rt18-1-rt #1
|Hardware name: System manufacturer System Product Name/Pro WS X570-ACE, BIOS 1302 01/20/2020
|Call Trace:
| <IRQ>
| dump_stack+0x6b/0x88
| __schedule_bug.cold+0x89/0x97
| __schedule+0x6a4/0xa10
| preempt_schedule_lock+0x23/0x40
| rt_spin_lock_slowlock_locked+0x117/0x2c0
| rt_spin_lock_slowlock+0x58/0x80
| rt_spin_lock+0x2a/0x40
| test_clear_page_writeback+0xcd/0x310
| end_page_writeback+0x43/0x70
| end_bio_extent_buffer_writepage+0xb2/0x100 [btrfs]
| btrfs_end_bio+0x83/0x140 [btrfs]
| clone_endio+0x84/0x1f0 [dm_mod]
| blk_update_request+0x254/0x470
| blk_mq_end_request+0x1c/0x130
| flush_smp_call_function_queue+0xd5/0x1a0
| __sysvec_call_function_single+0x36/0x150
| asm_call_irq_on_stack+0x12/0x20
| </IRQ>

so the NVMe driver isn't taking any locks, but lock_page_memcg() (and
xa_lock_irqsave()) in test_clear_page_writeback() is.

Sebastian
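
The locks behind those two calls are plain spinlock_t locks, which PREEMPT_RT
turns into sleeping rtmutex-based locks; roughly (mm/page-writeback.c,
heavily abridged), the completion path above ends up in:

int test_clear_page_writeback(struct page *page)
{
	struct address_space *mapping = page_mapping(page);
	int ret;

	lock_page_memcg(page);	/* may take memcg->move_lock, a spinlock_t */
	if (mapping && mapping_use_writeback_tags(mapping)) {
		unsigned long flags;

		xa_lock_irqsave(&mapping->i_pages, flags);	/* spinlock_t */
		ret = TestClearPageWriteback(page);
		/* ... writeback tag and bdi accounting ... */
		xa_unlock_irqrestore(&mapping->i_pages, flags);
	} else {
		ret = TestClearPageWriteback(page);
	}
	/* ... dirty/writeback statistics ... */
	unlock_page_memcg(page);
	return ret;
}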
Christoph Hellwig Oct. 27, 2020, 4:07 p.m. UTC | #5
On Tue, Oct 27, 2020 at 11:11:02AM +0100, Sebastian Andrzej Siewior wrote:
> Right. I found this in David Runge's log:

True, ->bi_end_io instances can do a lot of things as long as they
are hardirq safe.

And in the end the IPI case isn't the super fast path anyway, as it
means we don't use a queue per CPU.

Is there a way to raise a softirq and preferably place it on a given
CPU without our IPI dance?  That should be a win-win situation for
everyone.
Thomas Gleixner Oct. 27, 2020, 5:05 p.m. UTC | #6
On Tue, Oct 27 2020 at 16:07, Christoph Hellwig wrote:
> On Tue, Oct 27, 2020 at 11:11:02AM +0100, Sebastian Andrzej Siewior wrote:
>> Right. I found this in David Runge's log:
>
> True, ->bi_end_io instances can do a lot of things as long as they
> are hardirq safe.
>
> And in the end the IPI case isn't the super fast path anyway, as it
> means we don't use a queue per CPU.
>
> Is there a way to raise a softirq and preferably place it on a given
> CPU without our IPI dance?  That should be a win-win situation for
> everyone.

Not really. Softirq pending bits are strictly per cpu and we don't have
locking or atomics to set them remotely. Even if we had that, then you'd
still need a mechanism to make sure that the remote CPU actually
processes them. So you'd still need an IPI of some sorts.

Thanks,

        tglx
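
For reference, raising a softirq boils down to ORing a bit into the local
CPU's pending mask, with nothing that could address another CPU's mask
(kernel/softirq.c, simplified):

void __raise_softirq_irqoff(unsigned int nr)
{
	lockdep_assert_irqs_disabled();
	trace_softirq_raise(nr);
	/* ORs into this CPU's local_softirq_pending() word */
	or_softirq_pending(1UL << nr);
}

void raise_softirq_irqoff(unsigned int nr)
{
	__raise_softirq_irqoff(nr);
	/*
	 * Outside of hard/soft interrupt context nothing will notice the
	 * pending bit on irq_exit(), so wake ksoftirqd to process it.
	 */
	if (!in_interrupt())
		wakeup_softirqd();
}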
Christoph Hellwig Oct. 27, 2020, 5:23 p.m. UTC | #7
On Tue, Oct 27, 2020 at 06:05:15PM +0100, Thomas Gleixner wrote:
> > Is there a way to raise a softirq and preferably place it on a given
> > CPU without our IPI dance?  That should be a win-win situation for
> > everyone.
> 
> Not really. Softirq pending bits are strictly per cpu and we don't have
> locking or atomics to set them remotely. Even if we had that, then you'd
> still need a mechanism to make sure that the remote CPU actually
> processes them. So you'd still need an IPI of some sorts.

Ok.  I was hoping we could hide this in core code somehow, especially
as peterz didn't like the use of smp_call_function_single_async in the
blk-mq completion code very much.

Sebastian, would this solve your preempt-rt and lockdep issues?


diff --git a/block/blk-mq.c b/block/blk-mq.c
index cdced4aca2e812..5c125fb11b5691 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -626,19 +626,7 @@ static void __blk_mq_complete_request_remote(void *data)
 {
 	struct request *rq = data;
 
-	/*
-	 * For most of single queue controllers, there is only one irq vector
-	 * for handling I/O completion, and the only irq's affinity is set
-	 * to all possible CPUs.  On most of ARCHs, this affinity means the irq
-	 * is handled on one specific CPU.
-	 *
-	 * So complete I/O requests in softirq context in case of single queue
-	 * devices to avoid degrading I/O performance due to irqsoff latency.
-	 */
-	if (rq->q->nr_hw_queues == 1)
-		blk_mq_trigger_softirq(rq);
-	else
-		rq->q->mq_ops->complete(rq);
+	blk_mq_trigger_softirq(rq);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
Sebastian Andrzej Siewior Oct. 27, 2020, 5:59 p.m. UTC | #8
On 2020-10-27 17:23:09 [+0000], Christoph Hellwig wrote:
> On Tue, Oct 27, 2020 at 06:05:15PM +0100, Thomas Gleixner wrote:
> > > Is there a way to raise a softirq and preferably place it on a given
> > > CPU without our IPI dance?  That should be a win-win situation for
> > > everyone.
> > 
> > Not really. Softirq pending bits are strictly per cpu and we don't have
> > locking or atomics to set them remotely. Even if we had that, then you'd
> > still need a mechanism to make sure that the remote CPU actually
> > processes them. So you'd still need an IPI of some sorts.
> 
> Ok.  I was hoping we could hide this in core code somehow, especially
> as peterz didn't like the use of smp_call_function_single_async in the
> blk-mq completion code very much.
> 
> Sebastian, would this solve your preempt-rt and lockdep issues?

second. I'm cooking something.

Sebastian
Sebastian Andrzej Siewior Oct. 27, 2020, 8:58 p.m. UTC | #9
On 2020-10-27 17:23:09 [+0000], Christoph Hellwig wrote:
> Ok.  I was hoping we could hide this in core code somehow, especially
> as peterz didn't like the use of smp_call_function_single_async in the
> blk-mq completion code very much.

No idea how you could efficiently avoid smp_call_function_single_async();
the alternatives would be:
- a workqueue (okay)
- a timer_list timer which expires immediately. More code, plus it may delay
  the completion by up to a tick.

> Sebastian, would this solve your preempt-rt and lockdep issues?

the problem with that is that on RT/force-threaded it will always wake the
ksoftirqd thread and complete there. Those are extra steps which should
probably be avoided.

Now.
The hunk in blk_mq_complete_need_ipi() will avoid waking a thread
when force-threaded interrupts are enabled.

The remaining part is a switch to llist, which avoids locking (IRQ
off/on) and allows raising the IPI/softirq only when the first entry is
added to an empty list. The entries are now processed in reverse order,
but this shouldn't matter, right?
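
(The "only when the list was empty" behaviour comes from llist_add()'s
return value; roughly, from include/linux/llist.h:)

/* Returns true if the list was empty prior to adding this entry. */
static inline bool llist_add(struct llist_node *new, struct llist_head *head)
{
	return llist_add_batch(new, new, head);
}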

I would split this into two patches (the blk_mq_complete_need_ipi() hunk
and the llist part) unless there are objections.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 55bcee5dc0320..d2452ee9b0e2c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -41,7 +41,7 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
+static DEFINE_PER_CPU(struct llist_head, blk_cpu_done);
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
@@ -565,80 +565,31 @@ void blk_mq_end_request(struct request *rq, blk_status_t error)
 }
 EXPORT_SYMBOL(blk_mq_end_request);
 
-/*
- * Softirq action handler - move entries to local list and loop over them
- * while passing them to the queue registered handler.
- */
-static __latent_entropy void blk_done_softirq(struct softirq_action *h)
+static void blk_complete_reqs(struct llist_head *cpu_list)
 {
-	struct list_head *cpu_list, local_list;
+	struct llist_node *entry;
+	struct request *rq, *rq_next;
 
-	local_irq_disable();
-	cpu_list = this_cpu_ptr(&blk_cpu_done);
-	list_replace_init(cpu_list, &local_list);
-	local_irq_enable();
+	entry = llist_del_all(cpu_list);
 
-	while (!list_empty(&local_list)) {
-		struct request *rq;
-
-		rq = list_entry(local_list.next, struct request, ipi_list);
-		list_del_init(&rq->ipi_list);
+	llist_for_each_entry_safe(rq, rq_next, entry, ipi_list)
 		rq->q->mq_ops->complete(rq);
-	}
 }
 
-static void blk_mq_trigger_softirq(struct request *rq)
+static __latent_entropy void blk_done_softirq(struct softirq_action *h)
 {
-	struct list_head *list;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	list = this_cpu_ptr(&blk_cpu_done);
-	list_add_tail(&rq->ipi_list, list);
-
-	/*
-	 * If the list only contains our just added request, signal a raise of
-	 * the softirq.  If there are already entries there, someone already
-	 * raised the irq but it hasn't run yet.
-	 */
-	if (list->next == &rq->ipi_list)
-		raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_restore(flags);
+	blk_complete_reqs(this_cpu_ptr(&blk_cpu_done));
 }
 
 static int blk_softirq_cpu_dead(unsigned int cpu)
 {
-	/*
-	 * If a CPU goes away, splice its entries to the current CPU
-	 * and trigger a run of the softirq
-	 */
-	local_irq_disable();
-	list_splice_init(&per_cpu(blk_cpu_done, cpu),
-			 this_cpu_ptr(&blk_cpu_done));
-	raise_softirq_irqoff(BLOCK_SOFTIRQ);
-	local_irq_enable();
-
+	blk_complete_reqs(&per_cpu(blk_cpu_done, cpu));
 	return 0;
 }
 
-
 static void __blk_mq_complete_request_remote(void *data)
 {
-	struct request *rq = data;
-
-	/*
-	 * For most of single queue controllers, there is only one irq vector
-	 * for handling I/O completion, and the only irq's affinity is set
-	 * to all possible CPUs.  On most of ARCHs, this affinity means the irq
-	 * is handled on one specific CPU.
-	 *
-	 * So complete I/O requests in softirq context in case of single queue
-	 * devices to avoid degrading I/O performance due to irqsoff latency.
-	 */
-	if (rq->q->nr_hw_queues == 1)
-		blk_mq_trigger_softirq(rq);
-	else
-		rq->q->mq_ops->complete(rq);
+	__raise_softirq_irqoff(BLOCK_SOFTIRQ);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
@@ -648,6 +599,14 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 	if (!IS_ENABLED(CONFIG_SMP) ||
 	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
 		return false;
+	/*
+	 * With force threaded interrupts enabled, raising softirq from an SMP
+	 * function call will always result in waking the ksoftirqd thread.
+	 * This is probably worse than completing the request on a different
+	 * cache domain.
+	 */
+	if (force_irqthreads)
+		return false;
 
 	/* same CPU or cache domain?  Complete locally */
 	if (cpu == rq->mq_ctx->cpu ||
@@ -661,6 +620,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 
 bool blk_mq_complete_request_remote(struct request *rq)
 {
+	struct llist_head *cpu_list;
 	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
 
 	/*
@@ -671,14 +631,21 @@ bool blk_mq_complete_request_remote(struct request *rq)
 		return false;
 
 	if (blk_mq_complete_need_ipi(rq)) {
-		rq->csd.func = __blk_mq_complete_request_remote;
-		rq->csd.info = rq;
-		rq->csd.flags = 0;
-		smp_call_function_single_async(rq->mq_ctx->cpu, &rq->csd);
+		unsigned int cpu;
+
+		cpu = rq->mq_ctx->cpu;
+		cpu_list = &per_cpu(blk_cpu_done, cpu);
+		if (llist_add(&rq->ipi_list, cpu_list)) {
+			rq->csd.func = __blk_mq_complete_request_remote;
+			rq->csd.flags = 0;
+			smp_call_function_single_async(cpu, &rq->csd);
+		}
 	} else {
 		if (rq->q->nr_hw_queues > 1)
 			return false;
-		blk_mq_trigger_softirq(rq);
+		cpu_list = this_cpu_ptr(&blk_cpu_done);
+		if (llist_add(&rq->ipi_list, cpu_list))
+			raise_softirq(BLOCK_SOFTIRQ);
 	}
 
 	return true;
@@ -3909,7 +3876,7 @@ static int __init blk_mq_init(void)
 	int i;
 
 	for_each_possible_cpu(i)
-		INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));
+		init_llist_head(&per_cpu(blk_cpu_done, i));
 	open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
 
 	cpuhp_setup_state_nocalls(CPUHP_BLOCK_SOFTIRQ_DEAD,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 639cae2c158b5..331b2b675b417 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -156,7 +156,7 @@ struct request {
 	 */
 	union {
 		struct hlist_node hash;	/* merge hash */
-		struct list_head ipi_list;
+		struct llist_node ipi_list;
 	};
 
 	/*
Christoph Hellwig Oct. 28, 2020, 6:56 a.m. UTC | #10
> The remaining part is a switch to llist, which avoids locking (IRQ
> off/on) and allows raising the IPI/softirq only when the first entry is
> added to an empty list. The entries are now processed in reverse order,
> but this shouldn't matter, right?

For correctness it should not matter, but I think it could have
performance implications.  I think you'll have to throw in a
llist_reverse_order.
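
That would be a small change to blk_complete_reqs() in the patch above,
something like:

static void blk_complete_reqs(struct llist_head *cpu_list)
{
	struct llist_node *entry;
	struct request *rq, *rq_next;

	/* restore submission (FIFO) order before completing */
	entry = llist_reverse_order(llist_del_all(cpu_list));

	llist_for_each_entry_safe(rq, rq_next, entry, ipi_list)
		rq->q->mq_ops->complete(rq);
}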

> I would split this into two patches (the blk_mq_complete_need_ipi() hunk
> and the llist part) unless there are objections.

Yes, please do.
Peter Zijlstra Oct. 28, 2020, 10:04 a.m. UTC | #11
On Tue, Oct 27, 2020 at 05:23:09PM +0000, Christoph Hellwig wrote:
> Ok.  I was hoping we could hide this in core code somehow, especially
> as peterz didn't like the use of smp_call_function_single_async in the
> blk-mq completion code very much.

It's smp_call_function_single_async() in general that I don't much like.
But Linus seemed unconvinced, so we'll keep it for a while I suppose.
diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e37aa31332b70..99d2fb51e0e84 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -647,7 +647,7 @@  static inline bool blk_mq_complete_need_ipi(struct request *rq)
 {
 	int cpu = raw_smp_processor_id();
 
-	if (!IS_ENABLED(CONFIG_SMP) ||
+	if (!IS_ENABLED(CONFIG_SMP) || IS_ENABLED(CONFIG_PREEMPT_RT) ||
 	    !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags))
 		return false;