Message ID: 20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de
Series: honor isolcpus configuration
On Tue, 6 Aug 2024 at 08:10, Daniel Wagner <dwagner@suse.de> wrote:
> The only stall I was able to trigger reliably was with qemu's PCI
> emulation. It looks like when a CPU is offlined, the PCI affinity is
> reprogrammed but qemu still routes IRQs to an offline CPU instead of
> the newly programmed destination CPU. All worked fine on real hardware.

Hi Daniel,

Please file a QEMU bug report here (or just reply to this email with
details on how to reproduce the issue and I'll file the issue on your
behalf): https://gitlab.com/qemu-project/qemu/-/issues

We can also wait until your Linux patches have landed if that makes it
easier to reproduce the bug.

Thanks!

Stefan
On Tue, Aug 06, 2024 at 02:06:33PM +0200, Daniel Wagner wrote:
> blk_mq_pci_map_queues maps all queues but right after this, we
> overwrite these mappings by calling blk_mq_map_queues. Just use one
> helper but not both.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled all hardware queues should run on the
> housekeeping CPUs only. Thus ignore the affinity mask provided by the
> driver. Also we can't use blk_mq_map_queues because it will map all
> CPUs to the first hctx unless the CPU is the same as the one the hctx
> has its affinity set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7
> config

What is the expected behavior if someone still tries to submit IO on
isolated CPUs?

BTW, I don't see any change in blk_mq_get_ctx()/blk_mq_map_queue() in
this patchset; that means one random hctx (or even NULL) may be used for
submitting IO from isolated CPUs, so there can be an IO hang risk during
cpu hotplug, or a kernel panic when submitting a bio.

Thanks,
Ming
On 06/08/2024 13:06, Daniel Wagner wrote:
> blk_mq_pci_map_queues maps all queues but right after this, we
> overwrite these mappings by calling blk_mq_map_queues. Just use one
> helper but not both.
>
> Fixes: 42f22fe36d51 ("scsi: pm8001: Expose hardware queues for pm80xx")
> Signed-off-by: Daniel Wagner <dwagner@suse.de>

Reviewed-by: John Garry <john.g.garry@oracle.com>
On Tue, Aug 06, 2024 at 09:09:50AM GMT, Stefan Hajnoczi wrote:
> On Tue, 6 Aug 2024 at 08:10, Daniel Wagner <dwagner@suse.de> wrote:
> > The only stall I was able to trigger reliably was with qemu's PCI
> > emulation. It looks like when a CPU is offlined, the PCI affinity is
> > reprogrammed but qemu still routes IRQs to an offline CPU instead of
> > the newly programmed destination CPU. All worked fine on real
> > hardware.
>
> Hi Daniel,
> Please file a QEMU bug report here (or just reply to this email with
> details on how to reproduce the issue and I'll file the issue on your
> behalf):
> https://gitlab.com/qemu-project/qemu/-/issues
>
> We can also wait until your Linux patches have landed if that makes it
> easier to reproduce the bug.

Thanks for the offer. I tried to simplify the setup and came up with a
reproducer using qemu directly instead of libvirt. And now it works just
fine. I'll try to figure out what the magic argument is...
On Tue, Aug 06, 2024 at 10:55:09PM GMT, Ming Lei wrote:
> On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> > When isolcpus=io_queue is enabled all hardware queues should run on the
> > housekeeping CPUs only. Thus ignore the affinity mask provided by the
> > driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> > to the first hctx unless the CPU is the same as the one the hctx has its
> > affinity set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
>
> What is the expected behavior if someone still tries to submit IO on
> isolated CPUs?

If a user thread is issuing an IO, the IO is handled by a housekeeping
CPU, which will cause some noise on the submitting CPU. As far as I was
told this is acceptable. Our customers really don't want any IO other
than their application's own ever hitting the isolated CPUs.

> BTW, I don't see any change in blk_mq_get_ctx()/blk_mq_map_queue() in
> this patchset,

I was trying to figure out what you tried to explain last time with
hangs, but didn't really understand what the conditions are for this
problem to occur.

> that means one random hctx (or even NULL) may be used for submitting
> IO from isolated CPUs, then there can be an IO hang risk during cpu
> hotplug, or a kernel panic when submitting a bio.

Can you elaborate a bit more? I must be missing something important
here. Anyway, my understanding is that when the last CPU of a hctx goes
offline the affinity is broken and assigned to an online HK CPU. And we
ensure all in-flight IOs have finished and also ensure we don't submit
any new IO to a CPU which goes offline. FWIW, I tried really hard to
trigger an IO hang with cpu hotplug.
On Wed, Aug 07, 2024 at 02:40:11PM +0200, Daniel Wagner wrote:
> On Tue, Aug 06, 2024 at 10:55:09PM GMT, Ming Lei wrote:
> > On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> > > When isolcpus=io_queue is enabled all hardware queues should run on the
> > > housekeeping CPUs only. Thus ignore the affinity mask provided by the
> > > driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> > > to the first hctx unless the CPU is the same as the one the hctx has its
> > > affinity set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
> >
> > What is the expected behavior if someone still tries to submit IO on
> > isolated CPUs?
>
> If a user thread is issuing an IO the IO is handled by the housekeeping
> CPU, which will cause some noise on the submitting CPU. As far I was
> told this is acceptable. Our customers really don't want to have any
> IO not from their application ever hitting the isolcpus.
>
> > BTW, I don't see any change in blk_mq_get_ctx()/blk_mq_map_queue() in
> > this patchset,
>
> I was trying to figure out what you tried to explain last time with
> hangs, but didn't really understand what the conditions are for this
> problem to occur.

Isolated CPUs are removed from the queue mapping in this patchset; when
someone submits IOs from an isolated CPU, what is the correct hctx used
for handling these IOs?
On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> Isolated CPUs are removed from the queue mapping in this patchset; when
> someone submits IOs from an isolated CPU, what is the correct hctx used
> for handling these IOs?

No, every possible CPU gets a mapping. What this patch series does is
limit/align the number of hardware contexts to the number of
housekeeping CPUs. There is still a complete ctx-hctx mapping. So
whenever a user thread on an isolated CPU is issuing an IO, a
housekeeping CPU will also be involved (with the additional overhead,
which seems to be okay for these users).

Not having hardware queues on the isolated CPUs ensures we really never
get any unexpected IO on those CPUs unless userspace does it on its own.
It's a safety net.

Just to illustrate it, the non-isolcpus configuration (default) map for
an 8 CPU setup:

queue mapping for /dev/vda
        hctx0: default 0
        hctx1: default 1
        hctx2: default 2
        hctx3: default 3
        hctx4: default 4
        hctx5: default 5
        hctx6: default 6
        hctx7: default 7

and with isolcpus=io_queue,2-3,6-7:

queue mapping for /dev/vda
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7

> From the current implementation, it depends on the implied zero filled
> tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
>
> During CPU offline, in blk_mq_hctx_notify_offline(),
> blk_mq_hctx_has_online_cpu() returns true even though the last cpu in
> hctx 0 is offline because isolated cpus join hctx 0 unexpectedly, so
> IOs in hctx 0 won't be drained.
>
> However managed irq core code still shuts down the hw queue's irq
> because all CPUs in this hctx are offline now. Then an IO hang is
> triggered, isn't it?

Thanks for the explanation. I was able to reproduce this scenario, that
is, a hardware context with two CPUs which both go offline. Initially I
used fio for creating the workload but this never hit the hang. Instead
some background workload from systemd-journald is pretty reliable at
triggering the hang you describe.
Example:

hctx2: default 4 6

CPU 0 stays online, CPUs 1-5 are offline. CPU 6 is offlined:

smpboot: CPU 5 is now offline
blk_mq_hctx_has_online_cpu:3537 hctx3 offline
blk_mq_hctx_has_online_cpu:3537 hctx2 offline

and there is no forward progress anymore; the cpuhotplug state machine
is blocked and an IO is hanging:

# grep busy /sys/kernel/debug/block/*/hctx*/tags | grep -v busy=0
/sys/kernel/debug/block/vda/hctx2/tags:busy=61

and blk_mq_hctx_notify_offline busy loops forever:

task:cpuhp/6 state:D stack:0 pid:439 tgid:439 ppid:2 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x79d/0x15c0
 ? lockdep_hardirqs_on_prepare+0x152/0x210
 ? kvm_sched_clock_read+0xd/0x20
 ? local_clock_noinstr+0x28/0xb0
 ? local_clock+0x11/0x30
 ? lock_release+0x122/0x4a0
 schedule+0x3d/0xb0
 schedule_timeout+0x88/0xf0
 ? __pfx_process_timeout+0x10/0x10
 msleep+0x28/0x40
 blk_mq_hctx_notify_offline+0x1b5/0x200
 ? cpuhp_thread_fun+0x41/0x1f0
 cpuhp_invoke_callback+0x27e/0x780
 ? __pfx_blk_mq_hctx_notify_offline+0x10/0x10
 ? cpuhp_thread_fun+0x42/0x1f0
 cpuhp_thread_fun+0x178/0x1f0
 smpboot_thread_fn+0x12e/0x1c0
 ? __pfx_smpboot_thread_fn+0x10/0x10
 kthread+0xe8/0x110
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x33/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

I don't think this is a new problem this code introduces. The problem
exists for any hardware context which has more than one CPU. As far as I
understand it, the problem is that there is no forward progress possible
for the IO itself (I assume the corresponding resources for the CPU
going offline have already been shut down, thus no progress?) and
blk_mq_hctx_notify_offline isn't doing anything in this scenario.

Couldn't we do something like:

+static bool blk_mq_hctx_timeout_rq(struct request *rq, void *data)
+{
+	blk_mq_rq_timed_out(rq);
+	return true;
+}
+
+static void blk_mq_hctx_timeout_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_tags *tags = hctx->sched_tags ?
+			hctx->sched_tags : hctx->tags;
+
+	blk_mq_all_tag_iter(tags, blk_mq_hctx_timeout_rq, NULL);
+}
+
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 {
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_online);
+	int i;

 	if (blk_mq_hctx_has_online_cpu(hctx, cpu))
 		return 0;
@@ -3551,9 +3589,16 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	 * requests. If we could not grab a reference the queue has been
 	 * frozen and there are no requests.
 	 */
+	i = 0;
 	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
-		while (blk_mq_hctx_has_requests(hctx))
+		while (blk_mq_hctx_has_requests(hctx) && i++ < 10)
 			msleep(5);
+		if (blk_mq_hctx_has_requests(hctx)) {
+			pr_info("%s:%d hctx %d force timeout request\n",
+				__func__, __LINE__, hctx->queue_num);
+			blk_mq_hctx_timeout_rqs(hctx);
+		}

This guarantees forward progress and it worked in my test scenario; I
got the corresponding log entry

blk_mq_hctx_notify_offline:3598 hctx 2 force timeout request

and the hotplug state machine continued. I didn't see an IO error
either, but I haven't looked closely; this is just a POC.

BTW, when looking at the tag allocator, I didn't see any hctx state
checks for the batched allocation path. Don't we need to check if the
corresponding hardware context is active there too?

@@ -486,6 +487,15 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 	if (data->nr_tags > 1) {
 		rq = __blk_mq_alloc_requests_batch(data);
 		if (rq) {
+			if (unlikely(test_bit(BLK_MQ_S_INACTIVE,
+					      &data->hctx->state))) {
+				blk_mq_put_tag(blk_mq_tags_from_data(data),
+					       rq->mq_ctx, rq->tag);
+				msleep(3);
+				goto retry;
+			}
 			blk_mq_rq_time_init(rq, alloc_time_ns);
 			return rq;
 		}

But given this is the hotpath and the hotplug path is very unlikely to
be used at all, at least for the majority of users, I would suggest
trying to get blk_mq_hctx_notify_offline to guarantee forward progress.
This would make the hotpath an 'if' less.
On Fri, Aug 09, 2024 at 09:22:11AM +0200, Daniel Wagner wrote:
> On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> > Isolated CPUs are removed from queue mapping in this patchset, when
> > someone submits IOs from the isolated CPU, what is the correct hctx
> > used for handling these IOs?
>
> No, every possible CPU gets a mapping. What this patch series does, is
> to limit/align the number of hardware contexts to the number of
> housekeeping CPUs. There is still a complete ctx-hctx mapping. So

OK, then I guess patches 1~7 aren't supposed to belong to this series,
because you just want to reduce nr_hw_queues, meantime spread the
housekeeping CPUs first for avoiding queues with an all-isolated cpu
mask.

> whenever a user thread on an isolated CPU is issuing an IO a
> housekeeping CPU will also be involved (with the additional overhead,
> which seems to be okay for these users).
>
> Not having hardware queues on the isolated CPUs ensures we really never
> get any unexpected IO on those CPUs unless userspace does it on its
> own. It's a safety net.
>
> Just to illustrate it, the non-isolcpus configuration (default) map
> for an 8 CPU setup:
>
> queue mapping for /dev/vda
>         hctx0: default 0
>         hctx1: default 1
>         hctx2: default 2
>         hctx3: default 3
>         hctx4: default 4
>         hctx5: default 5
>         hctx6: default 6
>         hctx7: default 7
>
> and with isolcpus=io_queue,2-3,6-7
>
> queue mapping for /dev/vda
>         hctx0: default 0 2
>         hctx1: default 1 3
>         hctx2: default 4 6
>         hctx3: default 5 7

OK, looks like I missed the point in patch 15, in which you added
isolated CPUs into the mapping manually. Just wondering why not take the
current two-stage policy to cover both housekeeping and isolated CPUs in
group_cpus_evenly()? Such as spreading housekeeping CPUs first, then
isolated CPUs, just like what we did for present & non-present cpus.
Then the whole patchset can be simplified a lot.

> > From the current implementation, it depends on the implied zero filled
> > tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
> >
> > During CPU offline, in blk_mq_hctx_notify_offline(),
> > blk_mq_hctx_has_online_cpu() returns true even though the last cpu in
> > hctx 0 is offline because isolated cpus join hctx 0 unexpectedly, so
> > IOs in hctx 0 won't be drained.
> >
> > However managed irq core code still shuts down the hw queue's irq
> > because all CPUs in this hctx are offline now. Then an IO hang is
> > triggered, isn't it?
>
> Thanks for the explanation. I was able to reproduce this scenario, that
> is a hardware context with two CPUs which go offline. Initially, I used
> fio for creating the workload but this never hit the hang. Instead
> some background workload from systemd-journald is pretty reliable at
> triggering the hang you describe.
>
> Example:
>
> hctx2: default 4 6
>
> CPU 0 stays online, CPUs 1-5 are offline. CPU 6 is offlined:
>
> smpboot: CPU 5 is now offline
> blk_mq_hctx_has_online_cpu:3537 hctx3 offline
> blk_mq_hctx_has_online_cpu:3537 hctx2 offline
>
> and there is no forward progress anymore, the cpuhotplug state machine
> is blocked and an IO is hanging:
>
> # grep busy /sys/kernel/debug/block/*/hctx*/tags | grep -v busy=0
> /sys/kernel/debug/block/vda/hctx2/tags:busy=61
>
> and blk_mq_hctx_notify_offline busy loops forever:
>
> task:cpuhp/6 state:D stack:0 pid:439 tgid:439 ppid:2 flags:0x00004000
> Call Trace:
>  <TASK>
>  __schedule+0x79d/0x15c0
>  [...]
>  blk_mq_hctx_notify_offline+0x1b5/0x200
>  [...]
>  ret_from_fork_asm+0x1a/0x30
>  </TASK>

When blk_mq_hctx_notify_offline() is running, the current CPU isn't
offline yet, and the hctx is active, same with the managed irq, so it is
fine to wait there until all in-flight IOs originated from this hctx are
completed. The forward progress is provided by blk-mq.

The question is why these requests can't be completed. They are very
likely allocated & submitted from CPU6.

Can you figure out what the effective mask for the irq of hctx2 is? It
is supposed to be cpu6. And block debugfs for vda should provide helpful
hints.

> going offline have already been shut down, thus no progress?) and
> blk_mq_hctx_notify_offline isn't doing anything in this scenario.

RH has an internal cpu hotplug stress test, but we have not seen such a
report so far. I will try to set up this kind of configuration and see
if it can be reproduced.

> Couldn't we do something like:

I usually won't think about any solution until the root cause is figured
out, :-)

Thanks,
Ming
On Fri, Aug 09, 2024 at 10:53:16PM GMT, Ming Lei wrote:
> On Fri, Aug 09, 2024 at 09:22:11AM +0200, Daniel Wagner wrote:
> > On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> > > Isolated CPUs are removed from queue mapping in this patchset, when
> > > someone submits IOs from the isolated CPU, what is the correct hctx
> > > used for handling these IOs?
> >
> > No, every possible CPU gets a mapping. What this patch series does, is
> > to limit/align the number of hardware contexts to the number of
> > housekeeping CPUs. There is still a complete ctx-hctx mapping. So
>
> OK, then I guess patches 1~7 aren't supposed to belong to this series,
> because you just want to reduce nr_hw_queues, meantime spread
> housekeeping CPUs first for avoiding queues with an all-isolated cpu
> mask.

I tried to explain the reason for these patches in the cover letter. The
idea is that it makes the later changes simpler, because we only have to
touch one place. Furthermore, the caller just needs to provide an
affinity mask and the rest of the code is generic. This allows replacing
the open coded mapping code in hisi, for example. Overall I think the
resulting code is nicer and cleaner.

> OK, looks like I missed the point in patch 15 in which you added
> isolated CPUs into the mapping manually, just wondering why not take
> the current two-stage policy to cover both housekeeping and isolated
> CPUs in group_cpus_evenly()?

Patch #15 explains why this approach didn't work in its current form:
blk_mq_map_queues will map all isolated CPUs to the first hctx.

> Such as spreading housekeeping CPUs first, then isolated CPUs, just
> like what we did for present & non-present cpus.

I've experimented with this approach and it didn't work (see above).

> When blk_mq_hctx_notify_offline() is running, the current CPU isn't
> offline yet, and the hctx is active, same with the managed irq, so it
> is fine to wait there until all in-flight IOs originated from this hctx
> are completed.

But if for some reason these never complete (as in my case), this blocks
forever. Wouldn't it make sense to abort the wait after a while?

> The question is why these requests can't be completed. The forward
> progress is provided by blk-mq. And these requests are very likely
> allocated & submitted from CPU6.

Yes, I can confirm that the in-flight requests have been allocated and
submitted by the CPU which is offlined. Here is a log snippet from a
different debug session. CPUs 1 and 2 are already offline, CPU 3 is
being offlined. The CPU mapping for hctx1 is:

hctx1: default 1 3

I've added a printk to my hack timeout handler:

blk_mq_hctx_notify_offline:3600 hctx 1 force timeout request
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3

That means these requests have been allocated on CPU 3 and are still
marked as in flight. I am trying to figure out why they are not
completed as a next step.

> Can you figure out what the effective mask for the irq of hctx2 is? It
> is supposed to be cpu6. And block debugfs for vda should provide
> helpful hints.

The effective mask for the above debug output is:

queue mapping for /dev/vda
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7

PCI name is 00:02.0: vda
        irq 27 affinity 0-1 effective 0 virtio0-config
        irq 28 affinity 0 effective 0 virtio0-req.0
        irq 29 affinity 1 effective 1 virtio0-req.1
        irq 30 affinity 4 effective 4 virtio0-req.2
        irq 31 affinity 5 effective 5 virtio0-req.3

Maybe there is still something off with qemu and the IRQ routing, and
the interrupts have been delivered to the wrong CPU.

> > going offline have already been shut down, thus no progress?) and
> > blk_mq_hctx_notify_offline isn't doing anything in this scenario.
>
> RH has an internal cpu hotplug stress test, but we have not seen such
> a report so far.

Is this stress test running on real hardware? If so, it adds to my
theory that the interrupt might be lost in certain situations when
running under qemu.

> > Couldn't we do something like:
>
> I usually won't think about any solution until the root cause is
> figured out, :-)

I agree, though sometimes it is also okay to have some defensive
programming in place, such as an upper limit on the wait before giving
up. But yeah, let's focus on figuring out what's wrong.
After the discussion with Ming on managed_irq, I decided to bring back
the io_queue option because managed_irq is clearly targeting a different
use case and Ming asked not to drop support for his use case.

In an offline discussion with Frederic, I learned what the plans are
with isolcpus. The future is in cpusets, but reconfiguration happens
only on offline CPUs. I think this approach will go in this direction.

I've dug up Ming's attempt to replace blk_mq_[pci|virtio]_map_queues
with a more generic blk_mq_dev_map_queues function which takes a
callback to ask the driver for an affinity mask for a hardware queue.
With this central function in place, it's also simple to overwrite
affinities in the core, and the drivers don't have to be made aware of
isolcpus configurations.

The original attempt was to update the nvme-pci driver only. Hannes
asked me to look into the other multiqueue drivers/devices.
Unfortunately, I don't have all the hardware to test against, so this is
only tested with nvme-pci, smartpqi, qla2xxx and megaraid. The testing
also involved CPU hotplug events and I was not able to observe any
stalls, e.g. with hctxs which have online and offline CPUs. The only
stall I was able to trigger reliably was with qemu's PCI emulation. It
looks like when a CPU is offlined, the PCI affinity is reprogrammed but
qemu still routes IRQs to the offline CPU instead of to the newly
programmed destination CPU. All worked fine on real hardware.

Finally, I also added a new CPU-hctx mapping function for the isolcpus
case. Initially the blk_mq_map_queues function was used, but it turns
out this will map all isolated CPUs to the first hctx. The new function
first maps the housekeeping CPUs to the right hctx using the existing
mapping logic. The isolated CPUs are then assigned evenly to the hctxs.
I suppose this could be done a bit smarter, also considering NUMA
aspects, once we agree on this approach.

This series is based on linux-block/for-6.12/block. If this is the wrong
branch, please let me know which one is better suited. Thanks.

Initial cover letter:

The nvme-pci driver is ignoring the isolcpus configuration. There were
several attempts to fix this in the past [1][2]. This is another
attempt, but this time trying to address the feedback and solve it in
the core code.

The first patch introduces a new option for isolcpus, 'io_queue', but
I'm not really sure if this is needed or if we could just use the
managed_irq option instead. I guess it depends on whether there is a use
case which depends on queues on the isolated CPUs.

The second patch introduces a new block layer helper which returns the
number of possible queues. I suspect it would also make sense to make
this helper a bit smarter and consider the number of queues the hardware
supports as well.

And the last patch updates the group_cpus_evenly function so that it
uses only the housekeeping CPUs when they are defined.

Note this series is not addressing the affinity setting of the admin
queue (queue 0). I'd like to address this after we have agreed on how to
solve this. Currently, the admin queue affinity can be controlled by the
irq_affinity command line option, so there is at least a workaround for
it.
Baseline:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 1536 MB
node 0 free: 1227 MB
node 1 cpus: 4 5 6 7
node 1 size: 1729 MB
node 1 free: 1422 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

options nvme write_queues=4 poll_queues=4

55:  0 41  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  0-edge  nvme0q0   affinity: 0-3
63:  0  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  1-edge  nvme0q1   affinity: 4-5
64:  0  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  2-edge  nvme0q2   affinity: 6-7
65:  0  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  3-edge  nvme0q3   affinity: 0-1
66:  0  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  4-edge  nvme0q4   affinity: 2-3
67:  0  0  0  0 24  0  0  0  PCI-MSIX-0000:00:05.0  5-edge  nvme0q5   affinity: 4
68:  0  0  0  0  0  1  0  0  PCI-MSIX-0000:00:05.0  6-edge  nvme0q6   affinity: 5
69:  0  0  0  0  0  0 41  0  PCI-MSIX-0000:00:05.0  7-edge  nvme0q7   affinity: 6
70:  0  0  0  0  0  0  0  3  PCI-MSIX-0000:00:05.0  8-edge  nvme0q8   affinity: 7
71:  1  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  9-edge  nvme0q9   affinity: 0
72:  0 18  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0 10-edge  nvme0q10  affinity: 1
73:  0  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0 11-edge  nvme0q11  affinity: 2
74:  0  0  0  3  0  0  0  0  PCI-MSIX-0000:00:05.0 12-edge  nvme0q12  affinity: 3

queue mapping for /dev/nvme0n1
        hctx0: default 4 5
        hctx1: default 6 7
        hctx2: default 0 1
        hctx3: default 2 3
        hctx4: read 4
        hctx5: read 5
        hctx6: read 6
        hctx7: read 7
        hctx8: read 0
        hctx9: read 1
        hctx10: read 2
        hctx11: read 3
        hctx12: poll 4 5
        hctx13: poll 6 7
        hctx14: poll 0 1
        hctx15: poll 2 3

PCI name is 00:05.0: nvme0n1
        irq 55, cpu list 0-3, effective list 1
        irq 63, cpu list 4-5, effective list 5
        irq 64, cpu list 6-7, effective list 7
        irq 65, cpu list 0-1, effective list 1
        irq 66, cpu list 2-3, effective list 3
        irq 67, cpu list 4, effective list 4
        irq 68, cpu list 5, effective list 5
        irq 69, cpu list 6, effective list 6
        irq 70, cpu list 7, effective list 7
        irq 71, cpu list 0, effective list 0
        irq 72, cpu list 1, effective list 1
        irq 73, cpu list 2, effective list 2
        irq 74, cpu list 3, effective list 3

patched:

48:  0  0 33  0  0  0  0  0  PCI-MSIX-0000:00:05.0  0-edge  nvme0q0  affinity: 0-3
58:  0  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  1-edge  nvme0q1  affinity: 4
59:  0  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  2-edge  nvme0q2  affinity: 5
60:  0  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  3-edge  nvme0q3  affinity: 0
61:  0  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  4-edge  nvme0q4  affinity: 1
62:  0  0  0  0 45  0  0  0  PCI-MSIX-0000:00:05.0  5-edge  nvme0q5  affinity: 4
63:  0  0  0  0  0 12  0  0  PCI-MSIX-0000:00:05.0  6-edge  nvme0q6  affinity: 5
64:  2  0  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  7-edge  nvme0q7  affinity: 0
65:  0 35  0  0  0  0  0  0  PCI-MSIX-0000:00:05.0  8-edge  nvme0q8  affinity: 1

queue mapping for /dev/nvme0n1
        hctx0: default 2 3 4 6 7
        hctx1: default 5
        hctx2: default 0
        hctx3: default 1
        hctx4: read 4
        hctx5: read 5
        hctx6: read 0
        hctx7: read 1
        hctx8: poll 4
        hctx9: poll 5
        hctx10: poll 0
        hctx11: poll 1

PCI name is 00:05.0: nvme0n1
        irq 48, cpu list 0-3, effective list 2
        irq 58, cpu list 4, effective list 4
        irq 59, cpu list 5, effective list 5
        irq 60, cpu list 0, effective list 0
        irq 61, cpu list 1, effective list 1
        irq 62, cpu list 4, effective list 4
        irq 63, cpu list 5, effective list 5
        irq 64, cpu list 0, effective list 0
        irq 65, cpu list 1, effective list 1

[1] https://lore.kernel.org/lkml/20220423054331.GA17823@lst.de/T/#m9939195a465accbf83187caf346167c4242e798d
[2] https://lore.kernel.org/linux-nvme/87fruci5nj.ffs@tglx/

Signed-off-by: Daniel Wagner <dwagner@suse.de>

---
Changes in v3:
- lifted a couple of patches from
  https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/
  "virtio: add APIs for retrieving vq affinity"
  "blk-mq: introduce blk_mq_dev_map_queues"
- replaced all users of blk_mq_[pci|virtio]_map_queues with
  blk_mq_dev_map_queues
- updated/extended number of queue calc helpers
- added isolcpus=io_queue CPU-hctx mapping function
- documented enum hk_type and isolcpus=io_queue
- added "scsi: pm8001: do not overwrite PCI queue mapping"
- Link to v2:
  https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de

Changes in v2:
- updated documentation
- split blk/nvme-pci patch
- dropped HK_TYPE_IO_QUEUE, use HK_TYPE_MANAGED_IRQ
- Link to v1:
  https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de

---
Daniel Wagner (13):
      scsi: pm8001: do not overwrite PCI queue mapping
      scsi: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
      nvme: replace blk_mq_pci_map_queues with blk_mq_dev_map_queues
      virtio: blk/scsi: replace blk_mq_virtio_map_queues with blk_mq_dev_map_queues
      blk-mq: remove unused queue mapping helpers
      sched/isolation: Add io_queue housekeeping option
      docs: add io_queue as isolcpus options
      blk-mq: add number of queue calc helper
      nvme-pci: use block layer helpers to calculate num of queues
      scsi: use block layer helpers to calculate num of queues
      virtio: blk/scsi: use block layer helpers to calculate num of queues
      lib/group_cpus.c: honor housekeeping config when grouping CPUs
      blk-mq: use hk cpus only when isolcpus=io_queue is enabled

Ming Lei (2):
      virtio: add APIs for retrieving vq affinity
      blk-mq: introduce blk_mq_dev_map_queues

 Documentation/admin-guide/kernel-parameters.txt |   9 ++
 block/blk-mq-cpumap.c                           | 136 ++++++++++++++++++++++++
 block/blk-mq-pci.c                              |  41 ++-----
 block/blk-mq-virtio.c                           |  46 +++-----
 drivers/block/virtio_blk.c                      |   8 +-
 drivers/nvme/host/pci.c                         |   8 +-
 drivers/scsi/fnic/fnic_main.c                   |   3 +-
 drivers/scsi/hisi_sas/hisi_sas.h                |   1 -
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c          |  20 ++--
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c          |   5 +-
 drivers/scsi/megaraid/megaraid_sas_base.c       |  18 ++--
 drivers/scsi/mpi3mr/mpi3mr_os.c                 |   3 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c            |   3 +-
 drivers/scsi/pm8001/pm8001_init.c               |   9 +-
 drivers/scsi/qla2xxx/qla_isr.c                  |  10 +-
 drivers/scsi/qla2xxx/qla_nvme.c                 |   3 +-
 drivers/scsi/qla2xxx/qla_os.c                   |   3 +-
 drivers/scsi/smartpqi/smartpqi_init.c           |  12 +--
 drivers/scsi/virtio_scsi.c                      |   4 +-
 drivers/virtio/virtio.c                         |  10 ++
 include/linux/blk-mq-pci.h                      |   7 +-
 include/linux/blk-mq-virtio.h                   |   8 +-
 include/linux/blk-mq.h                          |   7 ++
 include/linux/sched/isolation.h                 |  15 +++
 include/linux/virtio.h                          |   2 +
 kernel/sched/isolation.c                        |   7 ++
 lib/group_cpus.c                                |  75 ++++++++++++-
 27 files changed, 350 insertions(+), 123 deletions(-)
---
base-commit: f48ada402d2f1e46fa241bcc6725bdde70725e15
change-id: 20240620-isolcpus-io-queues-1a88eb47ff8b

Best regards,