[v6,0/9] blk: honor isolcpus configuration

Message ID 20250424-isolcpus-io-queues-v6-0-9a53a870ca1f@kernel.org

Message

Daniel Wagner April 24, 2025, 6:19 p.m. UTC
I've added back the isolcpus io_queue argument. This avoids any semantic
change to managed_irq. I don't like it, but I haven't found a better way
to deal with it; Ming clearly stated that managed_irq should not change.

Another change is to prevent offlining a housekeeping CPU that is still
serving an isolated CPU, instead of just warning. That seems a much saner
way to handle this situation. Thanks, Mathieu!
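To illustrate the idea, here is a minimal sketch of such a check, assuming
it sits in the blk-mq CPU hotplug offline path and uses the HK_TYPE_IO_QUEUE
housekeeping type added by this series; the actual patch may look different:

#include <linux/blk-mq.h>
#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

/*
 * Sketch only, not the exact patch: would taking @cpu offline leave an
 * online isolated CPU mapped to @hctx without any housekeeping CPU left
 * to serve its IO?
 */
static bool blk_mq_last_hk_cpu(struct blk_mq_hw_ctx *hctx, unsigned int cpu)
{
	const struct cpumask *hk = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
	unsigned int sibling;
	bool isol_online = false;

	if (!housekeeping_enabled(HK_TYPE_IO_QUEUE) ||
	    !cpumask_test_cpu(cpu, hk))
		return false;

	for_each_cpu_and(sibling, hctx->cpumask, cpu_online_mask) {
		if (sibling == cpu)
			continue;
		if (cpumask_test_cpu(sibling, hk))
			return false;	/* another housekeeping CPU remains */
		isol_online = true;	/* isolated CPU still online on this hctx */
	}

	return isol_online;
}

The offline callback would then fail the hotplug operation (instead of only
warning) whenever this returns true, so the housekeeping CPU stays online.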

Here are the details of how managed_irq and io_queue differ:

* nr CPUs <= nr hardware queues

(e.g. 8 CPUs, 8 hardware queues)

managed_irq works nicely for situations where the hardware has at least
as many hardware queues as CPUs, e.g. enterprise nvme-pci devices.

managed_irq assigns each CPU its own hardware queue and ensures that no
unbound IO is scheduled to an isolated CPU. As long as the isolated CPU
does not issue any IO, there will be no block layer 'noise' on the
isolated CPU.

  - irqaffinity=0 isolcpus=managed_irq,2-3,6-7

	queue mapping for /dev/nvme0n1
	        hctx0: default 0
	        hctx1: default 1
	        hctx2: default 2
	        hctx3: default 3
	        hctx4: default 4
	        hctx5: default 5
	        hctx6: default 6
	        hctx7: default 7

	IRQ mapping for nvme0n1
	        irq 40 affinity 0 effective 0  nvme0q0
	        irq 41 affinity 0 effective 0  nvme0q1
	        irq 42 affinity 1 effective 1  nvme0q2
	        irq 43 affinity 2 effective 2  nvme0q3
	        irq 44 affinity 3 effective 3  nvme0q4
	        irq 45 affinity 4 effective 4  nvme0q5
	        irq 46 affinity 5 effective 5  nvme0q6
	        irq 47 affinity 6 effective 6  nvme0q7
	        irq 48 affinity 7 effective 7  nvme0q8

With this configuration io_queue will create four hctxs for the four
housekeeping CPUs:

  - irqaffinity=0 isolcpus=io_queue,2-3,6-7

	queue mapping for /dev/nvme0n1
	        hctx0: default 0 2
	        hctx1: default 1 3
	        hctx2: default 4 6
	        hctx3: default 5 7

	IRQ mapping for /dev/nvme0n1
	        irq 36 affinity 0 effective 0  nvme0q0
	        irq 37 affinity 0 effective 0  nvme0q1
	        irq 38 affinity 1 effective 1  nvme0q2
	        irq 39 affinity 4 effective 4  nvme0q3
	        irq 40 affinity 5 effective 5  nvme0q4

* nr CPUs > nr hardware queues

(e.g. 8 CPUs, 2 hardware queues)

managed_irq creates two hctxs and all CPUs can handle IRQs. In this
case an isolated CPU ends up handling all the IRQs of a given hctx:

  - irqaffinity=0 isolcpus=managed_irq,2-3,6-7

	queue mapping for /dev/nvme0n1
	        hctx0: default 0 1 2 3
	        hctx1: default 4 5 6 7

	IRQ mapping for /dev/nvme0n1
	        irq 40 affinity 0 effective 0  nvme0q0
	        irq 41 affinity 0-3 effective 3  nvme0q1
	        irq 42 affinity 4-7 effective 7  nvme0q2

io_queue also creates two hctxs but assigns only housekeeping CPUs to
handle the IRQs:

  - irqaffinity=0 isolcpus=io_queue,2-3,6-7

	queue mapping for /dev/nvme0n1
	        hctx0: default 0 1 2 6
	        hctx1: default 3 4 5 7

	IRQ mapping for /dev/nvme0n1
	        irq 36 affinity 0 effective 0  nvme0q0
	        irq 37 affinity 0-1 effective 1  nvme0q1
	        irq 38 affinity 4-5 effective 5  nvme0q2

The case where there are fewer hardware queues than CPUs is more common
with SCSI HBAs, so the io_queue approach supports not just nvme-pci.
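
Condensed to its core, the io_queue mapping boils down to something like
the sketch below. This is not the actual patch code (the series spreads
the CPUs via group_cpus_evenly(), and the helper name here is made up);
it only shows the idea of spreading the housekeeping CPUs over the queues
and parking each isolated CPU on a housekeeping CPU's hctx:

/*
 * Heavily simplified sketch of the io_queue mapping idea, not the
 * actual patch code.
 */
static void blk_mq_map_hk_queues_sketch(struct blk_mq_queue_map *qmap)
{
	const struct cpumask *hk = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
	unsigned int cpu, queue = 0;
	unsigned int hk_cpu = cpumask_first(hk);

	/* round-robin the housekeeping CPUs over the available queues */
	for_each_cpu(cpu, hk)
		qmap->mq_map[cpu] = qmap->queue_offset +
				    (queue++ % qmap->nr_queues);

	/* park every isolated CPU on the queue of some housekeeping CPU */
	for_each_cpu_andnot(cpu, cpu_possible_mask, hk) {
		qmap->mq_map[cpu] = qmap->mq_map[hk_cpu];
		hk_cpu = cpumask_next(hk_cpu, hk);
		if (hk_cpu >= nr_cpu_ids)
			hk_cpu = cpumask_first(hk);
	}
}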

Something completely different: we got several bug reports for kdump and
SCSI HBAs. The issue is that the SCSI drivers allocate too many resources
when running in a kdump kernel. This series fixes this as well, because
the number of queues is limited by blk_mq_num_possible_queues() instead
of num_possible_cpus(). This avoids sprinkling is_kdump_kernel() checks
around.
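
For a driver, the conversion looks roughly like the sketch below. The
foo_* names are made up for illustration, and the helper's exact
signature is an assumption based on this series:

/* Hypothetical driver, for illustration only. */
struct foo_ctrl {
	unsigned int max_hw_queues;	/* queues the hardware can provide */
};

static unsigned int foo_nr_io_queues(struct foo_ctrl *ctrl)
{
	/*
	 * Before: min_t(unsigned int, ctrl->max_hw_queues,
	 * num_possible_cpus()), often with an extra is_kdump_kernel()
	 * special case. With the new helper the block layer caps the
	 * count; in a kdump kernel only one CPU is possible, so this
	 * collapses to a single queue.
	 */
	return blk_mq_num_possible_queues(ctrl->max_hw_queues);
}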

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
Changes in v6:
- added io_queue isolcpus type back
- prevent offlining a housekeeping CPU if an isolated CPU is still present, instead of just warning
- Link to v5: https://lore.kernel.org/r/20250110-isolcpus-io-queues-v5-0-0e4f118680b0@kernel.org

Changes in v5:
- rebased on latest for-6.14/block
- updated documentation on managed_irq
- updated commit message "blk-mq: issue warning when offlining hctx with online isolcpus"
- split input/output parameter in "lib/group_cpus: let group_cpu_evenly return number of groups"
- dropped "sched/isolation: document HK_TYPE housekeeping option"
- Link to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@kernel.org

Changes in v4:
- added "blk-mq: issue warning when offlining hctx with online isolcpus"
- fixed check in group_cpus_evenly, the if condition needs to use
  housekeeping_enabled() and not cpumask_weight(housekeeping_masks),
  because the latter will always return a valid mask.
- dropped the Fixes tag from "lib/group_cpus.c: honor housekeeping config when
  grouping CPUs"
- fixed overlong line "scsi: use block layer helpers to calculate num
  of queues"
- dropped "sched/isolation: Add io_queue housekeeping option",
  just documented the housekeeping enum hk_type
- added "lib/group_cpus: let group_cpu_evenly return number of groups"
- collected tags
- split the series into a preparation series:
  https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@kernel.org/
- Link to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@suse.de

Changes in v3:
- lifted a couple of patches from
  https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@redhat.com/
  "virito: add APIs for retrieving vq affinity"
  "blk-mq: introduce blk_mq_dev_map_queues"
- replaces all users of blk_mq_[pci|virtio]_map_queues with
  blk_mq_dev_map_queues
- updated/extended number of queue calc helpers
- add isolcpus=io_queue CPU-hctx mapping function
- documented enum hk_type and isolcpus=io_queue
- added "scsi: pm8001: do not overwrite PCI queue mapping"
- Link to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@suse.de

Changes in v2:
- updated documentation
- split the blk/nvme-pci patch
- dropped HK_TYPE_IO_QUEUE, use HK_TYPE_MANAGED_IRQ
- Link to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@suse.de

---
Daniel Wagner (9):
      lib/group_cpus: let group_cpu_evenly return number initialized masks
      blk-mq: add number of queue calc helper
      nvme-pci: use block layer helpers to calculate num of queues
      scsi: use block layer helpers to calculate num of queues
      virtio: blk/scsi: use block layer helpers to calculate num of queues
      isolation: introduce io_queue isolcpus type
      lib/group_cpus: honor housekeeping config when grouping CPUs
      blk-mq: use hk cpus only when isolcpus=io_queue is enabled
      blk-mq: prevent offlining hk CPU with associated online isolated CPUs

 block/blk-mq-cpumap.c                     | 116 +++++++++++++++++++++++++++++-
 block/blk-mq.c                            |  46 +++++++++++-
 drivers/block/virtio_blk.c                |   5 +-
 drivers/nvme/host/pci.c                   |   5 +-
 drivers/scsi/megaraid/megaraid_sas_base.c |  15 ++--
 drivers/scsi/qla2xxx/qla_isr.c            |  10 +--
 drivers/scsi/smartpqi/smartpqi_init.c     |   5 +-
 drivers/scsi/virtio_scsi.c                |   1 +
 drivers/virtio/virtio_vdpa.c              |   9 +--
 fs/fuse/virtio_fs.c                       |   6 +-
 include/linux/blk-mq.h                    |   2 +
 include/linux/group_cpus.h                |   3 +-
 include/linux/sched/isolation.h           |   1 +
 kernel/irq/affinity.c                     |   9 +--
 kernel/sched/isolation.c                  |   7 ++
 lib/group_cpus.c                          |  90 +++++++++++++++++++++--
 16 files changed, 290 insertions(+), 40 deletions(-)
---
base-commit: 3b607b75a345b1d808031bf1bb1038e4dac8d521
change-id: 20240620-isolcpus-io-queues-1a88eb47ff8b

Best regards,

Comments

Ming Lei May 6, 2025, 3:17 a.m. UTC | #1
On Thu, Apr 24, 2025 at 08:19:39PM +0200, Daniel Wagner wrote:
> I've added back the isolcpus io_queue argument. This avoids any semantic
> change to managed_irq.

IMO, this is correct thing to do.

> I don't like it but I haven't found a
> better way to deal with it. Ming clearly stated managed_irq should not
> change.

Precisely, we can't cause IO hangs and break existing managed_irq
applications, especially since, as you know, there is no kernel solution
for it; the same applies to v5, v6 or whatever.

I will look at v6 this week.

Thanks, 
Ming