diff mbox series

mpt3sas fails to allocate budget_map and detects no devices

Message ID be78dc2cfeecaafd171060fbebda2d268d2a94e5.camel@suse.com
State New
Headers show
Series mpt3sas fails to allocate budget_map and detects no devices | expand

Commit Message

Martin Wilck Jan. 5, 2022, 6 p.m. UTC
Hello Ming, Sreekanth,

I'm observing a problem where mpt3sas fails to allocate the budget_map
for any SCSI device, because attempted allocation is larger than the
maximum possible. The issue is caused by the logic used in 020b0f0a3192
("scsi: core: Replace sdev->device_busy with sbitmap") 
to calculate the bitmap size. This is observed with 5.16-rc8.

The controller at hand has properties can_queue=29865 and
cmd_per_lun=7. The way these parameters are used in scsi_alloc_sdev()->
sbitmap_init_node(), this results in an sbitmap with 29865 maps, where
only a single bit is used per map. On x86_64, this results in an
attempt to allocate 29865 * 192 =  5734080 bytes for the sbitmap, which
is larger than  PAGE_SIZE * (1 << (MAX_ORDER - 1)), and fails.

cmd_per_lun=7 is an extreme case, because sbitmap_calculate_shift()
returns 0 (i.e. 1 bit per word) for this case. The problem for this
controller is the very large difference between the actual
cmd_per_lun=7 and the theoretical maximum
scsi_device_max_queue_depth(sdev) = host->can_queue = 29865. The
generic sbitmap logic assumes that the initial size is at least rougly
similar to the maximum.

I believe it would make sense to limit scsi_device_max_queue_depth(),
and thus the bitmap size, to BLK_MQ_MAX_DEPTH = 10240. But even then,
we would allocate a huge bitmap for this controller - 2MiB, of which no
more than 10kiB would ever be used (and only 7 bytes (!) in the default
case).

I'm unsure how to fix this correctly. To my understanding, the reason
that the sbitmap is split over multiple cache lines is to avoid ping-
pong between CPUs accessing the bitmap simultaneously. 29865 maps is
quite obviously too much. AFAICS it makes no sense to use more than
nr_cpu_ids separate bitmap chunks, and for a single SCSI device
probably much less (like 8 or 16 chunks) would be sufficient.
That means that we'd have to calculate the "shift" argument to
sbitmap_init_node() differently.

Here's a tentative patch that would do this, please have a look.
With this patch, the driver would use a bitmap with shift 6 for my
problem case./bitma


Regards,
Martin

Below is a dmesg snippet illustrating the problem. 
See "Current Controller Queue Depth" (=can_queue).

[ 2760.377997] mpt2sas_cm0: scatter gather: sge_in_main_msg(1), sge_per_chain(9), sge_per_io(128), chains_per_io(15)
[ 2760.475225] mpt2sas_cm0: request pool(0x000000004d742465) - dma(0x447800000): depth(30127), frame_size(128), pool_size(3765 kB)
[ 2761.787680] mpt2sas_cm0: sense pool(0x0000000069af807c) - dma(0x447400000): depth(29868), element_size(96), pool_size (2800 kB)
[ 2761.898550] mpt2sas_cm0: sense pool(0x0000000069af807c)- dma(0x447400000): depth(29868),element_size(96), pool_size(0 kB)
[ 2762.002396] mpt2sas_cm0: reply pool(0x000000005def89ce) - dma(0x44b400000): depth(30191), frame_size(128), pool_size(3773 kB)
[ 2762.107551] mpt2sas_cm0: config page(0x00000000dc01e404) - dma(0x44d69f000): size(512)
[ 2762.183235] mpt2sas_cm0: Allocated physical memory: size(66932 kB)
[ 2762.242790] mpt2sas_cm0: Current Controller Queue Depth(29865),Max Controller Queue Depth(32455)
[ 2762.326393] mpt2sas_cm0: Scatter Gather Elements per IO(128)
[ 2762.426639] mpt2sas_cm0: LSISAS2116: FWVersion(15.00.00.00), ChipRevision(0x02), BiosVersion(07.29.00.00)
[ 2762.517588] mpt2sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[ 2762.626129] scsi host1: Fusion MPT SAS Host
[ 2762.669095] blk-mq: reduced tag depth to 10240
[ 2762.721135] mpt2sas_cm0: sending port enable !!
[ 2762.768419] mpt2sas_cm0: hba_port entry: 00000000c01410d1, port: 255 is added to hba_port list
[ 2762.858130] mpt2sas_cm0: host_add: handle(0x0001), sas_addr(0x500605b002d80ae0), phys(16)
[ 2762.937991] mpt2sas_cm0: handle(0x11) sas_address(0x500a0b838933700c) port_type(0x4)
[ 2763.012583] mpt2sas_cm0: handle(0x14) sas_address(0x50080e51beede010) port_type(0x4)
[ 2763.086293] mpt2sas_cm0: handle(0x12) sas_address(0x500a0b83763eb00c) port_type(0x4)
[ 2763.161090] mpt2sas_cm0: handle(0x13) sas_address(0x50080e51bf1f0010) port_type(0x4)
[ 2763.243850] mpt2sas_cm0: port enable: SUCCESS
[ 2763.290141] ------------[ cut here ]------------
[ 2763.336935] WARNING: CPU: 15 PID: 9641 at mm/page_alloc.c:5344 __alloc_pages+0x2a1/0x340
[ 2764.707134] CPU: 15 PID: 9641 Comm: kworker/u258:2 Tainted: G          I E     5.16.0-rc8-mw #1 openSUSE Tumbleweed (unreleased) d2750100410fdc3981005f9733dae65153e2c621
[ 2764.707140] Hardware name: HP ProLiant DL560 Gen8, BIOS P77
05/24/2019
[ 2764.707144] Workqueue: events_unbound async_run_entry_fn
[ 2764.707153] RIP: 0010:__alloc_pages+0x2a1/0x340
[ 2764.707155] Code: 3d fb f6 a6 01 01 76 0e 48 83 b8 d0 f9 ff ff 00 0f 84 e2 fe ff ff 80 ce 01 e9 da fe ff ff 81 e7 00 20 00 00 0f 85 79 ff ff ff <0f> 0b e9 72 ff ff ff 48 89 c7 e8 30 c8 fb ff e9 91 fe ff ff 31 c0
[ 2764.707158] RSP: 0018:ffffb5dfc688fa80 EFLAGS: 00010246
[ 2764.707160] RAX: 0000000000000000 RBX: 00000000000074a9 RCX: 0000000000000000
[ 2764.707161] RDX: 0000000000000001 RSI: 000000000000000b RDI: 0000000000000000
[ 2764.707163] RBP: 000000000000000b R08: 0000000000000000 R09: 0000000000000000
[ 2764.707164] R10: 000000000000007f R11: 000000000000000f R12: 0000000000577ec0
[ 2764.707166] R13: 0000000000000cc0 R14: 00000000ffffffff R15: 0000000000000080
[ 2764.707167] FS:  0000000000000000(0000) GS:ffff8954ef7c0000(0000) knlGS:0000000000000000
[ 2764.707169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2764.707170] CR2: 00007fdb29c3ffa0 CR3: 00000002c6a10006 CR4: 00000000000606e0
[ 2764.707172] Call Trace:
[ 2764.707178]  <TASK>
[ 2764.707185]  kmalloc_large_node+0x3a/0xa0
[ 2764.707192]  __kmalloc_node+0x2b7/0x330
[ 2764.707195]  sbitmap_init_node+0x7d/0x1a0
[ 2764.707204]  scsi_alloc_sdev+0x25e/0x330
[ 2764.707210]  scsi_probe_and_add_lun+0x43d/0xdf0
[ 2764.707214]  __scsi_scan_target+0xf9/0x610
...
[ 2766.769381]  </TASK>
[ 2766.769382] ---[ end trace ac1aec9da4689c68 ]---
[ 2766.769387] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured

Comments

Martin Wilck Jan. 6, 2022, 4:19 p.m. UTC | #1
On Thu, 2022-01-06 at 23:41 +0800, Ming Lei wrote:
> On Thu, Jan 06, 2022 at 03:22:53PM +0000, Martin Wilck wrote:
> > > 
> > > I'd suggest to fix mpt3sas for avoiding this memory waste.
> > 
> > Let's wait for Sreekanth's comment on that.
> > 
> > mpt3sas is not the only driver using a low value. Qlogic drivers
> > set
> > cmd_per_lun=3, for example (with 3, our logic would use shift=6, so
> > the
> > issue I observed wouldn't occur - but it would be prone to cache
> > line
> > bouncing).
> 
> But qlogic has smaller .can_queue which looks at most 512, .can_queue
> is
> the depth for allocating sbitmap, since each sdev->queue_depth is <=
> .can_queue.

I'm seeing here (on an old kernel, admittedly) cmd_per_lun=3 and
can_queue=2038 for qla2xxx and cmd_per_lun=3 and can_queue=5884 for
lpfc. Both drivers change the queue depth for devices to 64 in their
slave_configure() methods.

Many drivers do this, as it's recommended in scsi_host.h. That's quite
bad in view of the current bitmap allocation logic - we lay out the
bitmap assuming the depth used will be cmd_per_lun, but that doesn't
match the actual depth when the device comes online. For qla2xxx, it
means that we'd allocate the sbitmap with shift=6 (64 bits per word),
thus using just a single cache line for 64 requests. Shift=4 (16 bits
per word) would be the default shift for depth 64.

Am I misreading the code? Perhaps we should only allocate a preliminary
sbitmap in scsi_alloc_sdev, and reallocate it after slave_configure()
has been called, to get the shift right for the driver's default
settings?

Thanks,
Martin
Martin Wilck Jan. 12, 2022, 4:59 p.m. UTC | #2
On Mon, 2022-01-10 at 10:59 +0800, Ming Lei wrote:
> 
> Hello Martin Wilck,
> 
> Can you test the following change and report back the result?
> 
> From 480a61a85e9669d3487ebee8db3d387df79279fc Mon Sep 17 00:00:00
> 2001
> From: Ming Lei <ming.lei@redhat.com>
> Date: Mon, 10 Jan 2022 10:26:59 +0800
> Subject: [PATCH] scsi: core: reallocate scsi device's budget map if
> default
>  queue depth is changed
> 
> Martin reported that sdev->queue_depth can often be changed in
> ->slave_configure(), and now we uses ->cmd_per_lun as initial queue
> depth for setting up sdev->budget_map. And some extreme ->cmd_per_lun
> or ->can_queue won't be used at default actually, if we they are used
> to allocate sdev->budget_map, huge memory may be consumed just
> because
> of bad ->cmd_per_lun.
> 
> Fix the issue by reallocating sdev->budget_map after -
> >slave_configure()
> returns, at that time, queue_depth should be much more reasonable.
> 
> Reported-by: Martin Wilck <martin.wilck@suse.com>
> Suggested-by: Martin Wilck <martin.wilck@suse.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

This looks good. I added a few pr_infos, and for the strange mpt3sas
devices I reported, I get this:

# first allocation with depth=7 (cmds_per_lun)
Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: 7 0->0 
   (these numbers are: depth old_shift->new_shift)
Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: map_nr = 1024

# after slave_alloc() with depth 254
Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: 254 0->5
Jan 12 17:05:52 localhost kernel: scsi_realloc_sdev_budget_map: map_nr = 32

So the depth changed from 7 to 254, the shift from 0 to 5, and the memory size of the
sbitmap was reduced by a factor of 32. Nice!

Tested-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>


> ---
>  drivers/scsi/scsi_scan.c | 56 ++++++++++++++++++++++++++++++++++++--
> --
>  1 file changed, 51 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 23e1c0acdeae..9593c9111611 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -214,6 +214,48 @@ static void scsi_unlock_floptical(struct
> scsi_device *sdev,
>                          SCSI_TIMEOUT, 3, NULL);
>  }
>  
> +static int scsi_realloc_sdev_budget_map(struct scsi_device *sdev,
> +                                       unsigned int depth)
> +{
> +       int new_shift = sbitmap_calculate_shift(depth);
> +       bool need_alloc = !sdev->budget_map.map;
> +       bool need_free = false;
> +       int ret;
> +       struct sbitmap sb_back;
> +
> +       /*
> +        * realloc if new shift is calculated, which is caused by
> setting
> +        * up one new default queue depth after calling -
> >slave_configure
> +        */
> +       if (!need_alloc && new_shift != sdev->budget_map.shift)
> +               need_alloc = need_free = true;
> +
> +       if (!need_alloc)
> +               return 0;
> +
> +       /*
> +        * Request queue has to be freezed for reallocating budget
> map,
> +        * and here disk isn't added yet, so freezing is pretty fast
> +        */
> +       if (need_free) {
> +               blk_mq_freeze_queue(sdev->request_queue);
> +               sb_back = sdev->budget_map;
> +       }
> +       ret = sbitmap_init_node(&sdev->budget_map,
> +                               scsi_device_max_queue_depth(sdev),
> +                               new_shift, GFP_KERNEL,
> +                               sdev->request_queue->node, false,
> true);
> +       if (need_free) {
> +               if (ret)
> +                       sdev->budget_map = sb_back;
> +               else
> +                       sbitmap_free(&sb_back);
> +               ret = 0;
> +               blk_mq_unfreeze_queue(sdev->request_queue);
> +       }
> +       return ret;
> +}
> +
>  /**
>   * scsi_alloc_sdev - allocate and setup a scsi_Device
>   * @starget: which target to allocate a &scsi_device for
> @@ -306,11 +348,7 @@ static struct scsi_device
> *scsi_alloc_sdev(struct scsi_target *starget,
>          * default device queue depth to figure out sbitmap shift
>          * since we use this queue depth most of times.
>          */
> -       if (sbitmap_init_node(&sdev->budget_map,
> -                               scsi_device_max_queue_depth(sdev),
> -                               sbitmap_calculate_shift(depth),
> -                               GFP_KERNEL, sdev->request_queue-
> >node,
> -                               false, true)) {
> +       if (scsi_realloc_sdev_budget_map(sdev, depth)) {
>                 put_device(&starget->dev);
>                 kfree(sdev);
>                 goto out;
> @@ -1017,6 +1055,14 @@ static int scsi_add_lun(struct scsi_device
> *sdev, unsigned char *inq_result,
>                         }
>                         return SCSI_SCAN_NO_RESPONSE;
>                 }
> +
> +               /*
> +                * queue_depth is often changed in ->slave_configure,
> so
> +                * setup budget map again for getting better memory
> uses
> +                * since memory consumption of the map depends on
> queue
> +                * depth heavily
> +                */
> +               scsi_realloc_sdev_budget_map(sdev, sdev-
> >queue_depth);
>         }
>  
>         if (sdev->scsi_level >= SCSI_3)
> -- 
> 2.31.1
> 
> 
>
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 23e1c0acdeae..3e51f8183b42 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -231,7 +231,8 @@  static void scsi_unlock_floptical(struct scsi_device *sdev,
 static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 					   u64 lun, void *hostdata)
 {
-	unsigned int depth;
+	static const unsigned int BITMAP_CHUNKS = 16;
+	unsigned int depth, sb_map_chunks, sb_depth;
 	struct scsi_device *sdev;
 	struct request_queue *q;
 	int display_failure_msg = 1, ret;
@@ -298,17 +299,26 @@  static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	__scsi_init_queue(sdev->host, q);
 	WARN_ON_ONCE(!blk_get_queue(q));
 
-	depth = sdev->host->cmd_per_lun ?: 1;
-
 	/*
 	 * Use .can_queue as budget map's depth because we have to
 	 * support adjusting queue depth from sysfs. Meantime use
 	 * default device queue depth to figure out sbitmap shift
 	 * since we use this queue depth most of times.
+	 * Avoid extremely small shift if cmd_per_lun is small but
+	 * greater than 3 (see sbitmap_calculate_shift()).
+	 * Assume that usually no more than BITMAP_CHUNKS
+	 * CPUs will access this bitmap simultaneously.
 	 */
+	depth = sdev->host->cmd_per_lun ?: 1;
+	sb_map_chunks = max_t(unsigned int, 1U,
+			      min_t(unsigned int, nr_cpu_ids, BITMAP_CHUNKS));
+	sb_depth = max_t(unsigned int,
+			 scsi_device_max_queue_depth(sdev) / sb_map_chunks,
+			 depth);
+
 	if (sbitmap_init_node(&sdev->budget_map,
 				scsi_device_max_queue_depth(sdev),
-				sbitmap_calculate_shift(depth),
+				sbitmap_calculate_shift(sb_depth),
 				GFP_KERNEL, sdev->request_queue->node,
 				false, true)) {
 		put_device(&starget->dev);