[v8,17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

Message ID 1597850436-116171-18-git-send-email-john.garry@huawei.com
State New
Series blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

Commit Message

John Garry Aug. 19, 2020, 3:20 p.m. UTC
From: Kashyap Desai <kashyap.desai@broadcom.com>


Fusion adapters can steer completions to individual queues, and
we now have support for shared host-wide tags, so we can enable
multiqueue support for fusion adapters.

Once the driver enables shared host-wide tags, CPU hotplug is also
supported, as it was enabled by the patchset containing
commit bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are
offline").

Currently the driver has a provision to disable host-wide tags using
the "host_tagset_enable" module parameter.

Once we have confirmed that there is no major performance regression
with host-wide tags, we will drop the hand-crafted interrupt affinity
settings.

Performance also meets expectations (tested with both the none and
mq-deadline schedulers):
- 24 SSD drives on Aero reach 3.1M IOPS with and without this patch.
- 3 VDs consisting of 8 SAS SSDs on Aero reach 3.1M IOPS with and
  without this patch.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: John Garry <john.garry@huawei.com>

---
 drivers/scsi/megaraid/megaraid_sas_base.c   | 39 +++++++++++++++++++++
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 29 ++++++++-------
 2 files changed, 55 insertions(+), 13 deletions(-)

-- 
2.26.2
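
The queue-count decision described in the commit message can be modelled
in user space roughly as follows. This is only a sketch that mirrors the
condition added to megasas_io_attach() in this patch; the struct, the
enum and the function name are hypothetical stand-ins, not driver code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's adapter types. */
enum adapter_type { MFI_SERIES, FUSION_SERIES };

/* Hypothetical container for the inputs the patch looks at. */
struct hba_config {
	enum adapter_type type;
	unsigned int msix_vectors;            /* vectors granted to the driver */
	unsigned int low_latency_index_start; /* first low-latency reply queue */
	bool host_tagset_enable;              /* module parameter, default on */
	bool smp_affinity_enable;             /* managed interrupts in use */
};

/*
 * Number of hardware queues to expose to blk-mq.
 * A result of 1 means the shared host tagset stays disabled.
 */
static unsigned int nr_hw_queues(const struct hba_config *c)
{
	if (c->type != MFI_SERIES &&
	    c->msix_vectors > c->low_latency_index_start &&
	    c->host_tagset_enable &&
	    c->smp_affinity_enable)
		return c->msix_vectors - c->low_latency_index_start;

	return 1;
}
```

With 49 MSI-x vectors and a latency index of 1, the values printed in the
boot log in comment #1, this yields 48 queues, which matches the
"nr_hw_queues = 48" line in that log. Clearing host_tagset_enable makes
the same input fall back to a single queue.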

Comments

Qian Cai Nov. 2, 2020, 2:17 p.m. UTC | #1
On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> From: Kashyap Desai <kashyap.desai@broadcom.com>
> 
> Fusion adapters can steer completions to individual queues, and
> we now have support for shared host-wide tags.
> So we can enable multiqueue support for fusion adapters.
> 
> Once driver enable shared host-wide tags, cpu hotplug feature is also
> supported as it was enabled using below patchsets -
> commit bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are
> offline")
> 
> Currently driver has provision to disable host-wide tags using
> "host_tagset_enable" module parameter.
> 
> Once we do not have any major performance regression using host-wide
> tags, we will drop the hand-crafted interrupt affinity settings.
> 
> Performance is also meeting the expectation - (used both none and
> mq-deadline scheduler)
> 24 Drive SSD on Aero with/without this patch can get 3.1M IOPs
> 3 VDs consist of 8 SAS SSD on Aero with/without this patch can get 3.1M
> IOPs.
> 
> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: John Garry <john.garry@huawei.com>

Reverting this commit fixes a boot failure on a Dell PowerEdge R6415 server
with megaraid_sas.

c1:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
	DeviceName: Integrated RAID
	Subsystem: Dell PERC H730P Mini
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 48
	NUMA node: 3
	Region 0: I/O ports at c000 [size=256]
	Region 1: Memory at a5500000 (64-bit, non-prefetchable) [size=64K]
	Region 3: Memory at a5400000 (64-bit, non-prefetchable) [size=1M]
	Expansion ROM at <ignored> [disabled]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <2us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x8 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, NROPrPrP-, LTR-
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [c0] MSI-X: Enable+ Count=97 Masked-
		Vector table: BAR=1 offset=0000e000
		PBA: BAR=1 offset=0000f000
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 04000001 c000000f c1080000 4ba9007a
	Capabilities: [1e0 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [1c0 v1] Power Budgeting <?>
	Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Kernel driver in use: megaraid_sas
	Kernel modules: megaraid_sas

[   26.330282][  T567] megasas: 07.714.04.00-rc1
[   26.355663][  T611] ahci 0000:87:00.2: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[   26.364585][  T611] ahci 0000:87:00.2: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part 
[   26.376125][  T289] megaraid_sas 0000:c1:00.0: FW now in Ready state
[   26.382534][  T289] megaraid_sas 0000:c1:00.0: 63 bit DMA mask and 32 bit consistent mask
[   26.391537][  T289] megaraid_sas 0000:c1:00.0: firmware supports msix	: (96)
[   26.431767][  T611] scsi host1: ahci
[   26.492580][  T611] ata1: SATA max UDMA/133 abar m4096@0xc0a02000 port 0xc0a02100 irq 60
[   26.701197][  T283] bnxt_en 0000:84:00.0 eth0: Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet found at mem ad210000, node addr 4c:d9:8f:4a:20:e6
[   26.714352][  T283] bnxt_en 0000:84:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[   26.743738][   T24] tg3 0000:81:00.0 eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 4c:d9:8f:65:3f:32
[   26.754974][   T24] tg3 0000:81:00.0 eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[   26.765523][   T24] tg3 0000:81:00.0 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[   26.774074][   T24] tg3 0000:81:00.0 eth1: dma_rwctrl[00000001] dma_mask[64-bit]
[   26.842518][  T620] ata1: SATA link down (SStatus 0 SControl 300)
[   26.945741][  T289] megaraid_sas 0000:c1:00.0: requested/available msix 49/49
[   26.952912][  T289] megaraid_sas 0000:c1:00.0: current msix/online cpus	: (49/48)
[   26.960401][  T289] megaraid_sas 0000:c1:00.0: RDPQ mode	: (disabled)
[   26.966876][  T289] megaraid_sas 0000:c1:00.0: Current firmware supports maximum commands: 928	 LDIO threshold: 0
[   27.079361][  T289] megaraid_sas 0000:c1:00.0: Performance mode :Latency (latency index = 1)
[   27.085381][  T283] bnxt_en 0000:84:00.1 eth2: Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet found at mem ad200000, node addr 4c:d9:8f:4a:20:e7
[   27.087824][  T289] megaraid_sas 0000:c1:00.0: FW supports sync cache	: No
[   27.100959][  T283] bnxt_en 0000:84:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[   27.107835][  T289] megaraid_sas 0000:c1:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[   27.130978][   T24] tg3 0000:81:00.1 eth3: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 4c:d9:8f:65:3f:33
[   27.142919][   T24] tg3 0000:81:00.1 eth3: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[   27.146042][  T571] bnxt_en 0000:84:00.1 enp132s0f1np1: renamed from eth2
[   27.153456][   T24] tg3 0000:81:00.1 eth3: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[   27.153467][   T24] tg3 0000:81:00.1 eth3: dma_rwctrl[00000001] dma_mask[64-bit]
[   27.200900][  T289] megaraid_sas 0000:c1:00.0: FW provided supportMaxExtLDs: 1	max_lds: 64
[   27.209174][  T289] megaraid_sas 0000:c1:00.0: controller type	: MR(2048MB)
[   27.216260][  T289] megaraid_sas 0000:c1:00.0: Online Controller Reset(OCR)	: Enabled
[   27.224105][  T289] megaraid_sas 0000:c1:00.0: Secure JBOD support	: No
[   27.230720][  T289] megaraid_sas 0000:c1:00.0: NVMe passthru support	: No
[   27.237527][  T289] megaraid_sas 0000:c1:00.0: FW provided TM TaskAbort/Reset timeout	: 0 secs/0 secs
[   27.246754][  T289] megaraid_sas 0000:c1:00.0: JBOD sequence map support	: No
[   27.253906][  T289] megaraid_sas 0000:c1:00.0: PCI Lane Margining support	: No
[   27.341447][  T289] megaraid_sas 0000:c1:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[   27.351729][  T289] megaraid_sas 0000:c1:00.0: INIT adapter done
[   27.357742][  T289] megaraid_sas 0000:c1:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5709
[   27.367832][  T289] megaraid_sas 0000:c1:00.0: pci id		: (0x1000)/(0x005d)/(0x1028)/(0x1f47)
[   27.376287][  T289] megaraid_sas 0000:c1:00.0: unevenspan support	: yes
[   27.382925][  T289] megaraid_sas 0000:c1:00.0: firmware crash dump	: no
[   27.389547][  T289] megaraid_sas 0000:c1:00.0: JBOD sequence map	: disabled
[   27.397816][  T289] megaraid_sas 0000:c1:00.0: Max firmware commands: 927 shared with nr_hw_queues = 48
[   27.407232][  T289] scsi host0: Avago SAS based MegaRAID driver
[   27.430212][  T586] bnxt_en 0000:84:00.0 enp132s0f0np0: renamed from eth0
[   27.781038][  T603] tg3 0000:81:00.0 eno1: renamed from eth1
[   28.194046][  T552] tg3 0000:81:00.1 eno2: renamed from eth3

[  251.961152][  T330] INFO: task systemd-udevd:567 blocked for more than 122 seconds.
[  251.968876][  T330]       Not tainted 5.10.0-rc1-next-20201102 #1
[  251.975003][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  251.983546][  T330] task:systemd-udevd   state:D stack:27224 pid:  567 ppid:   506 flags:0x00004324
[  251.992620][  T330] Call Trace:
[  251.995784][  T330]  __schedule+0x71d/0x1b60
[  252.000067][  T330]  ? __sched_text_start+0x8/0x8
[  252.004798][  T330]  schedule+0xbf/0x270
[  252.008735][  T330]  schedule_timeout+0x3fc/0x590
[  252.013464][  T330]  ? usleep_range+0x120/0x120
[  252.018008][  T330]  ? wait_for_completion+0x156/0x250
[  252.023176][  T330]  ? lock_downgrade+0x700/0x700
[  252.027886][  T330]  ? rcu_read_unlock+0x40/0x40
[  252.032530][  T330]  ? do_raw_spin_lock+0x121/0x290
[  252.037412][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[  252.043268][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
[  252.048331][  T330]  wait_for_completion+0x15e/0x250
[  252.053323][  T330]  ? wait_for_completion_interruptible+0x320/0x320
[  252.059687][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[  252.065543][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
[  252.070606][  T330]  __flush_work+0x42a/0x900
[  252.074989][  T330]  ? queue_delayed_work_on+0x90/0x90
[  252.080139][  T330]  ? __queue_work+0x463/0xf40
[  252.084700][  T330]  ? init_pwq+0x320/0x320
[  252.088891][  T330]  ? queue_work_on+0x5e/0x80
[  252.093364][  T330]  ? trace_hardirqs_on+0x1c/0x150
[  252.098255][  T330]  work_on_cpu+0xe7/0x130
[  252.102461][  T330]  ? flush_delayed_work+0xc0/0xc0
[  252.107342][  T330]  ? __mutex_unlock_slowpath+0xd4/0x670
[  252.112764][  T330]  ? work_debug_hint+0x30/0x30
[  252.117391][  T330]  ? pci_device_shutdown+0x80/0x80
[  252.122378][  T330]  ? cpumask_next_and+0x57/0x80
[  252.127094][  T330]  pci_device_probe+0x500/0x5c0
[  252.131824][  T330]  ? pci_device_remove+0x1f0/0x1f0
[  252.136805][  T330]  really_probe+0x207/0xad0
[  252.141191][  T330]  ? device_driver_attach+0x120/0x120
[  252.146428][  T330]  driver_probe_device+0x1f1/0x370
[  252.151424][  T330]  device_driver_attach+0xe5/0x120
[  252.156399][  T330]  __driver_attach+0xf0/0x260
[  252.160953][  T330]  bus_for_each_dev+0x117/0x1a0
[  252.165669][  T330]  ? subsys_dev_iter_exit+0x10/0x10
[  252.170731][  T330]  bus_add_driver+0x399/0x560
[  252.175289][  T330]  driver_register+0x189/0x310
[  252.179919][  T330]  ? 0xffffffffc05c1000
[  252.183960][  T330]  megasas_init+0x117/0x1000 [megaraid_sas]
[  252.189713][  T330]  do_one_initcall+0xf6/0x510
[  252.194267][  T330]  ? perf_trace_initcall_level+0x490/0x490
[  252.199940][  T330]  ? kasan_unpoison_shadow+0x30/0x40
[  252.205104][  T330]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
[  252.210859][  T330]  ? do_init_module+0x49/0x6c0
[  252.215500][  T330]  ? kmem_cache_alloc_trace+0x11f/0x1e0
[  252.220925][  T330]  ? kasan_unpoison_shadow+0x30/0x40
[  252.226068][  T330]  do_init_module+0x1ed/0x6c0
[  252.230608][  T330]  load_module+0x4a59/0x5d20
[  252.235081][  T330]  ? layout_and_allocate+0x2770/0x2770
[  252.240404][  T330]  ? __vmalloc_node+0x8d/0x100
[  252.245046][  T330]  ? kernel_read_file+0x485/0x5a0
[  252.249934][  T330]  ? kernel_read_file+0x305/0x5a0
[  252.254839][  T330]  ? __x64_sys_fsconfig+0x970/0x970
[  252.259903][  T330]  ? __do_sys_finit_module+0xff/0x180
[  252.265153][  T330]  __do_sys_finit_module+0xff/0x180
[  252.270216][  T330]  ? __do_sys_init_module+0x1d0/0x1d0
[  252.275465][  T330]  ? __fget_files+0x1c3/0x2e0
[  252.280010][  T330]  do_syscall_64+0x33/0x40
[  252.284304][  T330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  252.290054][  T330] RIP: 0033:0x7fbb3e2fa78d
[  252.294348][  T330] Code: Unable to access opcode bytes at RIP 0x7fbb3e2fa763.
[  252.301584][  T330] RSP: 002b:00007ffe572e8d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  252.309855][  T330] RAX: ffffffffffffffda RBX: 000055c7795d90f0 RCX: 00007fbb3e2fa78d
[  252.317703][  T330] RDX: 0000000000000000 RSI: 00007fbb3ee6c82d RDI: 0000000000000006
[  252.325553][  T330] RBP: 00007fbb3ee6c82d R08: 0000000000000000 R09: 00007ffe572e8e40
[  252.333402][  T330] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
[  252.341257][  T330] R13: 000055c7795930e0 R14: 0000000000020000 R15: 0000000000000000
[  252.349117][  T330] 
[  252.349117][  T330] Showing all locks held in the system:
[  252.356770][  T330] 3 locks held by kworker/3:1/289:
[  252.361759][  T330]  #0: ffff8881001eb938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x7ec/0x1610
[  252.371976][  T330]  #1: ffffc90004ee7e00 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x820/0x1610
[  252.382803][  T330]  #2: ffff8881430380e0 (&shost->scan_mutex){+.+.}-{3:3}, at: scsi_scan_host_selected+0xde/0x260
[  252.393199][  T330] 1 lock held by khungtaskd/330:
[  252.397993][  T330]  #0: ffffffff9d4d3760 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.52+0x0/0x30
[  252.408296][  T330] 1 lock held by systemd-journal/420:
[  252.413562][  T330] 1 lock held by systemd-udevd/567:
[  252.418619][  T330]  #0: ffff8881207ac218 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0x37/0x120
[  252.428159][  T330] 
[  252.430355][  T330] =============================================
[  252.430355][  T330] 

> ---
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 39 +++++++++++++++++++++
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 29 ++++++++-------
>  2 files changed, 55 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
> index 861f7140f52e..6960922d0d7f 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -37,6 +37,7 @@
>  #include <linux/poll.h>
>  #include <linux/vmalloc.h>
>  #include <linux/irq_poll.h>
> +#include <linux/blk-mq-pci.h>
>  
>  #include <scsi/scsi.h>
>  #include <scsi/scsi_cmnd.h>
> @@ -113,6 +114,10 @@ unsigned int enable_sdev_max_qd;
>  module_param(enable_sdev_max_qd, int, 0444);
>  MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as can_queue. Default: 0");
>  
> +int host_tagset_enable = 1;
> +module_param(host_tagset_enable, int, 0444);
> +MODULE_PARM_DESC(host_tagset_enable, "Shared host tagset enable/disable Default: enable(1)");
> +
>  MODULE_LICENSE("GPL");
>  MODULE_VERSION(MEGASAS_VERSION);
>  MODULE_AUTHOR("megaraidlinux.pdl@broadcom.com");
> @@ -3119,6 +3124,19 @@ megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev,
>  	return 0;
>  }
>  
> +static int megasas_map_queues(struct Scsi_Host *shost)
> +{
> +	struct megasas_instance *instance;
> +
> +	instance = (struct megasas_instance *)shost->hostdata;
> +
> +	if (shost->nr_hw_queues == 1)
> +		return 0;
> +
> +	return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
> +			instance->pdev, instance->low_latency_index_start);
> +}
> +
>  static void megasas_aen_polling(struct work_struct *work);
>  
>  /**
> @@ -3427,6 +3445,7 @@ static struct scsi_host_template megasas_template = {
>  	.eh_timed_out = megasas_reset_timer,
>  	.shost_attrs = megaraid_host_attrs,
>  	.bios_param = megasas_bios_param,
> +	.map_queues = megasas_map_queues,
>  	.change_queue_depth = scsi_change_queue_depth,
>  	.max_segment_size = 0xffffffff,
>  };
> @@ -6808,6 +6827,26 @@ static int megasas_io_attach(struct megasas_instance *instance)
>  	host->max_lun = MEGASAS_MAX_LUN;
>  	host->max_cmd_len = 16;
>  
> +	/* Use shared host tagset only for fusion adaptors
> +	 * if there are managed interrupts (smp affinity enabled case).
> +	 * Single msix_vectors in kdump, so shared host tag is also disabled.
> +	 */
> +
> +	host->host_tagset = 0;
> +	host->nr_hw_queues = 1;
> +
> +	if ((instance->adapter_type != MFI_SERIES) &&
> +		(instance->msix_vectors > instance->low_latency_index_start) &&
> +		host_tagset_enable &&
> +		instance->smp_affinity_enable) {
> +		host->host_tagset = 1;
> +		host->nr_hw_queues = instance->msix_vectors -
> +			instance->low_latency_index_start;
> +	}
> +
> +	dev_info(&instance->pdev->dev,
> +		"Max firmware commands: %d shared with nr_hw_queues = %d\n",
> +		instance->max_fw_cmds, host->nr_hw_queues);
>  	/*
>  	 * Notify the mid-layer about the new controller
>  	 */
> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index 0824410f78f8..a4251121f173 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -359,24 +359,29 @@ megasas_get_msix_index(struct megasas_instance *instance,
>  {
>  	int sdev_busy;
>  
> -	/* nr_hw_queue = 1 for MegaRAID */
> -	struct blk_mq_hw_ctx *hctx =
> -		scmd->device->request_queue->queue_hw_ctx[0];
> -
> -	sdev_busy = atomic_read(&hctx->nr_active);
> +	/* TBD - if sml remove device_busy in future, driver
> +	 * should track counter in internal structure.
> +	 */
> +	sdev_busy = atomic_read(&scmd->device->device_busy);
>  
>  	if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
> -	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
> +	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
>  		cmd->request_desc->SCSIIO.MSIxIndex =
>  			mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
>  					MR_HIGH_IOPS_BATCH_COUNT), instance->low_latency_index_start);
> -	else if (instance->msix_load_balance)
> +	} else if (instance->msix_load_balance) {
>  		cmd->request_desc->SCSIIO.MSIxIndex =
>  			(mega_mod64(atomic64_add_return(1, &instance->total_io_count),
>  				instance->msix_vectors));
> -	else
> +	} else if (instance->host->nr_hw_queues > 1) {
> +		u32 tag = blk_mq_unique_tag(scmd->request);
> +
> +		cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(tag) +
> +			instance->low_latency_index_start;
> +	} else {
>  		cmd->request_desc->SCSIIO.MSIxIndex =
>  			instance->reply_map[raw_smp_processor_id()];
> +	}
>  }
>  
>  /**
> @@ -956,9 +961,6 @@ megasas_alloc_cmds_fusion(struct megasas_instance *instance)
>  	if (megasas_alloc_cmdlist_fusion(instance))
>  		goto fail_exit;
>  
> -	dev_info(&instance->pdev->dev, "Configured max firmware commands: %d\n",
> -		 instance->max_fw_cmds);
> -
>  	/* The first 256 bytes (SMID 0) is not used. Don't add to the cmd list */
>  	io_req_base = fusion->io_request_frames + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
>  	io_req_base_phys = fusion->io_request_frames_phys + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
> @@ -1102,8 +1104,9 @@ megasas_ioc_init_fusion(struct megasas_instance *instance)
>  		MR_HIGH_IOPS_QUEUE_COUNT) && cur_intr_coalescing)
>  		instance->perf_mode = MR_BALANCED_PERF_MODE;
>  
> -	dev_info(&instance->pdev->dev, "Performance mode :%s\n",
> -		MEGASAS_PERF_MODE_2STR(instance->perf_mode));
> +	dev_info(&instance->pdev->dev, "Performance mode :%s (latency index = %d)\n",
> +		MEGASAS_PERF_MODE_2STR(instance->perf_mode),
> +		instance->low_latency_index_start);
>  
>  	instance->fw_sync_cache_support = (scratch_pad_1 &
>  		MR_CAN_HANDLE_SYNC_CACHE_OFFSET) ? 1 : 0;
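
The new nr_hw_queues > 1 branch in megasas_get_msix_index() derives the
reply queue from the blk-mq unique tag. A minimal user-space sketch of
that arithmetic is shown below, assuming the usual blk-mq encoding in
which the hardware queue number sits above a 16-bit per-queue tag. The
function and macro names are hypothetical and only imitate
blk_mq_unique_tag() and blk_mq_unique_tag_to_hwq(); this is not the
kernel implementation.

```c
#include <stdint.h>

/* Assumed layout: hardware queue number above a 16-bit per-queue tag. */
#define UNIQUE_TAG_BITS 16
#define UNIQUE_TAG_MASK ((1u << UNIQUE_TAG_BITS) - 1)

/* Pack a hardware queue number and a per-queue tag into one value. */
static uint32_t unique_tag(uint32_t hwq, uint32_t tag)
{
	return (hwq << UNIQUE_TAG_BITS) | (tag & UNIQUE_TAG_MASK);
}

/* Recover the hardware queue number from a packed unique tag. */
static uint32_t unique_tag_to_hwq(uint32_t utag)
{
	return utag >> UNIQUE_TAG_BITS;
}

/*
 * MSI-x index as computed by the new branch: the hardware queue number
 * offset past the reply queues that precede low_latency_index_start.
 */
static uint32_t msix_index(uint32_t utag, uint32_t low_latency_index_start)
{
	return unique_tag_to_hwq(utag) + low_latency_index_start;
}
```

For example, a request with tag 123 on hardware queue 5 and a latency
index of 1 would be steered to MSI-x index 6, so completions stay on the
vector that blk_mq_pci_map_queues() associated with the submitting queue.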
Kashyap Desai Nov. 2, 2020, 2:31 p.m. UTC | #2
> On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> > From: Kashyap Desai <kashyap.desai@broadcom.com>
> >
> > Fusion adapters can steer completions to individual queues, and we now
> > have support for shared host-wide tags.
> > So we can enable multiqueue support for fusion adapters.
> >
> > Once driver enable shared host-wide tags, cpu hotplug feature is also
> > supported as it was enabled using below patchsets - commit
> > bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are
> > offline")
> >
> > Currently driver has provision to disable host-wide tags using
> > "host_tagset_enable" module parameter.
> >
> > Once we do not have any major performance regression using host-wide
> > tags, we will drop the hand-crafted interrupt affinity settings.
> >
> > Performance is also meeting the expectation - (used both none and
> > mq-deadline scheduler)
> > 24 Drive SSD on Aero with/without this patch can get 3.1M IOPs
> > 3 VDs consist of 8 SAS SSD on Aero with/without this patch can get
> > 3.1M IOPs.
> >
> > Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
> > Signed-off-by: Hannes Reinecke <hare@suse.com>
> > Signed-off-by: John Garry <john.garry@huawei.com>
>
> Reverting this commit fixed an issue that Dell Power Edge R6415 server
> with megaraid_sas is unable to boot.


I will take a look at this. BTW, can you try keeping the same patch but
loading the driver with the module parameter "host_tagset_enable=0"?

Kashyap
John Garry Nov. 2, 2020, 2:51 p.m. UTC | #3
On 02/11/2020 14:17, Qian Cai wrote:
> [  251.961152][  T330] INFO: task systemd-udevd:567 blocked for more than 122 seconds.
> [  251.968876][  T330]       Not tainted 5.10.0-rc1-next-20201102 #1
> [  251.975003][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  251.983546][  T330] task:systemd-udevd   state:D stack:27224 pid:  567 ppid:   506 flags:0x00004324
> [  251.992620][  T330] Call Trace:
> [  251.995784][  T330]  __schedule+0x71d/0x1b60
> [  252.000067][  T330]  ? __sched_text_start+0x8/0x8
> [  252.004798][  T330]  schedule+0xbf/0x270
> [  252.008735][  T330]  schedule_timeout+0x3fc/0x590
> [  252.013464][  T330]  ? usleep_range+0x120/0x120
> [  252.018008][  T330]  ? wait_for_completion+0x156/0x250
> [  252.023176][  T330]  ? lock_downgrade+0x700/0x700
> [  252.027886][  T330]  ? rcu_read_unlock+0x40/0x40
> [  252.032530][  T330]  ? do_raw_spin_lock+0x121/0x290
> [  252.037412][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [  252.043268][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> [  252.048331][  T330]  wait_for_completion+0x15e/0x250
> [  252.053323][  T330]  ? wait_for_completion_interruptible+0x320/0x320
> [  252.059687][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [  252.065543][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> [  252.070606][  T330]  __flush_work+0x42a/0x900
> [  252.074989][  T330]  ? queue_delayed_work_on+0x90/0x90
> [  252.080139][  T330]  ? __queue_work+0x463/0xf40
> [  252.084700][  T330]  ? init_pwq+0x320/0x320
> [  252.088891][  T330]  ? queue_work_on+0x5e/0x80
> [  252.093364][  T330]  ? trace_hardirqs_on+0x1c/0x150
> [  252.098255][  T330]  work_on_cpu+0xe7/0x130
> [  252.102461][  T330]  ? flush_delayed_work+0xc0/0xc0
> [  252.107342][  T330]  ? __mutex_unlock_slowpath+0xd4/0x670
> [  252.112764][  T330]  ? work_debug_hint+0x30/0x30
> [  252.117391][  T330]  ? pci_device_shutdown+0x80/0x80
> [  252.122378][  T330]  ? cpumask_next_and+0x57/0x80
> [  252.127094][  T330]  pci_device_probe+0x500/0x5c0
> [  252.131824][  T330]  ? pci_device_remove+0x1f0/0x1f0

Is CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled? I figure it is, with this call.

Or please share the .config

Cheers,
John

> [  252.136805][  T330]  really_probe+0x207/0xad0
> [  252.141191][  T330]  ? device_driver_attach+0x120/0x120
> [  252.146428][  T330]  driver_probe_device+0x1f1/0x370
> [  252.151424][  T330]  device_driver_attach+0xe5/0x120
> [  252.156399][  T330]  __driver_attach+0xf0/0x260
> [  252.160953][  T330]  bus_for_each_dev+0x117/0x1a0
> [  252.165669][  T330]  ? subsys_dev_iter_exit+0x10/0x10
> [  252.170731][  T330]  bus_add_driver+0x399/0x560
> [  252.175289][  T330]  driver_register+0x189/0x310
> [  252.179919][  T330]  ? 0xffffffffc05c1000
> [  252.183960][  T330]  megasas_init+0x117/0x1000 [megaraid_sas]
> [  252.189713][  T330]  do_one_initcall+0xf6/0x510
> [  252.194267][  T330]  ? perf_trace_initcall_level+0x490/0x490
> [  252.199940][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> [  252.205104][  T330]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
> [  252.210859][  T330]  ? do_init_module+0x49/0x6c0
> [  252.215500][  T330]  ? kmem_cache_alloc_trace+0x11f/0x1e0
> [  252.220925][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> [  252.226068][  T330]  do_init_module+0x1ed/0x6c0
> [  252.230608][  T330]  load_module+0x4a59/0x5d20
> [  252.235081][  T330]  ? layout_and_allocate+0x2770/0x2770
> [  252.240404][  T330]  ? __vmalloc_node+0x8d/0x100
> [  252.245046][  T330]  ? kernel_read_file+0x485/0x5a0
> [  252.249934][  T330]  ? kernel_read_file+0x305/0x5a0
> [  252.254839][  T330]  ? __x64_sys_fsconfig+0x970/0x970
> [  252.259903][  T330]  ? __do_sys_finit_module+0xff/0x180
> [  252.265153][  T330]  __do_sys_finit_module+0xff/0x180
> [  252.270216][  T330]  ? __do_sys_init_module+0x1d0/0x1d0
> [  252.275465][  T330]  ? __fget_files+0x1c3/0x2e0
> [  252.280010][  T330]  do_syscall_64+0x33/0x40
> [  252.284304][  T330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  252.290054][  T330] RIP: 0033:0x7fbb3e2fa78d
> [  252.294348][  T330] Code: Unable to access opcode bytes at RIP 0x7fbb3e2fa763.
> [  252.301584][  T330] RSP: 002b:00007ffe572e8d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [  252.309855][  T330] RAX: ffffffffffffffda RBX: 000055c7795d90f0 RCX: 00007fbb3e2fa78d
> [  252.317703][  T330] RDX: 0000000000000000 RSI: 00007fbb3ee6c82d RDI: 0000000000000006
> [  252.325553][  T330] RBP: 00007fbb3ee6c82d R08: 0000000000000000 R09: 00007ffe572e8e40
> [  252.333402][  T330] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
> [  252.341257][  T330] R13: 000055c7795930e0 R14: 0000000000020000 R15: 0000000000000000
> [  252.349117][  T330]
Qian Cai Nov. 2, 2020, 3:18 p.m. UTC | #4
On Mon, 2020-11-02 at 14:51 +0000, John Garry wrote:
> On 02/11/2020 14:17, Qian Cai wrote:
> > [  251.961152][  T330] INFO: task systemd-udevd:567 blocked for more than 122 seconds.
> > [  251.968876][  T330]       Not tainted 5.10.0-rc1-next-20201102 #1
> > [  251.975003][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  251.983546][  T330] task:systemd-udevd   state:D stack:27224 pid:  567 ppid:   506 flags:0x00004324
> > [  251.992620][  T330] Call Trace:
> > [  251.995784][  T330]  __schedule+0x71d/0x1b60
> > [  252.000067][  T330]  ? __sched_text_start+0x8/0x8
> > [  252.004798][  T330]  schedule+0xbf/0x270
> > [  252.008735][  T330]  schedule_timeout+0x3fc/0x590
> > [  252.013464][  T330]  ? usleep_range+0x120/0x120
> > [  252.018008][  T330]  ? wait_for_completion+0x156/0x250
> > [  252.023176][  T330]  ? lock_downgrade+0x700/0x700
> > [  252.027886][  T330]  ? rcu_read_unlock+0x40/0x40
> > [  252.032530][  T330]  ? do_raw_spin_lock+0x121/0x290
> > [  252.037412][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> > [  252.043268][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> > [  252.048331][  T330]  wait_for_completion+0x15e/0x250
> > [  252.053323][  T330]  ? wait_for_completion_interruptible+0x320/0x320
> > [  252.059687][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> > [  252.065543][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> > [  252.070606][  T330]  __flush_work+0x42a/0x900
> > [  252.074989][  T330]  ? queue_delayed_work_on+0x90/0x90
> > [  252.080139][  T330]  ? __queue_work+0x463/0xf40
> > [  252.084700][  T330]  ? init_pwq+0x320/0x320
> > [  252.088891][  T330]  ? queue_work_on+0x5e/0x80
> > [  252.093364][  T330]  ? trace_hardirqs_on+0x1c/0x150
> > [  252.098255][  T330]  work_on_cpu+0xe7/0x130
> > [  252.102461][  T330]  ? flush_delayed_work+0xc0/0xc0
> > [  252.107342][  T330]  ? __mutex_unlock_slowpath+0xd4/0x670
> > [  252.112764][  T330]  ? work_debug_hint+0x30/0x30
> > [  252.117391][  T330]  ? pci_device_shutdown+0x80/0x80
> > [  252.122378][  T330]  ? cpumask_next_and+0x57/0x80
> > [  252.127094][  T330]  pci_device_probe+0x500/0x5c0
> > [  252.131824][  T330]  ? pci_device_remove+0x1f0/0x1f0
>
> Is CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled? I figure it is, with this call.
>
> Or please share the .config

No. https://cailca.coding.net/public/linux/mm/git/files/master/x86.config

>
> Cheers,
> John
>
> > [  252.136805][  T330]  really_probe+0x207/0xad0
> > [  252.141191][  T330]  ? device_driver_attach+0x120/0x120
> > [  252.146428][  T330]  driver_probe_device+0x1f1/0x370
> > [  252.151424][  T330]  device_driver_attach+0xe5/0x120
> > [  252.156399][  T330]  __driver_attach+0xf0/0x260
> > [  252.160953][  T330]  bus_for_each_dev+0x117/0x1a0
> > [  252.165669][  T330]  ? subsys_dev_iter_exit+0x10/0x10
> > [  252.170731][  T330]  bus_add_driver+0x399/0x560
> > [  252.175289][  T330]  driver_register+0x189/0x310
> > [  252.179919][  T330]  ? 0xffffffffc05c1000
> > [  252.183960][  T330]  megasas_init+0x117/0x1000 [megaraid_sas]
> > [  252.189713][  T330]  do_one_initcall+0xf6/0x510
> > [  252.194267][  T330]  ? perf_trace_initcall_level+0x490/0x490
> > [  252.199940][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> > [  252.205104][  T330]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
> > [  252.210859][  T330]  ? do_init_module+0x49/0x6c0
> > [  252.215500][  T330]  ? kmem_cache_alloc_trace+0x11f/0x1e0
> > [  252.220925][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> > [  252.226068][  T330]  do_init_module+0x1ed/0x6c0
> > [  252.230608][  T330]  load_module+0x4a59/0x5d20
> > [  252.235081][  T330]  ? layout_and_allocate+0x2770/0x2770
> > [  252.240404][  T330]  ? __vmalloc_node+0x8d/0x100
> > [  252.245046][  T330]  ? kernel_read_file+0x485/0x5a0
> > [  252.249934][  T330]  ? kernel_read_file+0x305/0x5a0
> > [  252.254839][  T330]  ? __x64_sys_fsconfig+0x970/0x970
> > [  252.259903][  T330]  ? __do_sys_finit_module+0xff/0x180
> > [  252.265153][  T330]  __do_sys_finit_module+0xff/0x180
> > [  252.270216][  T330]  ? __do_sys_init_module+0x1d0/0x1d0
> > [  252.275465][  T330]  ? __fget_files+0x1c3/0x2e0
> > [  252.280010][  T330]  do_syscall_64+0x33/0x40
> > [  252.284304][  T330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  252.290054][  T330] RIP: 0033:0x7fbb3e2fa78d
> > [  252.294348][  T330] Code: Unable to access opcode bytes at RIP 0x7fbb3e2fa763.
> > [  252.301584][  T330] RSP: 002b:00007ffe572e8d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > [  252.309855][  T330] RAX: ffffffffffffffda RBX: 000055c7795d90f0 RCX: 00007fbb3e2fa78d
> > [  252.317703][  T330] RDX: 0000000000000000 RSI: 00007fbb3ee6c82d RDI: 0000000000000006
> > [  252.325553][  T330] RBP: 00007fbb3ee6c82d R08: 0000000000000000 R09: 00007ffe572e8e40
> > [  252.333402][  T330] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
> > [  252.341257][  T330] R13: 000055c7795930e0 R14: 0000000000020000 R15: 0000000000000000
> > [  252.349117][  T330]
Qian Cai Nov. 2, 2020, 3:24 p.m. UTC | #5
On Mon, 2020-11-02 at 20:01 +0530, Kashyap Desai wrote:
> > On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> > > From: Kashyap Desai <kashyap.desai@broadcom.com>
> > > 
> > > Fusion adapters can steer completions to individual queues, and we now
> > > have support for shared host-wide tags.
> > > So we can enable multiqueue support for fusion adapters.
> > > 
> > > Once driver enable shared host-wide tags, cpu hotplug feature is also
> > > supported as it was enabled using below patchsets - commit
> > > bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are
> > > offline")
> > > 
> > > Currently driver has provision to disable host-wide tags using
> > > "host_tagset_enable" module parameter.
> > > 
> > > Once we do not have any major performance regression using host-wide
> > > tags, we will drop the hand-crafted interrupt affinity settings.
> > > 
> > > Performance is also meeting the expecatation - (used both none and
> > > mq-deadline scheduler)
> > > 24 Drive SSD on Aero with/without this patch can get 3.1M IOPs
> > > 3 VDs consist of 8 SAS SSD on Aero with/without this patch can get
> > > 3.1M IOPs.
> > > 
> > > Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
> > > Signed-off-by: Hannes Reinecke <hare@suse.com>
> > > Signed-off-by: John Garry <john.garry@huawei.com>
> > 
> > Reverting this commit fixed an issue that Dell Power Edge R6415 server with
> > megaraid_sas is unable to boot.
> 
> I will take a look at this. BTW, can you try keeping same PATCH but use
> module parameter "host_tagset_enable =0"

Yes, that also works.
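
For readers following along, the workaround confirmed above can be applied in the usual ways for a module parameter. The following is a minimal sketch, assuming the driver is loaded as the megaraid_sas module with the "host_tagset_enable" parameter named in the commit message. The helper name show_host_tagset and the modprobe.d file name are made up for illustration, and whether the parameter is readable under /sys/module depends on the permissions the driver gives it. If the module is loaded from an initramfs, the initramfs would probably need regenerating for the modprobe.d variant to take effect.

```shell
# Option 1: kernel command line (covers built-in and modular builds):
#   megaraid_sas.host_tagset_enable=0
#
# Option 2: modprobe configuration (file name is illustrative), e.g. in
# /etc/modprobe.d/megaraid_sas.conf:
#   options megaraid_sas host_tagset_enable=0

# Hypothetical helper: print the value currently in effect, or "unknown"
# when the parameter file is not readable (module absent, or not exported).
show_host_tagset() {
    param=${1:-/sys/module/megaraid_sas/parameters/host_tagset_enable}
    if [ -r "$param" ]; then
        cat "$param"
    else
        echo "unknown"
    fi
}
```

Calling show_host_tagset with no argument after boot should then report whether the setting was picked up.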
John Garry Nov. 3, 2020, 10:54 a.m. UTC | #6
On 02/11/2020 15:18, Qian Cai wrote:
> On Mon, 2020-11-02 at 14:51 +0000, John Garry wrote:
>> On 02/11/2020 14:17, Qian Cai wrote:
>>> [  251.961152][  T330] INFO: task systemd-udevd:567 blocked for more than
>>> 122 seconds.
>>> [  251.968876][  T330]       Not tainted 5.10.0-rc1-next-20201102 #1
>>> [  251.975003][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>> disables this message.
>>> [  251.983546][  T330] task:systemd-udevd   state:D stack:27224 pid:  567
>>> ppid:   506 flags:0x00004324
>>> [  251.992620][  T330] Call Trace:
>>> [  251.995784][  T330]  __schedule+0x71d/0x1b60
>>> [  252.000067][  T330]  ? __sched_text_start+0x8/0x8
>>> [  252.004798][  T330]  schedule+0xbf/0x270
>>> [  252.008735][  T330]  schedule_timeout+0x3fc/0x590
>>> [  252.013464][  T330]  ? usleep_range+0x120/0x120
>>> [  252.018008][  T330]  ? wait_for_completion+0x156/0x250
>>> [  252.023176][  T330]  ? lock_downgrade+0x700/0x700
>>> [  252.027886][  T330]  ? rcu_read_unlock+0x40/0x40
>>> [  252.032530][  T330]  ? do_raw_spin_lock+0x121/0x290
>>> [  252.037412][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
>>> [  252.043268][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
>>> [  252.048331][  T330]  wait_for_completion+0x15e/0x250
>>> [  252.053323][  T330]  ? wait_for_completion_interruptible+0x320/0x320
>>> [  252.059687][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
>>> [  252.065543][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
>>> [  252.070606][  T330]  __flush_work+0x42a/0x900
>>> [  252.074989][  T330]  ? queue_delayed_work_on+0x90/0x90
>>> [  252.080139][  T330]  ? __queue_work+0x463/0xf40
>>> [  252.084700][  T330]  ? init_pwq+0x320/0x320
>>> [  252.088891][  T330]  ? queue_work_on+0x5e/0x80
>>> [  252.093364][  T330]  ? trace_hardirqs_on+0x1c/0x150
>>> [  252.098255][  T330]  work_on_cpu+0xe7/0x130
>>> [  252.102461][  T330]  ? flush_delayed_work+0xc0/0xc0
>>> [  252.107342][  T330]  ? __mutex_unlock_slowpath+0xd4/0x670
>>> [  252.112764][  T330]  ? work_debug_hint+0x30/0x30
>>> [  252.117391][  T330]  ? pci_device_shutdown+0x80/0x80
>>> [  252.122378][  T330]  ? cpumask_next_and+0x57/0x80
>>> [  252.127094][  T330]  pci_device_probe+0x500/0x5c0
>>> [  252.131824][  T330]  ? pci_device_remove+0x1f0/0x1f0
>>
>> Is CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled? I figure it is, with this call.
>>
>> Or please share the .config
> 
> No. https://cailca.coding.net/public/linux/mm/git/files/master/x86.config
> 

Thanks. FWIW, I just tested another megaraid_sas card on linux-next (02
Nov) with a vanilla arm64 defconfig and no special command-line
parameters, and found no issue:

dmesg | grep mega
[30.031739] megasas: 07.714.04.00-rc1
[30.039749] megaraid_sas 0000:08:00.0: Adding to iommu group 0
[30.053247] megaraid_sas 0000:08:00.0: BAR:0x0  BAR's base_addr(phys):0x0000080010000000  mapped virt_addr:0x(____ptrval____)
[30.053251] megaraid_sas 0000:08:00.0: FW now in Ready state
[30.065162] megaraid_sas 0000:08:00.0: 63 bit DMA mask and 63 bit consistent mask
[30.081197] megaraid_sas 0000:08:00.0: firmware supports msix  : (128)
[30.096349] megaraid_sas 0000:08:00.0: requested/available msix 128/128
[30.110277] megaraid_sas 0000:08:00.0: current msix/online cpus: (128/128)
[30.124917] megaraid_sas 0000:08:00.0: RDPQ mode  : (enabled)
[30.136821] megaraid_sas 0000:08:00.0: Current firmware supports maximum commands: 4077 LDIO threshold: 0
[30.208538] megaraid_sas 0000:08:00.0: Performance mode :Latency (latency index = 1)
[30.224838] megaraid_sas 0000:08:00.0: FW supports sync cache  : Yes
[30.238021] megaraid_sas 0000:08:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[30.311960] megaraid_sas 0000:08:00.0: FW provided supportMaxExtLDs: 1 max_lds: 64
[30.327885] megaraid_sas 0000:08:00.0: controller type : MR(2048MB)
[30.341066] megaraid_sas 0000:08:00.0: Online Controller Reset(OCR)  : Enabled
[30.356076] megaraid_sas 0000:08:00.0: Secure JBOD support: Yes
[30.368710] megaraid_sas 0000:08:00.0: NVMe passthru support : Yes
[30.381708] megaraid_sas 0000:08:00.0: FW provided TM TaskAbort/Reset timeout  : 6 secs/60 secs
[30.399825] megaraid_sas 0000:08:00.0: JBOD sequence map support  : Yes
[30.413552] megaraid_sas 0000:08:00.0: PCI Lane Margining support : No
[30.452059] megaraid_sas 0000:08:00.0: NVME page size  : (4096)
[30.465079] megaraid_sas 0000:08:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[30.485208] megaraid_sas 0000:08:00.0: INIT adapter done
[30.496609] megaraid_sas 0000:08:00.0: pci id : (0x1000)/(0x0016)/(0x19e5)/(0xd215)
[30.512931] megaraid_sas 0000:08:00.0: unevenspan support : no
[30.525199] megaraid_sas 0000:08:00.0: firmware crash dump: no
[30.537649] megaraid_sas 0000:08:00.0: JBOD sequence map  : enabled
[30.550743] megaraid_sas 0000:08:00.0: Max firmware commands: 4076 shared with nr_hw_queues = 127

john@ubuntu:~$ lspci -s 08:00.0 -v
08:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID Tri-Mode SAS3508 (rev 01)
   Subsystem: Huawei Technologies Co., Ltd. MegaRAID Tri-Mode SAS3508
   Flags: bus master, fast devsel, latency 0, IRQ 41, NUMA node 0
   Memory at 80010000000 (64-bit, prefetchable) [size=1M]
   Memory at 80010100000 (64-bit, prefetchable) [size=1M]
   Memory at e9400000 (32-bit, non-prefetchable) [size=1M]
   I/O ports at 1000 [size=256]
   Expansion ROM at e9500000 [disabled] [size=1M]
   Capabilities: <access denied>
   Kernel driver in use: megaraid_sas

I have no x86 system to test that x86 config, though. How about 
v5.10-rc2 for this issue?

Thanks,
John


>>
>> Cheers,
>> John
>>
>>> [  252.136805][  T330]  really_probe+0x207/0xad0
>>> [  252.141191][  T330]  ? device_driver_attach+0x120/0x120
>>> [  252.146428][  T330]  driver_probe_device+0x1f1/0x370
>>> [  252.151424][  T330]  device_driver_attach+0xe5/0x120
>>> [  252.156399][  T330]  __driver_attach+0xf0/0x260
>>> [  252.160953][  T330]  bus_for_each_dev+0x117/0x1a0
>>> [  252.165669][  T330]  ? subsys_dev_iter_exit+0x10/0x10
>>> [  252.170731][  T330]  bus_add_driver+0x399/0x560
>>> [  252.175289][  T330]  driver_register+0x189/0x310
>>> [  252.179919][  T330]  ? 0xffffffffc05c1000
>>> [  252.183960][  T330]  megasas_init+0x117/0x1000 [megaraid_sas]
>>> [  252.189713][  T330]  do_one_initcall+0xf6/0x510
>>> [  252.194267][  T330]  ? perf_trace_initcall_level+0x490/0x490
>>> [  252.199940][  T330]  ? kasan_unpoison_shadow+0x30/0x40
>>> [  252.205104][  T330]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
>>> [  252.210859][  T330]  ? do_init_module+0x49/0x6c0
>>> [  252.215500][  T330]  ? kmem_cache_alloc_trace+0x11f/0x1e0
>>> [  252.220925][  T330]  ? kasan_unpoison_shadow+0x30/0x40
>>> [  252.226068][  T330]  do_init_module+0x1ed/0x6c0
>>> [  252.230608][  T330]  load_module+0x4a59/0x5d20
>>> [  252.235081][  T330]  ? layout_and_allocate+0x2770/0x2770
>>> [  252.240404][  T330]  ? __vmalloc_node+0x8d/0x100
>>> [  252.245046][  T330]  ? kernel_read_file+0x485/0x5a0
>>> [  252.249934][  T330]  ? kernel_read_file+0x305/0x5a0
>>> [  252.254839][  T330]  ? __x64_sys_fsconfig+0x970/0x970
>>> [  252.259903][  T330]  ? __do_sys_finit_module+0xff/0x180
>>> [  252.265153][  T330]  __do_sys_finit_module+0xff/0x180
>>> [  252.270216][  T330]  ? __do_sys_init_module+0x1d0/0x1d0
>>> [  252.275465][  T330]  ? __fget_files+0x1c3/0x2e0
>>> [  252.280010][  T330]  do_syscall_64+0x33/0x40
>>> [  252.284304][  T330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  252.290054][  T330] RIP: 0033:0x7fbb3e2fa78d
>>> [  252.294348][  T330] Code: Unable to access opcode bytes at RIP
>>> 0x7fbb3e2fa763.
>>> [  252.301584][  T330] RSP: 002b:00007ffe572e8d18 EFLAGS: 00000246 ORIG_RAX:
>>> 0000000000000139
>>> [  252.309855][  T330] RAX: ffffffffffffffda RBX: 000055c7795d90f0 RCX:
>>> 00007fbb3e2fa78d
>>> [  252.317703][  T330] RDX: 0000000000000000 RSI: 00007fbb3ee6c82d RDI:
>>> 0000000000000006
>>> [  252.325553][  T330] RBP: 00007fbb3ee6c82d R08: 0000000000000000 R09:
>>> 00007ffe572e8e40
>>> [  252.333402][  T330] R10: 0000000000000006 R11: 0000000000000246 R12:
>>> 0000000000000000
>>> [  252.341257][  T330] R13: 000055c7795930e0 R14: 0000000000020000 R15:
>>> 0000000000000000
>>> [  252.349117][  T330]
> 
> .
>
Qian Cai Nov. 3, 2020, 1:04 p.m. UTC | #7
On Tue, 2020-11-03 at 10:54 +0000, John Garry wrote:
> I have no x86 system to test that x86 config, though. How about
> v5.10-rc2 for this issue?

v5.10-rc2 is also broken here.

[  251.941451][  T330] INFO: task systemd-udevd:551 blocked for more than 122 seconds.
[  251.949176][  T330]       Not tainted 5.10.0-rc2 #3
[  251.954094][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  251.962633][  T330] task:systemd-udevd   state:D stack:27160 pid:  551 ppid:   506 flags:0x00000324
[  251.971707][  T330] Call Trace:
[  251.974871][  T330]  __schedule+0x71d/0x1b50
[  251.979155][  T330]  ? kcore_callback+0x1d/0x1d
[  251.983709][  T330]  schedule+0xbf/0x270
[  251.987640][  T330]  schedule_timeout+0x3fc/0x590
[  251.992370][  T330]  ? usleep_range+0x120/0x120
[  251.996910][  T330]  ? wait_for_completion+0x156/0x250
[  252.002080][  T330]  ? lock_downgrade+0x700/0x700
[  252.006792][  T330]  ? rcu_read_unlock+0x40/0x40
[  252.011435][  T330]  ? do_raw_spin_lock+0x121/0x290
[  252.016324][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[  252.022178][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
[  252.027235][  T330]  wait_for_completion+0x15e/0x250
[  252.032226][  T330]  ? wait_for_completion_interruptible+0x2f0/0x2f0
[  252.038590][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[  252.044443][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
[  252.049502][  T330]  __flush_work+0x42a/0x900
[  252.053882][  T330]  ? queue_delayed_work_on+0x90/0x90
[  252.059025][  T330]  ? __queue_work+0x463/0xf40
[  252.063583][  T330]  ? init_pwq+0x320/0x320
[  252.067777][  T330]  ? queue_work_on+0x5e/0x80
[  252.072249][  T330]  ? trace_hardirqs_on+0x1c/0x150
[  252.077138][  T330]  work_on_cpu+0xe7/0x130
[  252.081347][  T330]  ? flush_delayed_work+0xc0/0xc0
[  252.086231][  T330]  ? __mutex_unlock_slowpath+0xd4/0x670
[  252.091655][  T330]  ? work_debug_hint+0x30/0x30
[  252.096284][  T330]  ? pci_device_shutdown+0x80/0x80
[  252.101274][  T330]  ? cpumask_next_and+0x57/0x80
[  252.105990][  T330]  pci_device_probe+0x500/0x5c0
[  252.110703][  T330]  ? pci_device_remove+0x1f0/0x1f0
[  252.115697][  T330]  really_probe+0x207/0xad0
[  252.120065][  T330]  ? device_driver_attach+0x120/0x120
[  252.125317][  T330]  driver_probe_device+0x1f1/0x370
[  252.130291][  T330]  device_driver_attach+0xe5/0x120
[  252.135281][  T330]  __driver_attach+0xf0/0x260
[  252.139827][  T330]  bus_for_each_dev+0x117/0x1a0
[  252.144552][  T330]  ? subsys_dev_iter_exit+0x10/0x10
[  252.149609][  T330]  bus_add_driver+0x399/0x560
[  252.154166][  T330]  driver_register+0x189/0x310
[  252.158795][  T330]  ? 0xffffffffc05c5000
[  252.162838][  T330]  megasas_init+0x117/0x1000 [megaraid_sas]
[  252.168593][  T330]  do_one_initcall+0xf6/0x510
[  252.173143][  T330]  ? perf_trace_initcall_level+0x490/0x490
[  252.178809][  T330]  ? kasan_unpoison_shadow+0x30/0x40
[  252.183973][  T330]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
[  252.189728][  T330]  ? do_init_module+0x49/0x6c0
[  252.194370][  T330]  ? kmem_cache_alloc_trace+0x12e/0x2a0
[  252.199780][  T330]  ? kasan_unpoison_shadow+0x30/0x40
[  252.204942][  T330]  do_init_module+0x1ed/0x6c0
[  252.209479][  T330]  load_module+0x4a25/0x5cf0
[  252.213950][  T330]  ? layout_and_allocate+0x2770/0x2770
[  252.219271][  T330]  ? __vmalloc_node+0x8d/0x100
[  252.223913][  T330]  ? kernel_read_file+0x485/0x5a0
[  252.228796][  T330]  ? kernel_read_file+0x305/0x5a0
[  252.233696][  T330]  ? __ia32_sys_fsconfig+0x6a0/0x6a0
[  252.238841][  T330]  ? __do_sys_finit_module+0xff/0x180
[  252.244093][  T330]  __do_sys_finit_module+0xff/0x180
[  252.249155][  T330]  ? __do_sys_init_module+0x1d0/0x1d0
[  252.254403][  T330]  ? __fget_files+0x1c3/0x2e0
[  252.258940][  T330]  do_syscall_64+0x33/0x40
[  252.263234][  T330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  252.268984][  T330] RIP: 0033:0x7f7cf6a4878d
[  252.273276][  T330] Code: Unable to access opcode bytes at RIP 0x7f7cf6a48763.
[  252.280499][  T330] RSP: 002b:00007ffcfa94b978 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  252.288781][  T330] RAX: ffffffffffffffda RBX: 000055e01f48b730 RCX: 00007f7cf6a4878d
[  252.296628][  T330] RDX: 0000000000000000 RSI: 00007f7cf75ba82d RDI: 0000000000000006
[  252.304482][  T330] RBP: 00007f7cf75ba82d R08: 0000000000000000 R09: 00007ffcfa94baa0
[  252.312331][  T330] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
[  252.320167][  T330] R13: 000055e01f433530 R14: 0000000000020000 R15: 0000000000000000
[  252.328052][  T330] 
[  252.328052][  T330] Showing all locks held in the system:
[  252.335722][  T330] 3 locks held by kworker/3:1/289:
[  252.340697][  T330]  #0: ffff8881001eb338 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x7ec/0x1610
[  252.350906][  T330]  #1: ffffc90004ef7e00 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x820/0x1610
[  252.361725][  T330]  #2: ffff88810dc600e0 (&shost->scan_mutex){+.+.}-{3:3}, at: scsi_scan_host_selected+0xde/0x260
[  252.372132][  T330] 1 lock held by khungtaskd/330:
[  252.376933][  T330]  #0: ffffffffb42d2de0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.52+0x0/0x30
[  252.387234][  T330] 1 lock held by systemd-journal/398:
[  252.392489][  T330] 1 lock held by systemd-udevd/551:
[  252.397550][  T330]  #0: ffff888109a49218 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0x37/0x120
[  252.407085][  T330] 
[  252.409285][  T330] =============================================
[  252.409285][  T330]
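
The held-locks section of a report like the one above can be condensed to "task: lock" pairs, which makes it easier to see who is holding what. Below is a rough sketch, assuming POSIX sh and awk, the kernel log on stdin, and each lock entry on a single line (entries wrapped by a mail client would be missed). The function name summarise_locks is made up for illustration.

```shell
# Hypothetical helper: read a hung-task report on stdin and print one
# "<comm>/<pid>: <lock class>" line per held lock.
summarise_locks() {
    awk '
        # "N lock(s) held by <comm>/<pid>:" starts a new holder.
        match($0, /locks? held by .+:[ \t]*$/) {
            holder = substr($0, RSTART, RLENGTH)
            sub(/^locks? held by /, "", holder)
            sub(/:[ \t]*$/, "", holder)
            next
        }
        # " #<n>: <address> (<lock class>){<state>}..." names one held lock.
        holder != "" && match($0, /#[0-9]+: [0-9a-f]+ [(]/) {
            lock = substr($0, RSTART + RLENGTH)
            # Drop the closing parenthesis, the {state} flags and the call site.
            sub(/[)][{].*$/, "", lock)
            print holder ": " lock
        }
    '
}
```

Something like "dmesg | summarise_locks" would be the expected use. For the dump above it should list kworker/3:1/289 against &shost->scan_mutex and systemd-udevd/551 against &dev->mutex, which suggests a probe worker busy in the SCSI host scan while udevd waits for it in work_on_cpu().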
Qian Cai Nov. 4, 2020, 3:21 p.m. UTC | #8
On Tue, 2020-11-03 at 08:04 -0500, Qian Cai wrote:
> On Tue, 2020-11-03 at 10:54 +0000, John Garry wrote:
> > I have no x86 system to test that x86 config, though. How about 
> > v5.10-rc2 for this issue?
> 
> v5.10-rc2 is also broken here.

John, Kashyap, any update on this? If this is going to take a while to fix
properly, should I send a patch to revert it, or at least disable the feature
by default for megaraid_sas in the meantime, so it no longer breaks existing
systems out there?

> 
> [  251.941451][  T330] INFO: task systemd-udevd:551 blocked for more than 122
> seconds.
> [  251.949176][  T330]       Not tainted 5.10.0-rc2 #3
> [  251.954094][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  251.962633][  T330] task:systemd-udevd   state:D stack:27160 pid:  551
> ppid:   506 flags:0x00000324
> [  251.971707][  T330] Call Trace:
> [  251.974871][  T330]  __schedule+0x71d/0x1b50
> [  251.979155][  T330]  ? kcore_callback+0x1d/0x1d
> [  251.983709][  T330]  schedule+0xbf/0x270
> [  251.987640][  T330]  schedule_timeout+0x3fc/0x590
> [  251.992370][  T330]  ? usleep_range+0x120/0x120
> [  251.996910][  T330]  ? wait_for_completion+0x156/0x250
> [  252.002080][  T330]  ? lock_downgrade+0x700/0x700
> [  252.006792][  T330]  ? rcu_read_unlock+0x40/0x40
> [  252.011435][  T330]  ? do_raw_spin_lock+0x121/0x290
> [  252.016324][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [  252.022178][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> [  252.027235][  T330]  wait_for_completion+0x15e/0x250
> [  252.032226][  T330]  ? wait_for_completion_interruptible+0x2f0/0x2f0
> [  252.038590][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [  252.044443][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> [  252.049502][  T330]  __flush_work+0x42a/0x900
> [  252.053882][  T330]  ? queue_delayed_work_on+0x90/0x90
> [  252.059025][  T330]  ? __queue_work+0x463/0xf40
> [  252.063583][  T330]  ? init_pwq+0x320/0x320
> [  252.067777][  T330]  ? queue_work_on+0x5e/0x80
> [  252.072249][  T330]  ? trace_hardirqs_on+0x1c/0x150
> [  252.077138][  T330]  work_on_cpu+0xe7/0x130
> [  252.081347][  T330]  ? flush_delayed_work+0xc0/0xc0
> [  252.086231][  T330]  ? __mutex_unlock_slowpath+0xd4/0x670
> [  252.091655][  T330]  ? work_debug_hint+0x30/0x30
> [  252.096284][  T330]  ? pci_device_shutdown+0x80/0x80
> [  252.101274][  T330]  ? cpumask_next_and+0x57/0x80
> [  252.105990][  T330]  pci_device_probe+0x500/0x5c0
> [  252.110703][  T330]  ? pci_device_remove+0x1f0/0x1f0
> [  252.115697][  T330]  really_probe+0x207/0xad0
> [  252.120065][  T330]  ? device_driver_attach+0x120/0x120
> [  252.125317][  T330]  driver_probe_device+0x1f1/0x370
> [  252.130291][  T330]  device_driver_attach+0xe5/0x120
> [  252.135281][  T330]  __driver_attach+0xf0/0x260
> [  252.139827][  T330]  bus_for_each_dev+0x117/0x1a0
> [  252.144552][  T330]  ? subsys_dev_iter_exit+0x10/0x10
> [  252.149609][  T330]  bus_add_driver+0x399/0x560
> [  252.154166][  T330]  driver_register+0x189/0x310
> [  252.158795][  T330]  ? 0xffffffffc05c5000
> [  252.162838][  T330]  megasas_init+0x117/0x1000 [megaraid_sas]
> [  252.168593][  T330]  do_one_initcall+0xf6/0x510
> [  252.173143][  T330]  ? perf_trace_initcall_level+0x490/0x490
> [  252.178809][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> [  252.183973][  T330]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
> [  252.189728][  T330]  ? do_init_module+0x49/0x6c0
> [  252.194370][  T330]  ? kmem_cache_alloc_trace+0x12e/0x2a0
> [  252.199780][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> [  252.204942][  T330]  do_init_module+0x1ed/0x6c0
> [  252.209479][  T330]  load_module+0x4a25/0x5cf0
> [  252.213950][  T330]  ? layout_and_allocate+0x2770/0x2770
> [  252.219271][  T330]  ? __vmalloc_node+0x8d/0x100
> [  252.223913][  T330]  ? kernel_read_file+0x485/0x5a0
> [  252.228796][  T330]  ? kernel_read_file+0x305/0x5a0
> [  252.233696][  T330]  ? __ia32_sys_fsconfig+0x6a0/0x6a0
> [  252.238841][  T330]  ? __do_sys_finit_module+0xff/0x180
> [  252.244093][  T330]  __do_sys_finit_module+0xff/0x180
> [  252.249155][  T330]  ? __do_sys_init_module+0x1d0/0x1d0
> [  252.254403][  T330]  ? __fget_files+0x1c3/0x2e0
> [  252.258940][  T330]  do_syscall_64+0x33/0x40
> [  252.263234][  T330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  252.268984][  T330] RIP: 0033:0x7f7cf6a4878d
> [  252.273276][  T330] Code: Unable to access opcode bytes at RIP
> 0x7f7cf6a48763.
> [  252.280499][  T330] RSP: 002b:00007ffcfa94b978 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000139
> [  252.288781][  T330] RAX: ffffffffffffffda RBX: 000055e01f48b730 RCX:
> 00007f7cf6a4878d
> [  252.296628][  T330] RDX: 0000000000000000 RSI: 00007f7cf75ba82d RDI:
> 0000000000000006
> [  252.304482][  T330] RBP: 00007f7cf75ba82d R08: 0000000000000000 R09:
> 00007ffcfa94baa0
> [  252.312331][  T330] R10: 0000000000000006 R11: 0000000000000246 R12:
> 0000000000000000
> [  252.320167][  T330] R13: 000055e01f433530 R14: 0000000000020000 R15:
> 0000000000000000
> [  252.328052][  T330] 
> [  252.328052][  T330] Showing all locks held in the system:
> [  252.335722][  T330] 3 locks held by kworker/3:1/289:
> [  252.340697][  T330]  #0: ffff8881001eb338 ((wq_completion)events){+.+.}-
> {0:0}, at: process_one_work+0x7ec/0x1610
> [  252.350906][  T330]  #1: ffffc90004ef7e00
> ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x820/0x1610
> [  252.361725][  T330]  #2: ffff88810dc600e0 (&shost->scan_mutex){+.+.}-{3:3}, 
> at: scsi_scan_host_selected+0xde/0x260
> [  252.372132][  T330] 1 lock held by khungtaskd/330:
> [  252.376933][  T330]  #0: ffffffffb42d2de0 (rcu_read_lock){....}-{1:2}, at:
> rcu_lock_acquire.constprop.52+0x0/0x30
> [  252.387234][  T330] 1 lock held by systemd-journal/398:
> [  252.392489][  T330] 1 lock held by systemd-udevd/551:
> [  252.397550][  T330]  #0: ffff888109a49218 (&dev->mutex){....}-{3:3}, at:
> device_driver_attach+0x37/0x120
> [  252.407085][  T330] 
> [  252.409285][  T330] =============================================
> [  252.409285][  T330] 
>
Kashyap Desai Nov. 4, 2020, 4:07 p.m. UTC | #9
> >
> > v5.10-rc2 is also broken here.
>
> John, Kashyap, any update on this? If this is going to take a while to fix it
> proper, should I send a patch to revert this or at least disable the feature by
> default for megaraid_sas in the meantime, so it no longer breaks the existing
> systems out there?

I am trying to get similar h/w to try this out. All my current h/w works fine.
Give me a couple of days.
If this is not an obviously common issue and needs more time, we will go with
disabling the feature through the module parameter.
I will let you know.

Kashyap
John Garry Nov. 4, 2020, 6:08 p.m. UTC | #10
On 04/11/2020 16:07, Kashyap Desai wrote:
>>>
>>> v5.10-rc2 is also broken here.
>>
>> John, Kashyap, any update on this? If this is going to take a while to fix it
>> proper, should I send a patch to revert this or at least disable the feature by
>> default for megaraid_sas in the meantime, so it no longer breaks the existing
>> systems out there?
>
> I am trying to get similar h/w to try out. All my current h/w works fine.
> Give me couple of days' time.
> If this is not obviously common issue and need time, we will go with module
> parameter disable method.
> I will let you know.

Hi Kashyap,

Please also consider disabling it for this card only, so that any other
possible issues are unearthed on other cards. Unfortunately, I don't have
this card or any x86 machine to test with, so I can't assist there.

BTW, just to be clear, did you try the same .config as Qian Cai?

Thanks,
John
Sumit Saxena Nov. 6, 2020, 7:25 p.m. UTC | #11
On Wed, Nov 4, 2020 at 11:38 PM John Garry <john.garry@huawei.com> wrote:
>
> On 04/11/2020 16:07, Kashyap Desai wrote:
> >>>
> >>> v5.10-rc2 is also broken here.
> >>
> >> John, Kashyap, any update on this? If this is going to take a while to fix it
> >> proper, should I send a patch to revert this or at least disable the feature by
> >> default for megaraid_sas in the meantime, so it no longer breaks the existing
> >> systems out there?
> >
> > I am trying to get similar h/w to try out. All my current h/w works fine.
> > Give me couple of days' time.
> > If this is not obviously common issue and need time, we will go with module
> > parameter disable method.
> > I will let you know.
>
> Hi Kashyap,
>
> Please also consider just disabling for this card, so any other possible
> issues are unearthed on other cards. I don't have this card or any x86
> machine to test it unfortunately to assist.
>
> BTW, just to be clear, did you try the same .config as Qian Cai?
>
> Thanks,
> John

I am able to hit the boot hang, with stack traces similar to those
reported by Qian, using the shared .config on an x86 machine.
In my case the system boots after a hang of 40-45 minutes. Qian, is that
true for you as well?
With the module parameter "host_tagset_enable=0", the issue is not seen.
Below is a snippet of the dmesg logs/traces observed during system
bootup; after a wait of 40-45 minutes, the drives attached to the
megaraid_sas adapter are discovered:

========================================
[ 1969.502913] INFO: task systemd-udevd:906 can't die for more than 1720 seconds.
[ 1969.597725] task:systemd-udevd   state:D stack:13456 pid:  906 ppid:   858 flags:0x00000324
[ 1969.597730] Call Trace:
[ 1969.597734]  __schedule+0x263/0x7f0
[ 1969.597737]  ? __lock_acquire+0x576/0xaf0
[ 1969.597739]  ? wait_for_completion+0x7b/0x110
[ 1969.597741]  schedule+0x4c/0xc0
[ 1969.597743]  schedule_timeout+0x244/0x2e0
[ 1969.597745]  ? find_held_lock+0x2d/0x90
[ 1969.597748]  ? wait_for_completion+0xa6/0x110
[ 1969.597750]  ? wait_for_completion+0x7b/0x110
[ 1969.597752]  ? lockdep_hardirqs_on_prepare+0xd4/0x170
[ 1969.597753]  ? wait_for_completion+0x7b/0x110
[ 1969.597755]  wait_for_completion+0xae/0x110
[ 1969.597757]  __flush_work+0x269/0x4b0
[ 1969.597760]  ? init_pwq+0xf0/0xf0
[ 1969.597763]  work_on_cpu+0x9c/0xd0
[ 1969.597765]  ? work_is_static_object+0x10/0x10
[ 1969.597768]  ? pci_device_shutdown+0x30/0x30
[ 1969.597770]  pci_device_probe+0x197/0x1b0
[ 1969.597773]  really_probe+0xda/0x410
[ 1969.597776]  driver_probe_device+0xd9/0x140
[ 1969.597778]  device_driver_attach+0x4a/0x50
[ 1969.597780]  __driver_attach+0x83/0x140
[ 1969.597782]  ? device_driver_attach+0x50/0x50
[ 1969.597784]  ? device_driver_attach+0x50/0x50
[ 1969.597787]  bus_for_each_dev+0x74/0xc0
[ 1969.597789]  bus_add_driver+0x14b/0x1f0
[ 1969.597791]  ? 0xffffffffc04fb000
[ 1969.597793]  driver_register+0x66/0xb0
[ 1969.597795]  ? 0xffffffffc04fb000
[ 1969.597801]  megasas_init+0xe7/0x1000 [megaraid_sas]
[ 1969.597803]  do_one_initcall+0x62/0x300
[ 1969.597806]  ? do_init_module+0x1d/0x200
[ 1969.597808]  ? kmem_cache_alloc_trace+0x296/0x2d0
[ 1969.597811]  do_init_module+0x55/0x200
[ 1969.597813]  load_module+0x15f2/0x17b0
[ 1969.597816]  ? __do_sys_finit_module+0xad/0x110
[ 1969.597818]  __do_sys_finit_module+0xad/0x110
[ 1969.597820]  do_syscall_64+0x33/0x40
[ 1969.597823]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1969.597825] RIP: 0033:0x7f66340262bd
[ 1969.597827] Code: Unable to access opcode bytes at RIP 0x7f6634026293.
[ 1969.597828] RSP: 002b:00007ffca1011f48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 1969.597831] RAX: ffffffffffffffda RBX: 000055f6720cf370 RCX: 00007f66340262bd
[ 1969.597833] RDX: 0000000000000000 RSI: 00007f6634b9880d RDI: 0000000000000006
[ 1969.597835] RBP: 00007f6634b9880d R08: 0000000000000000 R09: 00007ffca1012070
[ 1969.597836] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
[ 1969.597838] R13: 000055f6720cce70 R14: 0000000000020000 R15: 0000000000000000
[ 1969.597859]
               Showing all locks held in the system:
[ 1969.597862] 2 locks held by kworker/0:0/5:
[ 1969.597863]  #0: ffff9af800194b38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1e6/0x5e0
[ 1969.597872]  #1: ffffbf3bc01f3e70 ((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x1e6/0x5e0
[ 1969.597890] 3 locks held by kworker/0:1/7:
[ 1969.597960] 1 lock held by khungtaskd/643:
[ 1969.597962]  #0: ffffffffa624cb60 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.54+0x0/0x30
[ 1969.597982] 1 lock held by systemd-udevd/906:
[ 1969.597983]  #0: ffff9af984a1c218 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0x18/0x50

[ 1969.598010] =============================================

[ 1983.242512] random: fast init done
[ 2071.928411] sd 0:2:0:0: [sda] 1951399936 512-byte logical blocks: (999 GB/931 GiB)
[ 2071.928480] sd 0:2:2:0: [sdc] 1756889088 512-byte logical blocks: (900 GB/838 GiB)
[ 2071.928537] sd 0:2:1:0: [sdb] 285474816 512-byte logical blocks: (146 GB/136 GiB)
[ 2071.928580] sd 0:2:0:0: [sda] Write Protect is off
[ 2071.928625] sd 0:2:0:0: [sda] Mode Sense: 1f 00 00 08
[ 2071.928629] sd 0:2:2:0: [sdc] Write Protect is off
[ 2071.928669] sd 0:2:1:0: [sdb] Write Protect is off
[ 2071.928706] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 00 08
[ 2071.928844] sd 0:2:2:0: [sdc] Mode Sense: 1f 00 00 08
[ 2071.928848] sd 0:2:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA


================================

I am working on it and need some time for debugging. BTW, did anyone
try the "shared host tagset" patchset, with the .config shared by Qian,
on other adapters that are not really multiqueue at the HW level but
whose driver exposes multiple hardware queues (similar to megaraid_sas)?

Thanks,
Sumit
Qian Cai Nov. 7, 2020, 12:17 a.m. UTC | #12
On Sat, 2020-11-07 at 00:55 +0530, Sumit Saxena wrote:
> I am able to hit the boot hang and similar kind of stack traces as
> reported by Qian with shared .config on x86 machine.
> In my case the system boots after a hang of 40-45 mins. Qian, is it
> true for you as well ?

I don't know. I have never waited that long.
John Garry Nov. 9, 2020, 8:49 a.m. UTC | #13
On 07/11/2020 00:17, Qian Cai wrote:
> On Sat, 2020-11-07 at 00:55 +0530, Sumit Saxena wrote:
>> I am able to hit the boot hang and similar kind of stack traces as
>> reported by Qian with shared .config on x86 machine.
>> In my case the system boots after a hang of 40-45 mins. Qian, is it
>> true for you as well ?
> I don't know. I had never waited for that long.
>
> .
>

Hi Qian,

By chance, do you have an equivalent arm64 .config, enabling the same RH
config options?

I suppose I could try to do this myself, but an authentic version
would be nicer.

Thanks,
John
Qian Cai Nov. 9, 2020, 1:39 p.m. UTC | #14
On Mon, 2020-11-09 at 08:49 +0000, John Garry wrote:
> On 07/11/2020 00:17, Qian Cai wrote:
> > On Sat, 2020-11-07 at 00:55 +0530, Sumit Saxena wrote:
> > > I am able to hit the boot hang and similar kind of stack traces as
> > > reported by Qian with shared .config on x86 machine.
> > > In my case the system boots after a hang of 40-45 mins. Qian, is it
> > > true for you as well ?
> > I don't know. I had never waited for that long.
> >
> > .
> >
>
> Hi Qian,
>
> By chance do have an equivalent arm64 .config, enabling the same RH
> config options?
>
> I suppose I could try do this myself also, but an authentic version
> would be nicer.

The closest one I have here is:
https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

but it only selects the Thunder X2 platform, and you would need to manually
select CONFIG_MEGARAID_SAS=m to start with. None of the arm64 systems here
have megaraid_sas.
John Garry Nov. 9, 2020, 2:05 p.m. UTC | #15
On 09/11/2020 13:39, Qian Cai wrote:
>> I suppose I could try do this myself also, but an authentic version
>> would be nicer.
> The closest one I have here is:
> https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config
>
> but it only selects the Thunder X2 platform and needs to manually select
> CONFIG_MEGARAID_SAS=m to start with, but none of arm64 systems here have
> megaraid_sas.

Thanks, I'm confident I can fix it up to get it going on my Huawei arm64 
D06CS.

That board has a megaraid sas card. It also has hisi_sas HW, another
storage controller for which we enabled the same feature that is
causing the problem.

I'll report back when I can.

Thanks,
john
John Garry Nov. 10, 2020, 5:42 p.m. UTC | #16
On 09/11/2020 14:05, John Garry wrote:
> On 09/11/2020 13:39, Qian Cai wrote:
>>> I suppose I could try do this myself also, but an authentic version
>>> would be nicer.
>> The closest one I have here is:
>> https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config
>>
>> but it only selects the Thunder X2 platform and needs to manually select
>> CONFIG_MEGARAID_SAS=m to start with, but none of arm64 systems here have
>> megaraid_sas.
>
> Thanks, I'm confident I can fix it up to get it going on my Huawei arm64
> D06CS.
>
> So that board has a megaraid sas card. In addition, it also has hisi_sas
> HW, which is another storage controller which we enabled this same
> feature which is causing the problem.
>
> I'll report back when I can.

So I had to hack that arm64 config a bit to get it booting:
https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.10-megaraid-hang

Boot is OK on my board that has no megaraid sas card but does include
hisi_sas HW (which enables the equivalent option that is exposing the
problem).

But the board with the megaraid sas boots very slowly, specifically 
around the megaraid sas probe:

: ttyS0 at MMIO 0x3f00002f8 (irq = 17, base_baud = 115200) is a 16550A
[   50.023726][    T1] printk: console [ttyS0] enabled
[   50.412597][    T1] megasas: 07.714.04.00-rc1
[   50.436614][    T5] megaraid_sas 0000:08:00.0: FW now in Ready state
[   50.450079][    T5] megaraid_sas 0000:08:00.0: 63 bit DMA mask and 63 bit consistent mask
[   50.467811][    T5] megaraid_sas 0000:08:00.0: firmware supports msix         : (128)
[   50.845995][    T5] megaraid_sas 0000:08:00.0: requested/available msix 128/128
[   50.861476][    T5] megaraid_sas 0000:08:00.0: current msix/online cpus      : (128/128)
[   50.877616][    T5] megaraid_sas 0000:08:00.0: RDPQ mode     : (enabled)
[   50.891018][    T5] megaraid_sas 0000:08:00.0: Current firmware supports maximum commands: 4077       LDIO threshold: 0
[   51.262942][    T5] megaraid_sas 0000:08:00.0: Performance mode :Latency (latency index = 1)
[   51.280749][    T5] megaraid_sas 0000:08:00.0: FW supports sync cache         : Yes
[   51.295451][    T5] megaraid_sas 0000:08:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[   51.387474][    T5] megaraid_sas 0000:08:00.0: FW provided supportMaxExtLDs: 1       max_lds: 64
[   51.404931][    T5] megaraid_sas 0000:08:00.0: controller type : MR(2048MB)
[   51.419616][    T5] megaraid_sas 0000:08:00.0: Online Controller Reset(OCR)  : Enabled
[   51.436132][    T5] megaraid_sas 0000:08:00.0: Secure JBOD support : Yes
[   51.450265][    T5] megaraid_sas 0000:08:00.0: NVMe passthru support : Yes
[   51.464757][    T5] megaraid_sas 0000:08:00.0: FW provided TM TaskAbort/Reset timeout        : 6 secs/60 secs
[   51.484379][    T5] megaraid_sas 0000:08:00.0: JBOD sequence map support     : Yes
[   51.499607][    T5] megaraid_sas 0000:08:00.0: PCI Lane Margining support    : No
[   51.547610][    T5] megaraid_sas 0000:08:00.0: NVME page size : (4096)
[   51.608635][    T5] megaraid_sas 0000:08:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
[   51.630285][    T5] megaraid_sas 0000:08:00.0: INIT adapter done
[   51.649854][    T5] megaraid_sas 0000:08:00.0: pci id : (0x1000)/(0x0016)/(0x19e5)/(0xd215)
[   51.667873][    T5] megaraid_sas 0000:08:00.0: unevenspan support    : no
[   51.681646][    T5] megaraid_sas 0000:08:00.0: firmware crash dump   : no
[   51.695596][    T5] megaraid_sas 0000:08:00.0: JBOD sequence map : enabled
[   51.711521][    T5] megaraid_sas 0000:08:00.0: Max firmware commands: 4076 shared with nr_hw_queues = 127
[   51.733056][    T5] scsi host0: Avago SAS based MegaRAID driver
[   65.304363][    T5] scsi 0:0:0:0: Direct-Access     ATA      SAMSUNG MZ7KH1T9 404Q PQ: 0 ANSI: 6
[   65.392401][    T5] scsi 0:0:1:0: Direct-Access     ATA      SAMSUNG MZ7KH1T9 404Q PQ: 0 ANSI: 6
[   79.508307][    T5] scsi 0:0:65:0: Enclosure         HUAWEI Expander 12Gx16  131  PQ: 0 ANSI: 6
[  183.965109][   C14] random: fast init done

Notice the 14 and 104 second delays.

But it does boot fully and gets to the console. I'll wait for further
issues, which you guys seem to experience after a while.

Thanks,
John
Sumit Saxena Nov. 11, 2020, 7:27 a.m. UTC | #17
On Tue, Nov 10, 2020 at 11:12 PM John Garry <john.garry@huawei.com> wrote:
> So I had to hack that arm64 config a bit to get it booting:
> https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.10-megaraid-hang
>
> Boot is ok on my board without the megaraid sas card, but includes
> hisi_sas HW (which enables the equivalent option which is exposing the
> problem).
>
> But the board with the megaraid sas boots very slowly, specifically
> around the megaraid sas probe:
>
> : ttyS0 at MMIO 0x3f00002f8 (irq = 17, base_baud = 115200) is a 16550A
> [   50.023726][    T1] printk: console [ttyS0] enabled
> [   50.412597][    T1] megasas: 07.714.04.00-rc1
> [   50.436614][    T5] megaraid_sas 0000:08:00.0: FW now in Ready state
> [   50.450079][    T5] megaraid_sas 0000:08:00.0: 63 bit DMA mask and 63 bit consistent mask
> [   50.467811][    T5] megaraid_sas 0000:08:00.0: firmware supports msix         : (128)
> [   50.845995][    T5] megaraid_sas 0000:08:00.0: requested/available msix 128/128
> [   50.861476][    T5] megaraid_sas 0000:08:00.0: current msix/online cpus      : (128/128)
> [   50.877616][    T5] megaraid_sas 0000:08:00.0: RDPQ mode     : (enabled)
> [   50.891018][    T5] megaraid_sas 0000:08:00.0: Current firmware supports maximum commands: 4077       LDIO threshold: 0
> [   51.262942][    T5] megaraid_sas 0000:08:00.0: Performance mode :Latency (latency index = 1)
> [   51.280749][    T5] megaraid_sas 0000:08:00.0: FW supports sync cache         : Yes
> [   51.295451][    T5] megaraid_sas 0000:08:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
> [   51.387474][    T5] megaraid_sas 0000:08:00.0: FW provided supportMaxExtLDs: 1       max_lds: 64
> [   51.404931][    T5] megaraid_sas 0000:08:00.0: controller type : MR(2048MB)
> [   51.419616][    T5] megaraid_sas 0000:08:00.0: Online Controller Reset(OCR)  : Enabled
> [   51.436132][    T5] megaraid_sas 0000:08:00.0: Secure JBOD support : Yes
> [   51.450265][    T5] megaraid_sas 0000:08:00.0: NVMe passthru support : Yes
> [   51.464757][    T5] megaraid_sas 0000:08:00.0: FW provided TM TaskAbort/Reset timeout        : 6 secs/60 secs
> [   51.484379][    T5] megaraid_sas 0000:08:00.0: JBOD sequence map support     : Yes
> [   51.499607][    T5] megaraid_sas 0000:08:00.0: PCI Lane Margining support    : No
> [   51.547610][    T5] megaraid_sas 0000:08:00.0: NVME page size : (4096)
> [   51.608635][    T5] megaraid_sas 0000:08:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
> [   51.630285][    T5] megaraid_sas 0000:08:00.0: INIT adapter done
> [   51.649854][    T5] megaraid_sas 0000:08:00.0: pci id : (0x1000)/(0x0016)/(0x19e5)/(0xd215)
> [   51.667873][    T5] megaraid_sas 0000:08:00.0: unevenspan support    : no
> [   51.681646][    T5] megaraid_sas 0000:08:00.0: firmware crash dump   : no
> [   51.695596][    T5] megaraid_sas 0000:08:00.0: JBOD sequence map : enabled
> [   51.711521][    T5] megaraid_sas 0000:08:00.0: Max firmware commands: 4076 shared with nr_hw_queues = 127
> [   51.733056][    T5] scsi host0: Avago SAS based MegaRAID driver
> [   65.304363][    T5] scsi 0:0:0:0: Direct-Access     ATA      SAMSUNG MZ7KH1T9 404Q PQ: 0 ANSI: 6
> [   65.392401][    T5] scsi 0:0:1:0: Direct-Access     ATA      SAMSUNG MZ7KH1T9 404Q PQ: 0 ANSI: 6
> [   79.508307][    T5] scsi 0:0:65:0: Enclosure         HUAWEI Expander 12Gx16  131  PQ: 0 ANSI: 6
> [  183.965109][   C14] random: fast init done
>
> Notice the 14 and 104 second delays.
>
> But does boot fully to get to the console. I'll wait for further issues,
> which you guys seem to experience after a while.
>
> Thanks,
> John

"megaraid_sas" driver calls “scsi_scan_host()” to discover SCSI
devices. In this failure case, scsi_scan_host() takes a long time
to complete, which delays system boot.
With "host_tagset" enabled, scsi_scan_host() takes around 20 mins.
With "host_tagset" disabled, scsi_scan_host() takes up to 5-8 mins.

The scan time depends on the number of SCSI channels and the number of
devices per channel exposed by the LLD.
The megaraid_sas driver exposes 4 channels and 128 drives per channel.

Each target scan takes 2 seconds (in the failure case with host_tagset
enabled). That's why driver load completes after ~20 minutes. See
below:

[  299.725271] kobject: 'target18:0:96': free name
[  301.681267] kobject: 'target18:0:97' (00000000987c7f11): kobject_cleanup, parent 0000000000000000
[  301.681269] kobject: 'target18:0:97' (00000000987c7f11): calling ktype release
[  301.681273] kobject: 'target18:0:97': free name
[  303.575268] kobject: 'target18:0:98' (00000000a8c34149): kobject_cleanup, parent 0000000000000000
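
The arithmetic behind the ~20 minute figure can be sketched as follows. This is a minimal user-space illustration, not driver code: the function name estimate_scan_seconds is hypothetical, and the inputs are the numbers quoted above (4 channels, 128 targets per channel, roughly 2 seconds per target).

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of megaraid_sas or the SCSI midlayer):
 * estimate how long a synchronous scan takes when every channel/target
 * pair is probed one after another.
 */
unsigned long estimate_scan_seconds(unsigned int channels,
				    unsigned int targets_per_channel,
				    unsigned int seconds_per_target)
{
	/* The model only makes sense for a non-zero per-target cost. */
	assert(seconds_per_target > 0);

	/* A synchronous scan visits each target in turn, so the costs add up. */
	return (unsigned long)channels * targets_per_channel * seconds_per_target;
}
```

With the values above, estimate_scan_seconds(4, 128, 2) gives 1024 seconds, about 17 minutes, which is in line with the ~20 minutes observed.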

In Qian's kernel .config, async SCSI scan is disabled, so in the failure
case the SCSI scan is synchronous.
Below is the stack trace when scsi_scan_host() hangs:

[<0>] __wait_rcu_gp+0x134/0x170
[<0>] synchronize_rcu.part.80+0x53/0x60
[<0>] blk_free_flush_queue+0x12/0x30
[<0>] blk_mq_hw_sysfs_release+0x21/0x70
[<0>] kobject_release+0x46/0x150
[<0>] blk_mq_release+0xb4/0xf0
[<0>] blk_release_queue+0xc4/0x130
[<0>] kobject_release+0x46/0x150
[<0>] scsi_device_dev_release_usercontext+0x194/0x3f0
[<0>] execute_in_process_context+0x22/0xa0
[<0>] device_release+0x2e/0x80
[<0>] kobject_release+0x46/0x150
[<0>] scsi_alloc_sdev+0x2e7/0x310
[<0>] scsi_probe_and_add_lun+0x410/0xbd0
[<0>] __scsi_scan_target+0xf2/0x530
[<0>] scsi_scan_channel.part.7+0x51/0x70
[<0>] scsi_scan_host_selected+0xd4/0x140
[<0>] scsi_scan_host+0x198/0x1c0

This issue hits when lock-related debugging is enabled in the kernel config.
The following kernel .config parameters (possibly only a subset of this
list) are required to hit the issue:

CONFIG_PREEMPT_COUNT=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_TRACE_IRQFLAGS_NMI=y
CONFIG_DEBUG_KOBJECT=y
CONFIG_PROVE_RCU=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y

When scsi_scan_host() hangs, there are no outstanding IOs in the
megaraid_sas driver-firmware stack, as the SCSI "host_busy" counter and
the megaraid_sas driver's internal counter are both "0".

Key takeaways:
1. The issue is observed when lock-related debugging is enabled, so it
is seen in a debug environment.
2. The issue seems to be related to the generic shared "host_tagset" code
whenever some kind of kernel debugging is enabled. We do not see an
immediate reason to hide this issue by disabling the
"host_tagset" feature.

John,
The issue may hit on ARM platforms too, using Qian's .config file, with
other adapters (e.g. hisi_sas) as well. So I feel disabling "host_tagset"
in the megaraid_sas driver will not help. It requires debugging from the
perspective of the entire shared host tag feature, as the scsi_scan_host()
wait time gets worse when "host_tagset" is enabled. I am also debugging
in parallel, and if I find anything useful, I will share it.

Qian,
I need full dmesg logs from your setup with
megaraid_sas.host_tagset_enable=1 and
megaraid_sas.host_tagset_enable=0. Please let the boot run for a long
time. I just want to make sure that what you observe is the same as what
I see.

Thanks,
Sumit
Ming Lei Nov. 11, 2020, 9:27 a.m. UTC | #18
On Wed, Nov 11, 2020 at 12:57:59PM +0530, Sumit Saxena wrote:
> "megaraid_sas" driver calls “scsi_scan_host()” to discover SCSI
> devices. In this failure case, scsi_scan_host() is taking a long time
> to complete, hence causing delay in system boot.
> With "host_tagset" enabled, scsi_scan_host() takes around 20 mins.
> With "host_tagset" disabled, scsi_scan_host() takes upto 5-8 mins.
>
> The scan time depends upon the number of scsi channels and devices per
> scsi channel is exposed by LLD.
> megaraid_sas driver exposes 4 channels and 128 drives per channel.
>
> Each target scan takes 2 seconds (in case of failure with host_tagset
> enabled).  That's why driver load completes after ~20 minutes. See
> below:
>
> [  299.725271] kobject: 'target18:0:96': free name
> [  301.681267] kobject: 'target18:0:97' (00000000987c7f11): kobject_cleanup, parent 0000000000000000
> [  301.681269] kobject: 'target18:0:97' (00000000987c7f11): calling ktype release
> [  301.681273] kobject: 'target18:0:97': free name
> [  303.575268] kobject: 'target18:0:98' (00000000a8c34149): kobject_cleanup, parent 0000000000000000
>
> In Qian's kernel .config, async scsi scan is disabled so in failure
> case SCSI scan type is synchronous.
> Below is the stack trace when scsi_scan_host() hangs:
>
> [<0>] __wait_rcu_gp+0x134/0x170
> [<0>] synchronize_rcu.part.80+0x53/0x60
> [<0>] blk_free_flush_queue+0x12/0x30

Does this issue disappear after applying the following change?

diff --git a/block/blk-flush.c b/block/blk-flush.c
index e32958f0b687..b1fe6176d77f 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -469,9 +469,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
 	INIT_LIST_HEAD(&fq->flush_queue[1]);
 	INIT_LIST_HEAD(&fq->flush_data_in_flight);
 
-	lockdep_register_key(&fq->key);
-	lockdep_set_class(&fq->mq_flush_lock, &fq->key);
-
 	return fq;
 
  fail_rq:
@@ -486,7 +483,6 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
 	if (!fq)
 		return;
 
-	lockdep_unregister_key(&fq->key);
 	kfree(fq->flush_rq);
 	kfree(fq);
 }


Thanks, 
Ming
Sumit Saxena Nov. 11, 2020, 11:36 a.m. UTC | #19
> Can this issue disappear by applying the following change?

This change fixes the issue for me.

Qian,
Please try again after applying the change suggested by Ming.

Thanks,
Sumit
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index e32958f0b687..b1fe6176d77f 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -469,9 +469,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
>         INIT_LIST_HEAD(&fq->flush_queue[1]);
>         INIT_LIST_HEAD(&fq->flush_data_in_flight);
>
> -       lockdep_register_key(&fq->key);
> -       lockdep_set_class(&fq->mq_flush_lock, &fq->key);
> -
>         return fq;
>
>   fail_rq:
> @@ -486,7 +483,6 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
>         if (!fq)
>                 return;
>
> -       lockdep_unregister_key(&fq->key);
>         kfree(fq->flush_rq);
>         kfree(fq);
>  }
>
> Thanks,
> Ming
John Garry Nov. 11, 2020, 11:51 a.m. UTC | #20
>
> In Qian's kernel .config, async scsi scan is disabled so in failure
> case SCSI scan type is synchronous.
> Below is the stack trace when scsi_scan_host() hangs:
>
> [<0>] __wait_rcu_gp+0x134/0x170
> [<0>] synchronize_rcu.part.80+0x53/0x60
> [<0>] blk_free_flush_queue+0x12/0x30
> [<0>] blk_mq_hw_sysfs_release+0x21/0x70

This release runs once per blk_mq_hw_ctx.

> [<0>] kobject_release+0x46/0x150
> [<0>] blk_mq_release+0xb4/0xf0
> [<0>] blk_release_queue+0xc4/0x130
> [<0>] kobject_release+0x46/0x150
> [<0>] scsi_device_dev_release_usercontext+0x194/0x3f0
> [<0>] execute_in_process_context+0x22/0xa0
> [<0>] device_release+0x2e/0x80
> [<0>] kobject_release+0x46/0x150
> [<0>] scsi_alloc_sdev+0x2e7/0x310
> [<0>] scsi_probe_and_add_lun+0x410/0xbd0
> [<0>] __scsi_scan_target+0xf2/0x530
> [<0>] scsi_scan_channel.part.7+0x51/0x70
> [<0>] scsi_scan_host_selected+0xd4/0x140
> [<0>] scsi_scan_host+0x198/0x1c0
>
> This issue hits when lock related debugging is enabled in kernel config.
> kernel .config parameters(may be subset of this list) are required to
> hit the issue:
>
> CONFIG_PREEMPT_COUNT=y *
> CONFIG_UNINLINE_SPIN_UNLOCK=y *
> CONFIG_LOCK_STAT=y
> CONFIG_DEBUG_RT_MUTEXES=y *
> CONFIG_DEBUG_SPINLOCK=y *
> CONFIG_DEBUG_MUTEXES=y *
> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y *
> CONFIG_DEBUG_RWSEMS=y *
> CONFIG_DEBUG_LOCK_ALLOC=y *
> CONFIG_LOCKDEP=y *
> CONFIG_DEBUG_LOCKDEP=y
> CONFIG_TRACE_IRQFLAGS=y *
> CONFIG_TRACE_IRQFLAGS_NMI=y
> CONFIG_DEBUG_KOBJECT=y
> CONFIG_PROVE_RCU=y *
> CONFIG_PREEMPTIRQ_TRACEPOINTS=y *

(* marks the options that I enabled)

>
> When scsi_scan_host() hangs, there are no outstanding IOs with
> megaraid_sas driver-firmware stack as SCSI "host_busy" counter and
> megaraid_sas driver's internal counter are "0".
> Key takeaways:
> 1. Issue is observed when lock related debugging is enabled so issue
> is seen in debug environment.
> 2. Issue seems to be related to generic shared "host_tagset" code
> whenever some kind of kernel debugging is enabled. We do not see an
> immediate reason to hide this issue through disabling the
> "host_tagset" feature.
>
> John,
> Issue may hit on ARM platform too using Qian's .config file with other
> adapters (e.g. hisi_sas) as well. So I feel disabling “host_tagset” in
> megaraid_sas driver will not help.  It requires debugging from the
> “Entire Shared host tag feature” perspective as scsi_scan_host()
> waittime aggravates when "host_tagset" is enabled. Also, I am doing
> parallel debugging and if I find anything useful, I will share.

So isn't this then really related to how many HW queues we expose, which
just scales up the time? For megaraid sas, it goes from 1 to 128 on my
arm64 platform when host_tagset_enable=1.
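
For reference, the way the patch under discussion derives the exposed queue count can be sketched in plain C. This is a user-space model that only assumes what the megasas_io_attach() hunk of the patch shows; the struct queue_inputs and the function model_nr_hw_queues are hypothetical names, and the real driver reads these values from struct megasas_instance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields the patch consults. */
struct queue_inputs {
	bool is_mfi_series;            /* adapter_type == MFI_SERIES */
	bool host_tagset_enable;       /* module parameter */
	bool smp_affinity_enable;      /* managed interrupts in use */
	unsigned int msix_vectors;
	unsigned int low_latency_index_start;
};

/* Mirrors the patch: fall back to a single queue unless all conditions hold. */
unsigned int model_nr_hw_queues(const struct queue_inputs *in)
{
	assert(in != NULL);

	if (!in->is_mfi_series &&
	    in->msix_vectors > in->low_latency_index_start &&
	    in->host_tagset_enable &&
	    in->smp_affinity_enable)
		return in->msix_vectors - in->low_latency_index_start;

	return 1;
}
```

With the values from the boot log earlier in the thread (128 MSI-X vectors, latency index 1), the model returns 127, which matches the "nr_hw_queues = 127" line in that log. With host_tagset_enable=0 it returns 1.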

As a hack, I tried this (while keeping host_tagset_enable=1):

@@ -6162,11 +6168,15 @@ static int megasas_init_fw(struct megasas_instance *instance)
                else
                        instance->low_latency_index_start = 1;

-               num_msix_req = num_online_cpus() + instance->low_latency_index_start;
+               num_msix_req = 6 + instance->low_latency_index_start;

(6 is an arbitrary small number)

And the boot time is nearly the same as with host_tagset_enable=0.

For hisi_sas, the max HW queue count is only ever 16. In addition, we
don't scan each channel/id/lun for hisi_sas, as it has a scan handler.

>
> Qian,
> I need full dmesg logs from your setup with
> megaraid_sas.host_tagset_enable=1 and
> megaraid_sas.host_tagset_enable=0. Please wait for a long time. I just
> want to make sure that whatever you observe is the same as mine.

Thanks,
John
Qian Cai Nov. 11, 2020, 2:42 p.m. UTC | #21
On Wed, 2020-11-11 at 17:27 +0800, Ming Lei wrote:
> Can this issue disappear by applying the following change?

This makes the system boot again as well.

> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index e32958f0b687..b1fe6176d77f 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -469,9 +469,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
>  	INIT_LIST_HEAD(&fq->flush_queue[1]);
>  	INIT_LIST_HEAD(&fq->flush_data_in_flight);
>  
> -	lockdep_register_key(&fq->key);
> -	lockdep_set_class(&fq->mq_flush_lock, &fq->key);
> -
>  	return fq;
>  
>   fail_rq:
> @@ -486,7 +483,6 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
>  	if (!fq)
>  		return;
>  
> -	lockdep_unregister_key(&fq->key);
>  	kfree(fq->flush_rq);
>  	kfree(fq);
>  }
> 
> 
> Thanks, 
> Ming
Ming Lei Nov. 11, 2020, 3:04 p.m. UTC | #22
On Wed, Nov 11, 2020 at 09:42:17AM -0500, Qian Cai wrote:
> On Wed, 2020-11-11 at 17:27 +0800, Ming Lei wrote:
> > Can this issue disappear by applying the following change?
>
> This makes the system boot again as well.

OK, it actually isn't necessary to register a new lock key for each
hctx (blk_flush_queue) instance, and the current approach is really
overkill because there can be lots of hw queues in one system.
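
To see why a per-hctx key is costly, a rough cost model helps. The stack trace earlier in the thread shows blk_free_flush_queue() waiting in synchronize_rcu(), which is consistent with lockdep_unregister_key() waiting for an RCU grace period, so each released request queue pays about one grace period per hardware queue. The sketch below is a user-space model only; the function name model_release_cost_ms and the grace-period figure in the note after it are assumptions, not measurements.

```c
#include <assert.h>

/*
 * Hypothetical cost model: each hardware queue owns one flush queue, and
 * freeing a flush queue waits for one RCU grace period.
 */
unsigned long model_release_cost_ms(unsigned int nr_hw_queues,
				    unsigned int grace_period_ms)
{
	/* A request queue always has at least one hardware queue. */
	assert(nr_hw_queues > 0);

	/* The grace periods are waited for one after another, so they add up. */
	return (unsigned long)nr_hw_queues * grace_period_ms;
}
```

Assuming, purely for illustration, a 15 ms grace period, 127 hardware queues cost about 1.9 s per released request queue, while a single queue costs 15 ms. That is of the same order as the ~2 seconds per target reported earlier in the thread.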

The original lockdep warning can be avoided simply by setting one
nvme_loop-specific lock class. If nvme_loop is backed by another nvme_loop,
we can still avoid the warning by killing the direct end_io chain or by
assigning another lock class.

I will prepare a formal patch tomorrow.

Thanks,
Ming
Patch

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 861f7140f52e..6960922d0d7f 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -37,6 +37,7 @@ 
 #include <linux/poll.h>
 #include <linux/vmalloc.h>
 #include <linux/irq_poll.h>
+#include <linux/blk-mq-pci.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -113,6 +114,10 @@  unsigned int enable_sdev_max_qd;
 module_param(enable_sdev_max_qd, int, 0444);
 MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as can_queue. Default: 0");
 
+int host_tagset_enable = 1;
+module_param(host_tagset_enable, int, 0444);
+MODULE_PARM_DESC(host_tagset_enable, "Shared host tagset enable/disable Default: enable(1)");
+
 MODULE_LICENSE("GPL");
 MODULE_VERSION(MEGASAS_VERSION);
 MODULE_AUTHOR("megaraidlinux.pdl@broadcom.com");
@@ -3119,6 +3124,19 @@  megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev,
 	return 0;
 }
 
+static int megasas_map_queues(struct Scsi_Host *shost)
+{
+	struct megasas_instance *instance;
+
+	instance = (struct megasas_instance *)shost->hostdata;
+
+	if (shost->nr_hw_queues == 1)
+		return 0;
+
+	return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+			instance->pdev, instance->low_latency_index_start);
+}
+
 static void megasas_aen_polling(struct work_struct *work);
 
 /**
@@ -3427,6 +3445,7 @@  static struct scsi_host_template megasas_template = {
 	.eh_timed_out = megasas_reset_timer,
 	.shost_attrs = megaraid_host_attrs,
 	.bios_param = megasas_bios_param,
+	.map_queues = megasas_map_queues,
 	.change_queue_depth = scsi_change_queue_depth,
 	.max_segment_size = 0xffffffff,
 };
@@ -6808,6 +6827,26 @@  static int megasas_io_attach(struct megasas_instance *instance)
 	host->max_lun = MEGASAS_MAX_LUN;
 	host->max_cmd_len = 16;
 
+	/* Use shared host tagset only for fusion adaptors
+	 * if there are managed interrupts (smp affinity enabled case).
+	 * Single msix_vectors in kdump, so shared host tag is also disabled.
+	 */
+
+	host->host_tagset = 0;
+	host->nr_hw_queues = 1;
+
+	if ((instance->adapter_type != MFI_SERIES) &&
+		(instance->msix_vectors > instance->low_latency_index_start) &&
+		host_tagset_enable &&
+		instance->smp_affinity_enable) {
+		host->host_tagset = 1;
+		host->nr_hw_queues = instance->msix_vectors -
+			instance->low_latency_index_start;
+	}
+
+	dev_info(&instance->pdev->dev,
+		"Max firmware commands: %d shared with nr_hw_queues = %d\n",
+		instance->max_fw_cmds, host->nr_hw_queues);
 	/*
 	 * Notify the mid-layer about the new controller
 	 */
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 0824410f78f8..a4251121f173 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -359,24 +359,29 @@  megasas_get_msix_index(struct megasas_instance *instance,
 {
 	int sdev_busy;
 
-	/* nr_hw_queue = 1 for MegaRAID */
-	struct blk_mq_hw_ctx *hctx =
-		scmd->device->request_queue->queue_hw_ctx[0];
-
-	sdev_busy = atomic_read(&hctx->nr_active);
+	/* TBD - if sml remove device_busy in future, driver
+	 * should track counter in internal structure.
+	 */
+	sdev_busy = atomic_read(&scmd->device->device_busy);
 
 	if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
-	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
+	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
 					MR_HIGH_IOPS_BATCH_COUNT), instance->low_latency_index_start);
-	else if (instance->msix_load_balance)
+	} else if (instance->msix_load_balance) {
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			(mega_mod64(atomic64_add_return(1, &instance->total_io_count),
 				instance->msix_vectors));
-	else
+	} else if (instance->host->nr_hw_queues > 1) {
+		u32 tag = blk_mq_unique_tag(scmd->request);
+
+		cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(tag) +
+			instance->low_latency_index_start;
+	} else {
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			instance->reply_map[raw_smp_processor_id()];
+	}
 }
 
 /**
@@ -956,9 +961,6 @@  megasas_alloc_cmds_fusion(struct megasas_instance *instance)
 	if (megasas_alloc_cmdlist_fusion(instance))
 		goto fail_exit;
 
-	dev_info(&instance->pdev->dev, "Configured max firmware commands: %d\n",
-		 instance->max_fw_cmds);
-
 	/* The first 256 bytes (SMID 0) is not used. Don't add to the cmd list */
 	io_req_base = fusion->io_request_frames + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
 	io_req_base_phys = fusion->io_request_frames_phys + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
@@ -1102,8 +1104,9 @@  megasas_ioc_init_fusion(struct megasas_instance *instance)
 		MR_HIGH_IOPS_QUEUE_COUNT) && cur_intr_coalescing)
 		instance->perf_mode = MR_BALANCED_PERF_MODE;
 
-	dev_info(&instance->pdev->dev, "Performance mode :%s\n",
-		MEGASAS_PERF_MODE_2STR(instance->perf_mode));
+	dev_info(&instance->pdev->dev, "Performance mode :%s (latency index = %d)\n",
+		MEGASAS_PERF_MODE_2STR(instance->perf_mode),
+		instance->low_latency_index_start);
 
 	instance->fw_sync_cache_support = (scratch_pad_1 &
 		MR_CAN_HANDLE_SYNC_CACHE_OFFSET) ? 1 : 0;
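
The new nr_hw_queues branch in megasas_get_msix_index() can be modeled in user space as follows. The 16-bit split mirrors how blk_mq_unique_tag() packs the hardware queue index above the per-queue tag in include/linux/blk-mq.h; the MODEL_ and model_ prefixed names are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_UNIQUE_TAG_BITS 16u
#define MODEL_UNIQUE_TAG_MASK ((1u << MODEL_UNIQUE_TAG_BITS) - 1u)

/* Pack a hardware queue index and a per-queue tag into one 32-bit value. */
uint32_t model_unique_tag(uint32_t hwq, uint32_t tag)
{
	/* The queue index must fit in the upper 16 bits. */
	assert(hwq <= MODEL_UNIQUE_TAG_MASK);
	return (hwq << MODEL_UNIQUE_TAG_BITS) | (tag & MODEL_UNIQUE_TAG_MASK);
}

/* Recover the hardware queue index from the packed value. */
uint32_t model_unique_tag_to_hwq(uint32_t unique_tag)
{
	return unique_tag >> MODEL_UNIQUE_TAG_BITS;
}

/* The reply queue is the hardware queue shifted past the low-latency vectors. */
uint32_t model_msix_index(uint32_t unique_tag, uint32_t low_latency_index_start)
{
	return model_unique_tag_to_hwq(unique_tag) + low_latency_index_start;
}
```

For example, a request with tag 42 on hardware queue 5 packs to 0x0005002A, and with low_latency_index_start = 1 it is steered to MSI-X index 6. Hardware queue 0 maps to index 1, so the vectors below low_latency_index_start are never selected by this branch.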