[5.15,172/179] io_uring: fix soft lockup when call __io_remove_buffers

From: Ye Bin <yebin10@huawei.com>

commit 1d0254e6b47e73222fd3d6ae95cccbaafe5b3ecf upstream.

I hit the following issue:
[ 567.094140] __io_remove_buffers: [1]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680
[  594.360799] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
[  594.364987] Modules linked in:
[  594.365405] irq event stamp: 604180238
[  594.365906] hardirqs last  enabled at (604180237): [<ffffffff93fec9bd>] _raw_spin_unlock_irqrestore+0x2d/0x50
[  594.367181] hardirqs last disabled at (604180238): [<ffffffff93fbbadb>] sysvec_apic_timer_interrupt+0xb/0xc0
[  594.368420] softirqs last  enabled at (569080666): [<ffffffff94200654>] __do_softirq+0x654/0xa9e
[  594.369551] softirqs last disabled at (569080575): [<ffffffff913e1d6a>] irq_exit_rcu+0x1ca/0x250
[  594.370692] CPU: 2 PID: 108 Comm: kworker/u32:5 Tainted: G            L    5.15.0-next-20211112+ #88
[  594.371891] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[  594.373604] Workqueue: events_unbound io_ring_exit_work
[  594.374303] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50
[  594.375037] Code: 48 83 c7 18 53 48 89 f3 48 8b 74 24 10 e8 55 f5 55 fd 48 89 ef e8 ed a7 56 fd 80 e7 02 74 06 e8 43 13 7b fd fb bf 01 00 00 00 <e8> f8 78 474
[  594.377433] RSP: 0018:ffff888101587a70 EFLAGS: 00000202
[  594.378120] RAX: 0000000024030f0d RBX: 0000000000000246 RCX: 1ffffffff2f09106
[  594.379053] RDX: 0000000000000000 RSI: ffffffff9449f0e0 RDI: 0000000000000001
[  594.379991] RBP: ffffffff9586cdc0 R08: 0000000000000001 R09: fffffbfff2effcab
[  594.380923] R10: ffffffff977fe557 R11: fffffbfff2effcaa R12: ffff8881b8f3def0
[  594.381858] R13: 0000000000000246 R14: ffff888153a8b070 R15: 0000000000000000
[  594.382787] FS:  0000000000000000(0000) GS:ffff888399c00000(0000) knlGS:0000000000000000
[  594.383851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  594.384602] CR2: 00007fcbe71d2000 CR3: 00000000b4216000 CR4: 00000000000006e0
[  594.385540] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  594.386474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  594.387403] Call Trace:
[  594.387738]  <TASK>
[  594.388042]  find_and_remove_object+0x118/0x160
[  594.389321]  delete_object_full+0xc/0x20
[  594.389852]  kfree+0x193/0x470
[  594.390275]  __io_remove_buffers.part.0+0xed/0x147
[  594.390931]  io_ring_ctx_free+0x342/0x6a2
[  594.392159]  io_ring_exit_work+0x41e/0x486
[  594.396419]  process_one_work+0x906/0x15a0
[  594.399185]  worker_thread+0x8b/0xd80
[  594.400259]  kthread+0x3bf/0x4a0
[  594.401847]  ret_from_fork+0x22/0x30
[  594.402343]  </TASK>

Message from syslogd@localhost at Nov 13 09:09:54 ...
kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
[  596.793660] __io_remove_buffers: [2099199]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680

The issue can be reproduced with the following syzkaller program:
r0 = syz_io_uring_setup(0x401, &(0x7f0000000300), &(0x7f0000003000/0x2000)=nil, &(0x7f0000ff8000/0x4000)=nil, &(0x7f0000000280)=<r1=>0x0, &(0x7f0000000380)=<r2=>0x0)
sendmsg$ETHTOOL_MSG_FEATURES_SET(0xffffffffffffffff, &(0x7f0000003080)={0x0, 0x0, &(0x7f0000003040)={&(0x7f0000000040)=ANY=[], 0x18}}, 0x0)
syz_io_uring_submit(r1, r2, &(0x7f0000000240)=@IORING_OP_PROVIDE_BUFFERS={0x1f, 0x5, 0x0, 0x401, 0x1, 0x0, 0x100, 0x0, 0x1, {0xfffd}}, 0x0)
io_uring_enter(r0, 0x3a2d, 0x0, 0x0, 0x0, 0x0)

The cause is that 'buf->list' has roughly 2,100,000 nodes (2099199 in the
log above). Freeing all of them in a single loop without yielding occupies
the CPU long enough to trigger the soft lockup.
To fix this, add a schedule point inside the while loop in
'__io_remove_buffers'.
After adding the schedule point, a regression run produces the following data:
[  240.141864] __io_remove_buffers: [1]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
[  268.408260] __io_remove_buffers: [1]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
[  275.899234] __io_remove_buffers: [2099199]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
[  296.741404] __io_remove_buffers: [1]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
[  305.090059] __io_remove_buffers: [2099199]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
[  325.415746] __io_remove_buffers: [1]start ctx=0xffff8881b92d1000 bgid=65533 buf=0xffff8881a17d8f00
[  333.160318] __io_remove_buffers: [2099199]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
...
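
The idea of a schedule point inside a long free loop can be sketched in
userspace. This is only an illustration under stated assumptions, not the
kernel change itself: 'struct demo_buf', 'demo_build' and
'demo_remove_buffers' are hypothetical names, and sched_yield() stands in
for the kernel's cond_resched().

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer node; not the kernel's struct io_buffer. */
struct demo_buf {
	struct demo_buf *next;
};

/* Build a chain of n nodes. Returns NULL if n == 0 or an allocation fails. */
static struct demo_buf *demo_build(size_t n)
{
	struct demo_buf *head = NULL;

	for (size_t i = 0; i < n; i++) {
		struct demo_buf *b = malloc(sizeof(*b));

		if (!b) {
			/* Undo the partial chain so nothing leaks. */
			while (head) {
				struct demo_buf *nxt = head->next;

				free(head);
				head = nxt;
			}
			return NULL;
		}
		b->next = head;
		head = b;
	}
	return head;
}

/*
 * Free at most nbufs nodes from *head and return how many were freed.
 * The yield sits inside the loop, so a very long chain cannot keep the
 * CPU for the whole walk.
 */
static size_t demo_remove_buffers(struct demo_buf **head, size_t nbufs)
{
	size_t freed = 0;

	assert(head != NULL);
	while (*head && freed < nbufs) {
		struct demo_buf *nxt = (*head)->next;

		free(*head);
		*head = nxt;
		freed++;
		sched_yield();	/* userspace stand-in for cond_resched() */
	}
	return freed;
}
```

Calling demo_remove_buffers(&head, (size_t)-1) drains the whole chain. The
point of the sketch is where the yield is placed: per node inside the loop,
rather than once after the whole chain has been freed.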

Fixes: 8bab4c09f24e ("io_uring: allow conditional reschedule for intensive iterators")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211122024737.2198530-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/io_uring.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Message ID	20211129181724.597416357@linuxfoundation.org
Return-Path: <stable-owner@kernel.org>
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>, stable@vger.kernel.org, Ye Bin <yebin10@huawei.com>, Jens Axboe <axboe@kernel.dk>
Subject: [PATCH 5.15 172/179] io_uring: fix soft lockup when call __io_remove_buffers
Date: Mon, 29 Nov 2021 19:19:26 +0100
Message-Id: <20211129181724.597416357@linuxfoundation.org>
In-Reply-To: <20211129181718.913038547@linuxfoundation.org>
References: <20211129181718.913038547@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk