[5.4,146/314] habanalabs: increase timeout during reset


Message ID

20200623195345.818635556@linuxfoundation.org

State

New

Headers

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Omer Shpigelman <oshpigelman@habana.ai>,
	Oded Gabbay <oded.gabbay@gmail.com>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.4 146/314] habanalabs: increase timeout during reset
Date: Tue, 23 Jun 2020 21:55:41 +0200
Message-Id: <20200623195345.818635556@linuxfoundation.org>
In-Reply-To: <20200623195338.770401005@linuxfoundation.org>
References: <20200623195338.770401005@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: stable-owner@vger.kernel.org
Precedence: bulk

Commit Message

Greg Kroah-Hartman June 23, 2020, 7:55 p.m. UTC

From: Oded Gabbay <oded.gabbay@gmail.com>

[ Upstream commit 7a65ee046b2238e053f6ebb610e1a082cfc49490 ]

When doing training, the DL framework (e.g. TensorFlow) performs hundreds
of thousands of memory allocations and mappings. If the driver needs to
perform a hard reset during training, it kills the application and unmaps
all of those memory allocations. Unfortunately, because of the large
number of mappings, the driver cannot finish unmapping within the current
timeout (5 seconds). Therefore, increase the timeout to 30 seconds to
avoid a situation where the driver resets the device with active
mappings, which can sometimes cause a kernel bug.

Note that the reset will not necessarily take the full 30 seconds, because
the reset thread checks once per second whether the unmap operation is done.
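
The polling behaviour described above can be illustrated with a minimal
user-space sketch. This is not the driver's code: wait_for_unmap(),
unmap_done_fn, struct fake_unmap and fake_done() are hypothetical names
made up for the example, and only HL_PENDING_RESET_PER_SEC and its value
of 30 come from the patch. The sketch assumes one check per second with
an upper bound of HL_PENDING_RESET_PER_SEC checks, and returns as soon as
the check succeeds.

```c
#include <stdbool.h>
#include <stddef.h>

#define HL_PENDING_RESET_PER_SEC 30 /* value introduced by this patch */

/* Hypothetical predicate: "have all user mappings been released yet?" */
typedef bool (*unmap_done_fn)(void *ctx);

/*
 * Poll once per "second" until done() reports completion or the budget
 * of HL_PENDING_RESET_PER_SEC checks is used up.
 * Returns the number of one-second waits performed, or -1 on timeout.
 * sleep_one_sec may be NULL to skip real sleeping (handy for simulation).
 */
static int wait_for_unmap(unmap_done_fn done, void *ctx,
			  void (*sleep_one_sec)(void))
{
	int waited;

	for (waited = 0; waited < HL_PENDING_RESET_PER_SEC; waited++) {
		if (done(ctx))
			return waited; /* finished early, stop waiting */
		if (sleep_one_sec)
			sleep_one_sec();
	}

	/* One last check after the final wait. */
	return done(ctx) ? waited : -1;
}

/* Simulated workload (assumption): done after a fixed number of polls. */
struct fake_unmap {
	int polls_needed;
	int polls_seen;
};

static bool fake_done(void *ctx)
{
	struct fake_unmap *f = ctx;

	return f->polls_seen++ >= f->polls_needed;
}
```

With a simulated workload that needs 7 polls, wait_for_unmap() returns
after 7 waits rather than 30. A workload that never finishes makes it
return -1 once the 30-check budget is exhausted. Under the old value of
5, the 7-poll workload would already have timed out.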

Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/misc/habanalabs/habanalabs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 75862be53c60e..30addffd76f53 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -23,7 +23,7 @@ 
 
 #define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
 
-#define HL_PENDING_RESET_PER_SEC	5
+#define HL_PENDING_RESET_PER_SEC	30
 
 #define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
