[5.4,004/232] bpf: Fix deadlock with rq_lock in bpf_send_signal()

From: Yonghong Song <yhs@fb.com>

[ Upstream commit 1bc7896e9ef44fd77858b3ef0b8a6840be3a4494 ]

When experimenting with bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is very similar to the issue fixed by
commit eac9153f2b58 ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is
that the helper involved is bpf_send_signal() rather than bpf_get_stack().

The above deadlock is triggered with a perf_sw_event.
Similar to commit eac9153f2b58, the almost identical reproducer below
uses the tracepoint sched/sched_switch so the issue can be easily caught.
  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <time.h>     /* nanosleep(), struct timespec */
  #include <unistd.h>   /* usleep(), close() */

  #define THREAD_COUNT 1000
  char *filename;
  void *worker(void *p)
  {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
  }

  int main(int argc, char *argv[])
  {
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
  }
and the following commands:
  1. run `stress_test /bin/ls` in one window
  2. hack bcc trace.py with the following change:
#     --- a/tools/trace.py
#     +++ b/tools/trace.py
     @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
              __data.tgid = __tgid;
              __data.pid = __pid;
              bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
     +        bpf_send_signal(10);
      %s
      %s
              %s.perf_submit(%s, &__data, sizeof(__data));
  3. in a different window run
     ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix is to delay sending the signal if
irqs are disabled, to avoid deadlocks involving rq_lock.
With this change, my stress test above no longer causes a deadlock
in our production system.
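
The deferral idea can be illustrated with a small userspace model.
This is only a sketch under stated assumptions, not the kernel code:
sim_irqs_disabled, model_send_signal(), model_run_deferred() and
deferral_demo() are hypothetical names, the single pending slot and
the -1 "busy" return are assumptions of the model, and the log array
merely stands in for the lock-taking delivery path.

```c
/* Userspace model only: every name here is hypothetical, not kernel API. */
#include <assert.h>
#include <stdbool.h>

#define MAX_DELIVERED 8

static bool sim_irqs_disabled;        /* stands in for "irqs are disabled" */
static int pending_sig;               /* 0 means no deferred signal queued */
static int delivered[MAX_DELIVERED];  /* log of signals actually delivered */
static int ndelivered;

/* Stands in for the delivery path that would take locks. */
static void deliver_now(int sig)
{
        if (ndelivered < MAX_DELIVERED)
                delivered[ndelivered++] = sig;
}

/* Deliver at once when it is safe; otherwise queue the signal for later. */
static int model_send_signal(int sig)
{
        if (sim_irqs_disabled) {
                if (pending_sig)
                        return -1;    /* a deferred signal is already queued */
                pending_sig = sig;
                return 0;
        }
        deliver_now(sig);
        return 0;
}

/* Runs later, once the unsafe context has been left. */
static void model_run_deferred(void)
{
        if (pending_sig) {
                deliver_now(pending_sig);
                pending_sig = 0;
        }
}

/* Walks through one deferral; returns how many signals were delivered. */
static int deferral_demo(void)
{
        ndelivered = 0;
        pending_sig = 0;

        sim_irqs_disabled = true;
        assert(model_send_signal(10) == 0);
        assert(ndelivered == 0);              /* nothing delivered while unsafe */
        assert(model_send_signal(10) == -1);  /* the single slot is occupied */

        sim_irqs_disabled = false;
        model_run_deferred();
        assert(ndelivered == 1 && delivered[0] == 10);
        return ndelivered;
}
```

In the model, the signal requested while sim_irqs_disabled is set is
only delivered after the flag is cleared and model_run_deferred() runs,
so the delivery path never executes in the unsafe context.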

I also implemented a scaled-down version of the reproducer in the
selftest (a subsequent commit). With the latest bpf-next,
lockdep complains about the following potential deadlock.
  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...

  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

  Deadlock happens on CPU0 when it tries to acquire &sighand->siglock,
  which is already held by CPU1, while CPU1 tries to grab &rq->lock
  and cannot get it.

  This is not exactly the callstack in our production environment,
  but the symptom is similar: both locks are acquired with
  spin_lock_irqsave(), and both cases involve rq_lock. The fix to delay
  sending the signal when irqs are disabled also fixes this issue.
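
The circular dependency in the report above can be checked with a
toy lock-order graph. This is a simplified userspace sketch and not
how lockdep itself is implemented: the enum values, add_dependency()
and lock_order_demo() are hypothetical names chosen for illustration,
and the three edges are taken from the chain and scenario printed in
the report.

```c
/* Toy lock-order graph; all names are hypothetical, not kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { SIGLOCK, PI_LOCK, RQ_LOCK, NLOCKS };

/* edge[a][b] is true if lock b was acquired while lock a was held. */
static bool edge[NLOCKS][NLOCKS];

/* Depth-first search: can 'to' be reached from 'from' along edges? */
static bool reachable(int from, int to, bool *seen)
{
        if (from == to)
                return true;
        seen[from] = true;
        for (int n = 0; n < NLOCKS; n++)
                if (edge[from][n] && !seen[n] && reachable(n, to, seen))
                        return true;
        return false;
}

/*
 * Record "wanted acquired while held is held". Returns true if the
 * new edge closes a cycle, i.e. held was already reachable from wanted.
 */
static bool add_dependency(int held, int wanted)
{
        bool seen[NLOCKS] = { false };
        bool cycle = reachable(wanted, held, seen);

        edge[held][wanted] = true;
        return cycle;
}

/* Returns 1 if the order seen in the report forms a cycle, else 0. */
static int lock_order_demo(void)
{
        memset(edge, 0, sizeof(edge));

        /* Existing chain: siglock --> pi_lock --> rq->lock */
        assert(!add_dependency(SIGLOCK, PI_LOCK));
        assert(!add_dependency(PI_LOCK, RQ_LOCK));

        /* CPU0: siglock requested while rq->lock is held. */
        return add_dependency(RQ_LOCK, SIGLOCK) ? 1 : 0;
}
```

The last edge (rq->lock held, siglock wanted) is the one added by
sending the signal from the sched_switch path. In this toy graph, not
taking siglock while rq->lock is held means that edge is never
recorded and no cycle forms.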

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/trace/bpf_trace.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Message ID	20200416131317.122684915@linuxfoundation.org
State	New