[5.4,304/309] x86/apic/msi: Plug non-maskable MSI affinity race

From: Thomas Gleixner <tglx@linutronix.de>

commit 6f1a4891a5928a5969c87fa5a584844c983ec823 upstream.

Evan tracked down a subtle race between the update of the MSI message and
the device raising an interrupt internally on PCI devices which do not
support MSI masking. The update of the MSI message is non-atomic and
consists of either 2 or 3 sequential 32bit wide writes to the PCI config
space.

   - Write address low 32bits
   - Write address high 32bits (If supported by device)
   - Write data

When an interrupt is migrated then both address and data might change, so
the kernel attempts to mask the MSI interrupt first. But MSI masking is
optional, so there exist devices which do not provide it. That means that
if the device raises an interrupt internally between the writes then an MSI
message is sent which is built from half updated state.
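
The window can be illustrated with a small user space model. This is only
a sketch: struct msi_regs and update_with_interrupt() are hypothetical
names which do not exist in the kernel, and the "device" is simply a
snapshot of the registers taken after a given number of writes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the MSI message registers in PCI config space. */
struct msi_regs {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/*
 * Update the message with three sequential 32bit writes. Return the
 * message the device would send if it raised an interrupt after
 * 'fire_after' writes (0 = before the first write, 3 = after the last).
 */
static struct msi_regs update_with_interrupt(struct msi_regs *regs,
					     const struct msi_regs *new_msg,
					     int fire_after)
{
	struct msi_regs sent = *regs;	/* fire_after == 0: old message */

	assert(fire_after >= 0 && fire_after <= 3);

	regs->address_lo = new_msg->address_lo;
	if (fire_after == 1)
		sent = *regs;		/* new address, old data */

	regs->address_hi = new_msg->address_hi;
	if (fire_after == 2)
		sent = *regs;		/* still old data */

	regs->data = new_msg->data;
	if (fire_after == 3)
		sent = *regs;		/* fully updated message */

	return sent;
}
```

With fire_after set to 1 or 2 the returned message combines the new
address with the old data, which is the half updated state described
above.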

On x86 this can lead to spurious interrupts on the wrong interrupt
vector when the affinity setting changes both address and data. As a
consequence the device interrupt can be lost causing the device to
become stuck or malfunctioning.

Evan tried to handle that by disabling MSI across an MSI message
update. That's not feasible because disabling MSI has issues on its own:

 If MSI is disabled the PCI device is routing an interrupt to the legacy
 INTx mechanism. The INTx delivery can be disabled, but the disablement is
 not working on all devices.

 Some devices lose interrupts when both MSI and INTx delivery are disabled.

Another way to solve this would be to enforce the allocation of the same
vector on all CPUs in the system for this kind of screwed devices. That
could be done, but it would bring back the vector space exhaustion problems
which got solved a few years ago.

Fortunately the high address (if supported by the device) is only relevant
when X2APIC is enabled which implies interrupt remapping. In the interrupt
remapping case the affinity setting is happening at the interrupt remapping
unit and the PCI MSI message is programmed only once when the PCI device is
initialized.

That makes it possible to solve it with a two step update:

  1) Target the MSI msg to the new vector on the current target CPU

  2) Target the MSI msg to the new vector on the new target CPU

In both cases writing the MSI message is only changing a single 32bit word
which prevents the issue of inconsistency.
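
A minimal sketch of why each step touches only one word, assuming the
usual x86 MSI layout where the destination APIC ID lives in bits 19:12 of
the low address word and the vector in bits 7:0 of the data word. The
helper names compose_msg() and words_changed() are made up for this
illustration, and all other fields of the message are left out.

```c
#include <stdint.h>

#define MSI_ADDR_BASE		0xfee00000u
#define MSI_ADDR_DEST_SHIFT	12

/* Simplified message: only destination and vector are modelled. */
struct msi_msg {
	uint32_t address_lo;
	uint32_t data;
};

/* Build a message which targets 'vector' on the CPU with 'dest_apicid'. */
static struct msi_msg compose_msg(uint8_t dest_apicid, uint8_t vector)
{
	struct msi_msg msg = {
		.address_lo = MSI_ADDR_BASE |
			      ((uint32_t)dest_apicid << MSI_ADDR_DEST_SHIFT),
		.data = vector,
	};

	return msg;
}

/* Number of 32bit words which differ between two messages. */
static int words_changed(const struct msi_msg *a, const struct msi_msg *b)
{
	return (a->address_lo != b->address_lo) + (a->data != b->data);
}
```

Moving from (old CPU, old vector) to (old CPU, new vector) changes only
the data word, and moving from there to (new CPU, new vector) changes only
the low address word. Going directly from the old to the final message
would change both.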

After writing the final destination it is necessary to check whether the
device issued an interrupt while the intermediate state #1 (new vector,
current CPU) was in effect.

This is possible because the affinity change is always happening on the
current target CPU. The code runs with interrupts disabled, so the
interrupt can be detected by checking the IRR of the local APIC. If the
vector is pending in the IRR then the interrupt is retriggered on the new
target CPU by sending an IPI for the associated vector on the target CPU.
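
The check can be sketched against a simulated local APIC. The IRR is 256
bits wide and exposed as eight 32bit registers, so a vector maps to
register vector / 32 and bit vector % 32. struct fake_lapic and the
functions below are hypothetical stand-ins: setting the bit in the target
IRR takes the place of the IPI, and no real APIC access is performed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated local APIC: only the 256 bit IRR is modelled. */
struct fake_lapic {
	uint32_t irr[8];
};

/* Mark 'vector' as pending, as an incoming interrupt or IPI would. */
static void raise_vector(struct fake_lapic *apic, uint8_t vector)
{
	apic->irr[vector / 32] |= 1u << (vector % 32);
}

/* True if 'vector' is pending in the IRR of 'apic'. */
static bool vector_pending(const struct fake_lapic *apic, uint8_t vector)
{
	return (apic->irr[vector / 32] & (1u << (vector % 32))) != 0;
}

/*
 * Runs on the current target CPU after the final message was written.
 * If the new vector is pending locally, retrigger it on the new target.
 * Returns true when a retrigger was sent.
 */
static bool retrigger_if_pending(const struct fake_lapic *local,
				 struct fake_lapic *target, uint8_t vector)
{
	if (!vector_pending(local, vector))
		return false;

	raise_vector(target, vector);	/* stands in for the IPI */
	return true;
}
```

The model cannot tell which device set the IRR bit. That limitation is
what makes the retrigger potentially spurious.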

This can cause spurious interrupts on both the local and the new target
CPU.

 1) If the new vector is not in use on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then interrupt entry code will
    ignore that spurious interrupt. The vector is marked so that the
    'No irq handler for vector' warning is suppressed once.

 2) If the new vector is in use already on the local CPU then the IRR check
    might see a pending interrupt from the device which is using this
    vector. The IPI to the new target CPU will then invoke the handler of
    the device, which got the affinity change, even if that device did not
    issue an interrupt.

 3) If the new vector is in use already on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then the handler of the device which
    uses that vector on the local CPU will be invoked.

#1 is uninteresting and has no unintended side effects. #2 and #3 might
expose issues in device driver interrupt handlers which are not prepared to
handle a spurious interrupt correctly. This is not a regression, it's just
exposing something which was already broken as spurious interrupts can
happen for a lot of reasons and all driver handlers need to be able to deal
with them.
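
As an illustration of a handler which copes with spurious invocations,
here is a user space sketch. struct fake_dev, its pending_events field and
the IRQ_RESULT_* values are invented for this example and only mirror the
usual pattern of checking the device state before acting on it. It is not
taken from any driver.

```c
#include <stdint.h>

enum irq_result {
	IRQ_RESULT_NONE,	/* nothing to do, interrupt was spurious */
	IRQ_RESULT_HANDLED,	/* device had work, it was processed */
};

/* Hypothetical device state, standing in for a status register. */
struct fake_dev {
	uint32_t pending_events;
	unsigned int handled;
};

/*
 * Only act when the device really reports pending work. A spurious
 * invocation finds no events and returns without side effects.
 */
static enum irq_result fake_dev_irq(struct fake_dev *dev)
{
	uint32_t events = dev->pending_events;

	if (!events)
		return IRQ_RESULT_NONE;

	dev->pending_events = 0;	/* acknowledge what was seen */
	dev->handled++;
	return IRQ_RESULT_HANDLED;
}
```

A handler which instead assumes that every invocation means work is
available would misbehave in cases #2 and #3.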

Reported-by: Evan Green <evgreen@chromium.org>
Debugged-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/x86/include/asm/apic.h |    8 ++
 arch/x86/kernel/apic/msi.c  |  128 ++++++++++++++++++++++++++++++++++++++++++--
 include/linux/irq.h         |   18 ++++++
 include/linux/irqdomain.h   |    7 ++
 kernel/irq/debugfs.c        |    1 
 kernel/irq/msi.c            |    5 +
 6 files changed, 163 insertions(+), 4 deletions(-)

Message ID: 20200210122436.167042333@linuxfoundation.org
State: Superseded
Date: Mon, 10 Feb 2020 04:34:20 -0800