From patchwork Wed Sep 27 15:49:27 2017
X-Patchwork-Submitter: Will Deacon
X-Patchwork-Id: 114374
From: Will Deacon
To: peterz@infradead.org, paulmck@linux.vnet.ibm.com,
	kirill.shutemov@linux.intel.com
Cc: linux-kernel@vger.kernel.org, ynorov@caviumnetworks.com,
	rruigrok@codeaurora.org, linux-arch@vger.kernel.org,
	akpm@linux-foundation.org, catalin.marinas@arm.com, Will Deacon
Subject: [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable
	code leading to crashes
Date: Wed, 27 Sep 2017 16:49:27 +0100
Message-Id: <1506527369-19535-1-git-send-email-will.deacon@arm.com>
X-Mailer: git-send-email 2.1.4

Hi,

We recently had a crash report[1] on arm64 that involved a bad dereference
in the page_vma_mapped code during ext4 writeback with THP active. I can
reproduce this on -rc2:

[ 254.032812] PC is at check_pte+0x20/0x170
[ 254.032948] LR is at page_vma_mapped_walk+0x2e0/0x540
[...]
[ 254.036114] Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
[ 254.036361] Call trace:
[ 254.038977] [] check_pte+0x20/0x170
[ 254.039137] [] page_vma_mapped_walk+0x2e0/0x540
[ 254.039332] [] page_mkclean_one+0xac/0x278
[ 254.039489] [] rmap_walk_file+0xf0/0x238
[ 254.039642] [] rmap_walk+0x64/0xa0
[ 254.039784] [] page_mkclean+0x90/0xa8
[ 254.040029] [] clear_page_dirty_for_io+0x84/0x2a8
[ 254.040311] [] mpage_submit_page+0x34/0x98
[ 254.040518] [] mpage_process_page_bufs+0x164/0x170
[ 254.040743] [] mpage_prepare_extent_to_map+0x134/0x2b8
[ 254.040969] [] ext4_writepages+0x484/0xe30
[ 254.041175] [] do_writepages+0x44/0xe8
[ 254.041372] [] __filemap_fdatawrite_range+0xbc/0x110
[ 254.041568] [] file_write_and_wait_range+0x48/0xd8
[ 254.041739] [] ext4_sync_file+0x80/0x4b8
[ 254.041907] [] vfs_fsync_range+0x64/0xc0
[ 254.042106] [] SyS_msync+0x194/0x1e8

After digging into the issue, I found that we appear to be racing with a
concurrent pmd update in page_vma_mapped_walk, presumably due to a THP
splitting operation.
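Before walking through the kernel code, here is a minimal standalone
sketch of the compiler behaviour at the heart of the problem (plain C
with made-up names; not the kernel sources): with plain dereferences,
nothing stops the compiler from satisfying a later "re-read" of a
location from a register it already holds, even when a volatile
(READ_ONCE-style) load of the same location sits in between.

	#include <stdint.h>

	/* Stand-in for READ_ONCE on a 64-bit scalar. */
	static inline uint64_t read_once_u64(const uint64_t *p)
	{
		return *(const volatile uint64_t *)p;
	}

	uint64_t walk(const uint64_t *pmdp)
	{
		uint64_t pmd = *pmdp;	/* plain load: may live in a register */

		if (pmd & 2)		/* stand-in for pmd_trans_huge() */
			return pmd;

		/* A fresh load, like check_pmd() gets via READ_ONCE... */
		uint64_t pmde = read_once_u64(pmdp);

		if (!(pmde & 1))	/* stand-in for pmd_present() */
			return 0;

		/*
		 * ...but this later plain dereference (think pte_offset_map())
		 * may legitimately be satisfied from the register still
		 * holding the first, stale value of *pmdp, rather than being
		 * reloaded from memory.
		 */
		return *pmdp & 0xfffffffff000ULL;	/* cf. pmd_page_paddr() */
	}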
Looking at the code there:

	pvmw->pmd = pmd_offset(pud, pvmw->address);
	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
		[...]
	} else {
		if (!check_pmd(pvmw))
			return false;
	}
	if (!map_pte(pvmw))
		goto next_pte;

what happens in the crashing scenario is that we see all zeroes for the
PMD in pmd_trans_huge(*pvmw->pmd), and so go to the 'else' case
(migration isn't enabled, so that test is removed at compile time).
check_pmd then does:

	pmde = READ_ONCE(*pvmw->pmd);
	return pmd_present(pmde) && !pmd_trans_huge(pmde);

and reads a valid table entry for the PMD, because the splitting has
completed by this point (i.e. the first dereference reads the value
written by pmdp_invalidate in the splitting code, whereas the second
dereference reads the value written by the subsequent pmd_populate). It
returns true, meaning we should descend to the PTE level in map_pte.
map_pte does:

	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);

which on arm64 (and this appears to be the same on x86) ends up computing:

	(pmd_page_paddr((*(pvmw->pmd))) + pte_index(pvmw->address) * sizeof(pte_t))

as part of its calculation. However, this is horribly broken: GCC inlines
everything and reuses the register it loaded for the initial
pmd_trans_huge check (when we read the value of zero), so we end up
calculating a junk pointer and crashing when we dereference it.
Disassembly at the end of this mail[2] for those who are curious.

The moral of the story is that read-after-read (same address) ordering
*only* applies if READ_ONCE is used consistently. This means we need to
fix page table dereferences in the core code as well as in the arch code
to avoid this problem. The two RFC patches in this series fix arm64
(a bigger fix than strictly necessary, since I clean things up too) and
page_vma_mapped_walk. For anyone unfamiliar with READ_ONCE, a simplified
sketch of it, together with a fixed version of the userspace sketch
above, is appended at the very end of this mail[3].

Comments welcome.

Will

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-September/532786.html

[2]	// page_vma_mapped_walk
	// pvmw->pmd = pmd_offset(pud, pvmw->address);
	ldr	x0, [x19, #24]			// pvmw->pmd

	// if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
	ldr	x1, [x0]			// *pvmw->pmd
	cbz	x1, ffff0000082336a0
	tbz	w1, #1, ffff000008233788	// pmd_trans_huge?

	// else if (!check_pmd(pvmw))
	ldr	x0, [x0]			// READ_ONCE in check_pmd
	tst	x0, x24				// pmd_present?
	b.eq	ffff000008233538		// b.none
	tbz	w0, #1, ffff000008233538	// pmd_trans_huge?

	// if (!map_pte(pvmw))
	ldr	x0, [x19, #16]			// pvmw->address

	// pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
	and	x1, x1, #0xfffffffff000		// Reusing the old value of *pvmw->pmd!!!
	[...]

--->8

Will Deacon (2):
  arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  mm: page_vma_mapped: Ensure pmd is loaded with READ_ONCE outside of lock

 arch/arm64/include/asm/hugetlb.h     |   2 +-
 arch/arm64/include/asm/kvm_mmu.h     |  18 +-
 arch/arm64/include/asm/mmu_context.h |   4 +-
 arch/arm64/include/asm/pgalloc.h     |  42 +++---
 arch/arm64/include/asm/pgtable.h     |  29 ++--
 arch/arm64/kernel/hibernate.c        | 148 +++++++++---------
 arch/arm64/mm/dump.c                 |  54 ++++---
 arch/arm64/mm/fault.c                |  44 +++---
 arch/arm64/mm/hugetlbpage.c          |  94 ++++++------
 arch/arm64/mm/kasan_init.c           |  62 ++++----
 arch/arm64/mm/mmu.c                  | 281 ++++++++++++++++++-----------
 arch/arm64/mm/pageattr.c             |  30 ++--
 mm/page_vma_mapped.c                 |  25 ++--
 13 files changed, 427 insertions(+), 406 deletions(-)

-- 
2.1.4

Tested-by: Yury Norov
Tested-by: Richard Ruigrok
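[3] A simplified sketch of what READ_ONCE boils down to for a scalar that
    fits in one access (the real definition in include/linux/compiler.h
    is more involved and also handles other sizes), plus a fixed version
    of the userspace sketch earlier in this mail. READ_ONCE_SKETCH and
    walk_fixed are made-up names; this illustrates the pattern and is not
    a quote of the patches:

	#include <stdint.h>

	/* Simplified READ_ONCE: the volatile access forces a real load
	 * and stops the compiler fusing it with neighbouring reads. */
	#define READ_ONCE_SKETCH(x)	(*(const volatile typeof(x) *)&(x))

	uint64_t walk_fixed(const uint64_t *pmdp)
	{
		/* Snapshot the entry exactly once... */
		uint64_t pmd = READ_ONCE_SKETCH(*pmdp);

		if (pmd & 2)		/* stand-in for pmd_trans_huge() */
			return pmd;
		if (!(pmd & 1))		/* stand-in for pmd_present() */
			return 0;

		/*
		 * ...and derive everything, including the address
		 * calculation, from that same snapshot, so every test and
		 * every dereference sees one consistent value.
		 */
		return pmd & 0xfffffffff000ULL;
	}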