From patchwork Mon Feb 24 03:29:42 2020
Date: Sun, 23 Feb 2020 19:29:42 -0800
From: Andrew Morton
To: longpeng2@huawei.com, mike.kravetz@oracle.com, mm-commits@vger.kernel.org,
    sean.j.christopherson@intel.com, stable@vger.kernel.org, willy@infradead.org
Subject: + mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch added to -mm tree

The patch titled
     Subject: mm/hugetlb.c: fix an addressing exception caused by huge_pte_offset()
has been added to the -mm tree.
Its filename is
     mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when
    testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Longpeng
Subject: mm/hugetlb.c: fix an addressing exception caused by huge_pte_offset()

Our machine encountered a panic (addressing exception) after running for a
long time.  The call trace is:

RIP: 0010:[] [] hugetlb_fault+0x307/0xbe0
RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [] ? unlock_page+0x2b/0x30
 [] ? hugetlb_fault+0x222/0xbe0
 [] follow_hugetlb_page+0x175/0x540
 [] ? cpumask_next_and+0x35/0x50
 [] __get_user_pages+0x2a0/0x7e0
 [] __get_user_pages_unlocked+0x15d/0x210
 [] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
 [] try_async_pf+0x6e/0x2a0 [kvm]
 [] tdp_page_fault+0x151/0x2d0 [kvm]
 [] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
 [] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
 [] kvm_mmu_page_fault+0x31/0x140 [kvm]
 [] handle_ept_violation+0x9e/0x170 [kvm_intel]
 [] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
 [] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
 [] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
 [] vcpu_enter_guest+0x7be/0x13a0 [kvm]
 [] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
 [] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
 [] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
 [] ? dequeue_signal+0x32/0x180
 [] ? do_sigtimedwait+0xcd/0x230
 [] do_vfs_ioctl+0x3f0/0x540
 [] SyS_ioctl+0xa1/0xc0
 [] system_call_fastpath+0x22/0x27

The kernel we used is older, but after digging into this problem we think
the latest kernel also has this bug.

For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
may return a wrong pmdp if there is a race.  Please look at the following
code snippet:

    ...
    pud = pud_offset(p4d, addr);
    if (sz != PUD_SIZE && pud_none(*pud))
        return NULL;
    /* hugepage or swap? */
    if (pud_huge(*pud) || !pud_present(*pud))
        return (pte_t *)pud;

    pmd = pmd_offset(pud, addr);
    if (sz != PMD_SIZE && pmd_none(*pmd))
        return NULL;
    /* hugepage or swap? */
    if (pmd_huge(*pmd) || !pmd_present(*pmd))
        return (pte_t *)pmd;
    ...

The following sequence would trigger this bug:

1. CPU0: sz = PUD_SIZE and *pud = 0, continue
2. CPU0: "pud_huge(*pud)" is false
3. CPU1: calls hugetlb_no_page() and sets *pud to xxxx8e7 (PRESENT)
4. CPU0: "!pud_present(*pud)" is false, continue
5. CPU0: pmd = pmd_offset(pud, addr), which may return a wrong pmdp

However, we want CPU0 to return NULL or pudp here.  We can avoid this race
by reading the pud only once.  What's more, we also use READ_ONCE to
access the entries for safety (i.e. to avoid compiler mischief).

Link: http://lkml.kernel.org/r/1582342427-230392-1-git-send-email-longpeng2@huawei.com
Signed-off-by: Longpeng
Cc: Matthew Wilcox
Cc: Sean Christopherson
Cc: Mike Kravetz
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton
---

 mm/hugetlb.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset
+++ a/mm/hugetlb.c
@@ -4910,28 +4910,30 @@ pte_t *huge_pte_offset(struct mm_struct
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
+	pud_t *pud, pud_entry;
+	pmd_t *pmd, pmd_entry;
 
 	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
+	if (!pgd_present(READ_ONCE(*pgd)))
 		return NULL;
 	p4d = p4d_offset(pgd, addr);
-	if (!p4d_present(*p4d))
+	if (!p4d_present(READ_ONCE(*p4d)))
 		return NULL;
 
 	pud = pud_offset(p4d, addr);
-	if (sz != PUD_SIZE && pud_none(*pud))
+	pud_entry = READ_ONCE(*pud);
+	if (sz != PUD_SIZE && pud_none(pud_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pud_huge(*pud) || !pud_present(*pud))
+	if (pud_huge(pud_entry) || !pud_present(pud_entry))
 		return (pte_t *)pud;
 
 	pmd = pmd_offset(pud, addr);
-	if (sz != PMD_SIZE && pmd_none(*pmd))
+	pmd_entry = READ_ONCE(*pmd);
+	if (sz != PMD_SIZE && pmd_none(pmd_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pmd_huge(*pmd) || !pmd_present(*pmd))
+	if (pmd_huge(pmd_entry) || !pmd_present(pmd_entry))
 		return (pte_t *)pmd;
 
 	return NULL;
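The heart of the fix is that each bare dereference of *pud is a separate
load, so two predicates in one condition can observe two different values
of the entry.  A minimal before/after sketch, condensed from the hunk
above (not a complete function):

	/* Racy: two independent loads of *pud.  CPU1 can install the
	 * entry between the pud_huge() load and the pud_present() load,
	 * letting the walker fall through to pmd_offset() on a huge pud. */
	if (pud_huge(*pud) || !pud_present(*pud))
		return (pte_t *)pud;

	/* Fixed: take one READ_ONCE() snapshot and test only the local
	 * copy, so every predicate sees the same value of the entry. */
	pud_t pud_entry = READ_ONCE(*pud);
	if (pud_huge(pud_entry) || !pud_present(pud_entry))
		return (pte_t *)pud;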
From patchwork Tue Feb 4 01:33:48 2020
Date: Mon, 03 Feb 2020 17:33:48 -0800
From: Andrew Morton
To: adobriyan@gmail.com, akpm@linux-foundation.org, bob.picco@oracle.com,
    dan.j.williams@intel.com, daniel.m.jordan@oracle.com, david@redhat.com,
    linux-mm@kvack.org, mhocko@kernel.org, mhocko@suse.com,
    mm-commits@vger.kernel.org, n-horiguchi@ah.jp.nec.com, osalvador@suse.de,
    pasha.tatashin@oracle.com, sfr@canb.auug.org.au, stable@vger.kernel.org,
    steven.sistare@oracle.com, torvalds@linux-foundation.org
Subject: [patch 02/67] mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section

From: David Hildenbrand
Subject: mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section

Patch series "mm: fix max_pfn not falling on section boundary", v2.

Playing with different memory sizes for an x86-64 guest, I discovered that
some memmaps (the highest section, if max_mem does not fall on the section
boundary) are marked as being valid and online, but contain garbage.  We
have to properly initialize these memmaps.

Looking at /proc/kpageflags and friends, I found some more issues,
partially related to this.

This patch (of 3):

If max_pfn is not aligned to a section boundary, we can easily run into
BUGs.  This can e.g. be triggered on x86-64 under QEMU by specifying a
memory size that is not a multiple of 128MB (e.g. 4097MB, but also
4160MB).  I was told that on real HW we can easily have this scenario
(esp. as it is one of the main reasons sub-section hotadd of devmem was
added).

The issue is that we have a valid memmap (pfn_valid()) for the whole
section, and the whole section will be marked "online".
pfn_to_online_page() will succeed, but the memmap contains garbage.
E.g. doing a "./page-types -r -a 0x144001" when QEMU was started with
"-m 4160M" (see tools/vm/page-types.c):

[  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
[  200.477500] #PF: supervisor read access in kernel mode
[  200.478334] #PF: error_code(0x0000) - not-present page
[  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
[  200.479557] Oops: 0000 [#4] SMP NOPTI
[  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93
[  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
[  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
[  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
[  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
[  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
[  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
[  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
[  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
[  200.488897] Call Trace:
[  200.489115]  kpageflags_read+0xe9/0x140
[  200.489447]  proc_reg_read+0x3c/0x60
[  200.489755]  vfs_read+0xc2/0x170
[  200.490037]  ksys_pread64+0x65/0xa0
[  200.490352]  do_syscall_64+0x5c/0xa0
[  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

But it can be triggered much easier via "cat /proc/kpageflags >
/dev/null" after cold/hot plugging a DIMM to such a system:

[root@localhost ~]# cat /proc/kpageflags > /dev/null
[  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
[  111.517907] #PF: supervisor read access in kernel mode
[  111.518333] #PF: error_code(0x0000) - not-present page
[  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0

This patch fixes that by at least zeroing out that memmap (so e.g.
page_to_pfn() will not crash).  Commit 907ec5fca3dc ("mm: zero remaining
unavailable struct pages") tried to fix a similar issue, but forgot to
consider this special case.

After this patch, there are still problems to solve.  E.g. not all of
these pages falling into a memory hole will actually get initialized
later and set PageReserved - they are only zeroed out - but at least the
immediate crashes are gone.  A follow-up patch will take care of this.
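To make the rounding concrete, here is a user-space sketch (a hedged
illustration, not kernel code: it assumes 4 KiB pages and 128 MiB memory
sections as on x86-64, and reimplements round_up() locally) that computes
which pfns the fix below now zeroes for the "-m 4160M" example above:

	/* section_round.c - pfn arithmetic for the "-m 4160M" example */
	#include <stdio.h>

	#define PAGE_SHIFT        12	/* 4 KiB pages (x86-64) */
	#define SECTION_SIZE_BITS 27	/* 128 MiB sections (x86-64) */
	#define PAGES_PER_SECTION (1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))

	/* same result as the kernel's round_up() for power-of-two alignment */
	static unsigned long round_up_pfn(unsigned long pfn, unsigned long align)
	{
		return (pfn + align - 1) & ~(align - 1);
	}

	int main(void)
	{
		unsigned long max_pfn = 4160UL << (20 - PAGE_SHIFT); /* 4160 MiB */
		unsigned long end = round_up_pfn(max_pfn, PAGES_PER_SECTION);

		printf("max_pfn             = %lu\n", max_pfn);       /* 1064960 */
		printf("section-aligned end = %lu\n", end);           /* 1081344 */
		printf("extra pfns zeroed   = %lu\n", end - max_pfn); /* 16384 */
		return 0;
	}

With these numbers, the last section spans pfns 1048576-1081343 while
max_pfn stops at 1064960; the trailing 16384 struct pages are exactly the
ones that used to be left holding garbage.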
Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: David Hildenbrand
Tested-by: Daniel Jordan
Cc: Naoya Horiguchi
Cc: Pavel Tatashin
Cc: Andrew Morton
Cc: Steven Sistare
Cc: Michal Hocko
Cc: Daniel Jordan
Cc: Bob Picco
Cc: Oscar Salvador
Cc: Alexey Dobriyan
Cc: Dan Williams
Cc: Michal Hocko
Cc: Stephen Rothwell
Cc: <stable@vger.kernel.org>	[4.15+]
Signed-off-by: Andrew Morton
---

 mm/page_alloc.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/mm/page_alloc.c~mm-fix-uninitialized-memmaps-on-a-partially-populated-last-section
+++ a/mm/page_alloc.c
@@ -6947,7 +6947,8 @@ static u64 zero_pfn_range(unsigned long
  * This function also addresses a similar issue where struct pages are left
  * uninitialized because the physical address range is not covered by
  * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=.
+ * layout is manually configured via memmap=, or when the highest physical
+ * address (max_pfn) does not end on a section boundary.
  */
 void __init zero_resv_unavail(void)
 {
@@ -6965,7 +6966,16 @@ void __init zero_resv_unavail(void)
 		pgcnt += zero_pfn_range(PFN_DOWN(next), PFN_UP(start));
 		next = end;
 	}
-	pgcnt += zero_pfn_range(PFN_DOWN(next), max_pfn);
+
+	/*
+	 * Early sections always have a fully populated memmap for the whole
+	 * section - see pfn_valid(). If the last section has holes at the
+	 * end and that section is marked "online", the memmap will be
+	 * considered initialized. Make sure that memmap has a well defined
+	 * state.
+	 */
+	pgcnt += zero_pfn_range(PFN_DOWN(next),
+				round_up(max_pfn, PAGES_PER_SECTION));
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
From patchwork Tue Feb 4 01:36:49 2020
Date: Mon, 03 Feb 2020 17:36:49 -0800
From: Andrew Morton
To: akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com, linux-mm@kvack.org,
    mm-commits@vger.kernel.org, mpe@ellerman.id.au, peterz@infradead.org,
    stable@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 49/67] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

From: Peter Zijlstra
Subject: mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

Architectures for which we have hardware walkers of the Linux page tables
should flush the TLB on mmu gather batch allocation failures and batch
flush.  Some architectures, like POWER, support multiple translation modes
(hash and radix), and in the case of POWER only the radix translation mode
needs the above TLBI.  This is because in hash translation mode the kernel
wants to avoid the extra flush, since there are no hardware walkers of the
Linux page tables.  With radix translation the hardware also walks the
Linux page tables, and with that the kernel needs to make sure to
TLB-invalidate the page-walk cache before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE").

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value of
tlb_needs_table_invalidate() is to always force an invalidate, and sparc
can avoid the table invalidate.  Hence we define
tlb_needs_table_invalidate() to false for the sparc architecture.
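Summarized in macro form, the policy this patch introduces looks like the
following (a condensed restatement of the hunks below, not additional
code; radix_enabled() is powerpc's existing helper for detecting radix
mode):

	/* include/asm-generic/tlb.h: conservative default - always invalidate */
	#define tlb_needs_table_invalidate() (true)

	/* arch/powerpc/include/asm/tlb.h: only radix has hardware walkers
	 * of the Linux page tables; hash does not */
	#define tlb_needs_table_invalidate() radix_enabled()

	/* arch/sparc/include/asm/tlb_64.h: hardware TLB fill never walks
	 * the Linux page tables */
	#define tlb_needs_table_invalidate() (false)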
Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Michael Ellerman	[powerpc]
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton
---

 arch/Kconfig                    |  3 ---
 arch/powerpc/Kconfig            |  1 -
 arch/powerpc/include/asm/tlb.h  | 11 +++++++++++
 arch/sparc/Kconfig              |  1 -
 arch/sparc/include/asm/tlb_64.h |  9 +++++++++
 include/asm-generic/tlb.h       | 22 +++++++++++++++-------
 mm/mmu_gather.c                 | 16 ++++++++--------
 7 files changed, 43 insertions(+), 20 deletions(-)

--- a/arch/Kconfig~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/Kconfig
@@ -396,9 +396,6 @@ config HAVE_ARCH_JUMP_LABEL_RELATIVE
 config HAVE_RCU_TABLE_FREE
 	bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-	bool
-
 config HAVE_MMU_GATHER_PAGE_SIZE
 	bool

--- a/arch/powerpc/include/asm/tlb.h~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/powerpc/include/asm/tlb.h
@@ -26,6 +26,17 @@
 #define tlb_flush tlb_flush
 extern void tlb_flush(struct mmu_gather *tlb);
 
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate() radix_enabled()
 
 /* Get the generic bits... */
 #include <asm-generic/tlb.h>

--- a/arch/powerpc/Kconfig~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/powerpc/Kconfig
@@ -223,7 +223,6 @@ config PPC
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE
-	select HAVE_RCU_TABLE_NO_INVALIDATE	if HAVE_RCU_TABLE_FREE
 	select HAVE_MMU_GATHER_PAGE_SIZE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN

--- a/arch/sparc/include/asm/tlb_64.h~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address)	do { } while (0)
 #define tlb_flush(tlb)	flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate() (false)
+#endif
+
 #include <asm-generic/tlb.h>
 
 #endif /* _SPARC64_TLB_H */

--- a/arch/sparc/Kconfig~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/sparc/Kconfig
@@ -65,7 +65,6 @@ config SPARC64
 	select HAVE_KRETPROBES
 	select HAVE_KPROBES
 	select HAVE_RCU_TABLE_FREE if SMP
-	select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_DYNAMIC_FTRACE

--- a/include/asm-generic/tlb.h~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/include/asm-generic/tlb.h
@@ -137,13 +137,6 @@
  * When used, an architecture is expected to provide __tlb_remove_table()
  * which does the actual freeing of these pages.
  *
- * HAVE_RCU_TABLE_NO_INVALIDATE
- *
- * This makes HAVE_RCU_TABLE_FREE avoid calling tlb_flush_mmu_tlbonly() before
- * freeing the page-table pages. This can be avoided if you use
- * HAVE_RCU_TABLE_FREE and your architecture does _NOT_ use the Linux
- * page-tables natively.
- *
  * MMU_GATHER_NO_RANGE
  *
  * Use this if your architecture lacks an efficient flush_tlb_range().
@@ -189,8 +182,23 @@ struct mmu_table_batch {
 
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_needs_table_invalidate
+#define tlb_needs_table_invalidate() (true)
 #endif
+#else
+
+#ifdef tlb_needs_table_invalidate
+#error tlb_needs_table_invalidate() requires HAVE_RCU_TABLE_FREE
+#endif
+
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
+
 #ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
 /*
  * If we can't allocate a page to make a big batch of page pointers

--- a/mm/mmu_gather.c~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/mm/mmu_gather.c
@@ -102,14 +102,14 @@ bool __tlb_remove_page_size(struct mmu_g
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
-	/*
-	 * Invalidate page-table caches used by hardware walkers. Then we still
-	 * need to RCU-sched wait while freeing the pages because software
-	 * walkers can still be in-flight.
-	 */
-	tlb_flush_mmu_tlbonly(tlb);
-#endif
+	if (tlb_needs_table_invalidate()) {
+		/*
+		 * Invalidate page-table caches used by hardware walkers. Then
+		 * we still need to RCU-sched wait while freeing the pages
+		 * because software walkers can still be in-flight.
+		 */
+		tlb_flush_mmu_tlbonly(tlb);
+	}
 }
 
 static void tlb_remove_table_smp_sync(void *arg)
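The #ifndef/#else/#error arrangement in asm-generic/tlb.h above is a plain
preprocessor-fallback pattern: an architecture may pre-define the
predicate before the generic header is included, otherwise the
conservative default (always invalidate) applies, and defining it without
HAVE_RCU_TABLE_FREE breaks the build.  A hypothetical user-space toy
(nothing here is kernel code; the names merely mirror the kernel's)
demonstrating the same mechanism - build it plain, or with
-DTOY_ARCH_SKIPS_INVALIDATE to emulate a sparc64-like opt-out:

	/* toy_tlbi.c - illustrates the #ifndef-default override pattern */
	#include <stdbool.h>
	#include <stdio.h>

	/* An "arch header" would pre-define this before the generic header: */
	#ifdef TOY_ARCH_SKIPS_INVALIDATE
	#define tlb_needs_table_invalidate() (false)
	#endif

	/* The "generic header" supplies the safe default: */
	#ifndef tlb_needs_table_invalidate
	#define tlb_needs_table_invalidate() (true)
	#endif

	static void tlb_table_invalidate(void)
	{
		if (tlb_needs_table_invalidate())
			puts("TLBI: flush page-walk caches before freeing page tables");
		else
			puts("skip TLBI: hardware never walks these page tables");
	}

	int main(void)
	{
		tlb_table_invalidate();
		return 0;
	}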