[5.4,34/85] mm/khugepaged: fix filemap page_to_pgoff(page) != offset

From: Hugh Dickins <hughd@google.com>

commit 033b5d77551167f8c24ca862ce83d3e0745f9245 upstream.

There have been elusive reports of filemap_fault() hitting its
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
with CONFIG_READ_ONLY_THP_FOR_FS=y.

Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
CONFIG_NUMA not set, and he has analyzed it down to how khugepaged
without NUMA reuses the same huge page after collapse_file() failed
(whereas NUMA targets its allocation to the respective node each time).
And most of us were usually testing with CONFIG_NUMA=y kernels.

collapse_file(old start)
  new_page = khugepaged_alloc_page(hpage)
  __SetPageLocked(new_page)
  new_page->index = start // hpage->index=old offset
  new_page->mapping = mapping
  xas_store(&xas, new_page)

                          filemap_fault
                            page = find_get_page(mapping, offset)
                            // if offset falls inside hpage then
                            // compound_head(page) == hpage
                            lock_page_maybe_drop_mmap()
                              __lock_page(page)

  // collapse fails
  xas_store(&xas, old page)
  new_page->mapping = NULL
  unlock_page(new_page)

collapse_file(new start)
  new_page = khugepaged_alloc_page(hpage)
  __SetPageLocked(new_page)
  new_page->index = start // hpage->index=new offset
  new_page->mapping = mapping // mapping becomes valid again

                            // since compound_head(page) == hpage
                            // page_to_pgoff(page) got changed
                            VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
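
The interleaving above can be replayed deterministically in a small
userspace model.  This is only a sketch: every model_* name, the struct
layouts and the chosen offsets are hypothetical stand-ins, not kernel
code, and locking and refcounting are left out.  It assumes that
page_to_pgoff() of a subpage is the head page's index plus the
subpage's position, so changing hpage->index changes the answer for
every subpage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct mapping { int id; };

struct page {
	struct mapping *mapping;	/* NULL while not in page cache */
	unsigned long index;		/* file offset of the head page */
};

#define MODEL_HPAGE_NR 512UL	/* subpages per huge page (assumed 2MiB/4KiB) */

/* Model of page_to_pgoff() for a subpage: head index + position in hpage. */
static unsigned long model_page_to_pgoff(const struct page *head,
					 unsigned long subpage)
{
	return head->index + subpage;
}

/* Model of collapse_file() exposing the reused hpage at 'start'. */
static void model_collapse_publish(struct page *hpage, struct mapping *m,
				   unsigned long start)
{
	hpage->index = start;
	hpage->mapping = m;
}

/* Model of the failure path: revoked from page cache, kept for reuse. */
static void model_collapse_fail(struct page *hpage)
{
	hpage->mapping = NULL;
}

/* Replays the race; returns 1 if the fault would trip the offset check. */
static int model_replay_race(void)
{
	struct mapping m = { 1 };
	struct page hpage = { NULL, 0 };
	unsigned long old_start = 0, new_start = 1024, offset = 5;
	unsigned long subpage;

	/* collapse_file(old start) */
	model_collapse_publish(&hpage, &m, old_start);

	/* filemap_fault: the lookup at 'offset' lands inside hpage */
	subpage = offset - hpage.index;
	assert(subpage < MODEL_HPAGE_NR);

	/* collapse fails, then collapse_file(new start) reuses hpage */
	model_collapse_fail(&hpage);
	model_collapse_publish(&hpage, &m, new_start);

	/* fault resumes: mapping looks valid again, but the offset moved */
	return hpage.mapping == &m &&
	       model_page_to_pgoff(&hpage, subpage) != offset;
}
```

In this model the mapping check passes because the same mapping was
stored again, which is why only the offset check notices the reuse.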

An initial patch replaced __SetPageLocked() by lock_page(), which did
fix the race which Suren illustrates above.  But testing showed that it's
not good enough: if the racing task's __lock_page() gets delayed long
after its find_get_page(), then it may follow collapse_file(new start)'s
successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.

It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
check and retry (as is done for mapping), with similar relaxations in
find_lock_entry() and pagecache_get_page(): but it's not obvious what
else might get caught out; and khugepaged non-NUMA appears to be unique
in exposing a page to page cache, then revoking, without going through
a full cycle of freeing before reuse.

Instead, make non-NUMA khugepaged_prealloc_page() release the old page
if anyone else has a reference to it (1% of cases when I tested).
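
As a rough userspace illustration of that rule (not the patch body,
which is not reproduced in this message): the model below keeps a
cached huge page only while the caller holds the sole reference, and
otherwise drops it and allocates a fresh one.  All model_* names and
the plain int refcount are hypothetical, and the use of "count > 1" as
the test for "anyone else has a reference" is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted huge page. */
struct page { int refcount; };

static int model_page_count(const struct page *p)
{
	return p->refcount;
}

/* Drop one reference; free the page when the last one goes away. */
static void model_put_page(struct page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		free(p);
}

/* Allocate a page holding a single reference for the caller. */
static struct page *model_alloc_hpage(void)
{
	struct page *p = malloc(sizeof(*p));

	if (p)
		p->refcount = 1;
	return p;
}

/*
 * Model of the non-NUMA preallocation step.  If a racing lookup may
 * still hold the previously exposed page, give up our reference rather
 * than reuse it at a different offset.  Returns 1 if *hpage is usable.
 */
static int model_prealloc_page(struct page **hpage)
{
	if (*hpage && model_page_count(*hpage) > 1) {
		model_put_page(*hpage);
		*hpage = NULL;
	}
	if (!*hpage)
		*hpage = model_alloc_hpage();
	return *hpage != NULL;
}
```

The racing task then finishes with a page that is never re-exposed at
another offset, and the page is freed through the normal path once
that task drops its reference.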

Although never reported on huge tmpfs, I believe its find_lock_entry()
has been at similar risk; but huge tmpfs does not rely on khugepaged
for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.

Reported-by: Denis Lisov <dennis.lissov@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
Reported-and-analyzed-by: Suren Baghdasaryan <surenb@google.com>
Fixes: 87c460a0bded ("mm/khugepaged: collapse_shmem() without freezing new_page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v4.9+
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/khugepaged.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Message ID	20201012132634.499402561@linuxfoundation.org
State	New
Date	Mon, 12 Oct 2020 15:26:57 +0200