mbox series

[v4,0/5] mm: Fix pfn_to_online_page() with respect to ZONE_DEVICE

Message ID 161058499000.1840162.702316708443239771.stgit@dwillia2-desk3.amr.corp.intel.com
Headers show
Series mm: Fix pfn_to_online_page() with respect to ZONE_DEVICE | expand

Message

Dan Williams Jan. 14, 2021, 12:43 a.m. UTC
Changes since v3 [1]:
- Switch to "if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) &&
  !pfn_valid(pfn))" (David)
- Finish collecting reviewed-bys across all patches in the series
- Drop the libnvdimm fixup, to be merged through nvdimm.git not -mm

[1]: http://lore.kernel.org/r/161052331545.1805594.2356512831689786960.stgit@dwillia2-desk3.amr.corp.intel.com

---

Andrew,

All patches in this series have been reviewed and the kbuild-robot
reports a build-success over 172 configs. They pass an updated version
of the nvdimm unit tests to exercise corner cases of
pfn_to_online_page() and get_dev_pagemap() [2], and apply cleanly to
current -next.

Please apply, thanks.

[2]: http://lore.kernel.org/r/161052209289.1804207.11599120961607513911.stgit@dwillia2-desk3.amr.corp.intel.com

---

Michal reminds that the discussion about how to ensure pfn-walkers do
not get confused by ZONE_DEVICE pages never resolved. A pfn-walker that
uses pfn_to_online_page() may inadvertently translate a pfn as online
and in the page allocator, when it is offline managed by a ZONE_DEVICE
mapping (details in Patch 3: ("mm: Teach pfn_to_online_page() about
ZONE_DEVICE section collisions")).

The 2 proposals under consideration are teach pfn_to_online_page() to be
precise in the presence of mixed-zone sections, or teach the memory-add
code to drop the System RAM associated with ZONE_DEVICE collisions. In
order to not regress memory capacity by a few 10s to 100s of MiB the
approach taken in this set is to add precision to pfn_to_online_page().

In the course of validating pfn_to_online_page() a couple other fixes
fell out:

1/ soft_offline_page() fails to drop the reference taken in the
   madvise(..., MADV_SOFT_OFFLINE) case.

2/ memory_failure() uses get_dev_pagemap() to lookup ZONE_DEVICE pages,
   however that mapping may contain data pages and metadata raw pfns.
   Introduce pgmap_pfn_valid() to delineate the 2 types and fail the
   handling of raw metadata pfns.

---

Dan Williams (5):
      mm: Move pfn_to_online_page() out of line
      mm: Teach pfn_to_online_page() to consider subsection validity
      mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions
      mm: Fix page reference leak in soft_offline_page()
      mm: Fix memory_failure() handling of dax-namespace metadata


 include/linux/memory_hotplug.h |   17 +---------
 include/linux/memremap.h       |    6 +++
 include/linux/mmzone.h         |   22 +++++++++----
 mm/memory-failure.c            |   26 +++++++++++++--
 mm/memory_hotplug.c            |   69 ++++++++++++++++++++++++++++++++++++++++
 mm/memremap.c                  |   15 +++++++++
 6 files changed, 128 insertions(+), 27 deletions(-)

Comments

Oscar Salvador Jan. 14, 2021, 6:30 a.m. UTC | #1
On 2021-01-14 07:18, Dan Williams wrote:
> To be honest I dislike the entire flags based scheme for communicating
> the fact that page reference obtained by madvise needs to be dropped
> later. I'd rather pass a non-NULL 'struct page *' than redo
> pfn_to_page() conversions in the leaf functions, but that's a much
> larger change.

We tried to remove that flag in the past but for different reasons.
I will have another look to see what can be done.

Thanks
HORIGUCHI NAOYA(堀口 直也) Jan. 14, 2021, 7:10 a.m. UTC | #2
On Wed, Jan 13, 2021 at 10:18:09PM -0800, Dan Williams wrote:
> On Wed, Jan 13, 2021 at 5:50 PM HORIGUCHI NAOYA(堀口 直也)
> <naoya.horiguchi@nec.com> wrote:
> >
> > On Wed, Jan 13, 2021 at 04:43:32PM -0800, Dan Williams wrote:
> > > The conversion to move pfn_to_online_page() internal to
> > > soft_offline_page() missed that the get_user_pages() reference taken by
> > > the madvise() path needs to be dropped when pfn_to_online_page() fails.
> > > Note the direct sysfs-path to soft_offline_page() does not perform a
> > > get_user_pages() lookup.
> > >
> > > When soft_offline_page() is handed a pfn_valid() &&
> > > !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to
> > > a leaked reference.
> > >
> > > Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > Reviewed-by: David Hildenbrand <david@redhat.com>
> > > Reviewed-by: Oscar Salvador <osalvador@suse.de>
> > > Cc: <stable@vger.kernel.org>
> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >
> > I'm OK if we don't have any other better approach, but the proposed changes
> > make code a little messy, and I feel that get_user_pages() might be the
> > right place to fix. Is get_user_pages() expected to return struct page with
> > holding refcount for offline valid pages?  I thought that such pages are
> > only used by drivers for dax-devices, but that might be wrong. Can I ask for
> > a little more explanation from this perspective?
> 
> The motivation for ZONE_DEVICE is to allow get_user_pages() for "offline" pfns.

I missed this point, thank you.

> 
> soft_offline_page() wants to only operate on "online" pfns.

Right.

> 
> get_user_pages() on a dax-mapping returns an "offline" ZONE_DEVICE page.

OK.

> 
> When pfn_to_online_page() fails the get_user_pages() reference needs
> to be dropped.

The background is clear to me, and I agree with this patch.

Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

> 
> To be honest I dislike the entire flags based scheme for communicating
> the fact that page reference obtained by madvise needs to be dropped
> later. I'd rather pass a non-NULL 'struct page *' than redo
> pfn_to_page() conversions in the leaf functions, but that's a much
> larger change.

As Oscar mentions in another email, removing MF_COUNT_INCREASED was
recently discussed and rejected. I think that if pfn-page conversion
layer is somehow factored out from soft/hard offline handler, code would
be more readable/maintainable.

Thanks,
Naoya Horiguchi