[0/4] mm: permit guard regions for file-backed/shmem mappings

Message ID cover.1739469950.git.lorenzo.stoakes@oracle.com
Series mm: permit guard regions for file-backed/shmem mappings

Message

Lorenzo Stoakes Feb. 13, 2025, 6:16 p.m. UTC
The guard regions feature was initially implemented to support anonymous
mappings only, excluding shmem.

This was done so as to introduce the feature carefully and incrementally
and to be conservative when considering the various caveats and corner
cases that are applicable to file-backed mappings but not to anonymous
ones.

Now that this feature has landed in 6.13, it is time to revisit it and to
extend this functionality to file-backed and shmem mappings.

In order to make this maximally useful, and since files may be mapped
read-only (for instance ELF images), we also remove the restriction on
read-only mappings and permit the establishment of guard regions in any
non-hugetlb, non-mlock()'d mapping.

It is safe to permit guard regions in read-only mappings because the guard
regions only reduce access to the mapping, and removing them simply
reinstates the existing attributes of the underlying VMA, meaning no access
violations can occur.
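
For illustration, a minimal sketch (not part of this series) of installing
and removing a guard region in a read-only, file-backed mapping. It assumes
a kernel with these patches applied; the MADV_GUARD_* values (102/103) are
defined manually in case userspace headers do not yet provide them:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#define MADV_GUARD_REMOVE  103
#endif

int main(int argc, char **argv)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = open(argv[0], O_RDONLY);	/* any file, e.g. an ELF image */
	char *p;

	(void)argc;
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/* A read-only, private, file-backed mapping of four pages. */
	p = mmap(NULL, 4 * page, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Make page 1 a guard region - any access to it now raises SIGSEGV. */
	if (madvise(p + page, page, MADV_GUARD_INSTALL))
		perror("MADV_GUARD_INSTALL");

	/*
	 * Removing the guard simply reinstates the existing attributes of the
	 * underlying VMA, so the page can be faulted in read-only as normal.
	 */
	if (madvise(p + page, page, MADV_GUARD_REMOVE))
		perror("MADV_GUARD_REMOVE");

	munmap(p, 4 * page);
	close(fd);
	return EXIT_SUCCESS;
}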

While the change in kernel code introduced in this series is small, the
majority of the effort here is spent in extending the testing to assert
that the feature works correctly across numerous file-backed mapping
scenarios.

Every guard region self-test previously performed against anonymous memory
(that is, every test which is not inherently anonymous-only) has now been
updated to also be performed against shmem and against a mapping of a file
in the working directory.

This confirms that all cases also function correctly for file-backed guard
regions.

In addition a number of other tests are added for specific file-backed
mapping scenarios.

There are a number of other concerns that one might have with regard to
guard regions, addressed below:

Readahead
~~~~~~~~~

Readahead is a process through which the page cache is populated on the
assumption that sequential reads will occur, thus amortising I/O. Through
clever use of the PG_readahead folio flag, established during major fault
and checked upon minor fault, it also allows asynchronous I/O to occur as
data is processed, reducing I/O stalls as data is faulted in.

Guard regions do not alter this mechanism, which operates at the folio and
fault level, but do of course prevent the faulting in of folios that would
otherwise be mapped.

In the instance of a major fault prior to a guard region, synchronous
readahead will occur, including the population of folios in the page cache
which the guard region will, in the mapping in question, prevent access to.

In addition, if PG_readahead is set on a folio that is now inaccessible,
this will prevent asynchronous readahead from occurring as it otherwise
would.

However, there are mechanisms for heuristically resetting this within
readahead regardless, which will 'recover' correct readahead behaviour.

Readahead presumes sequential data access; the presence of a guard region
clearly indicates that, at least within the guard region, no such sequential
access will occur, as it cannot occur there.

So this should have very little impact on any real workload. The far more
important point is whether readahead causes incorrect or inappropriate
mapping of ranges disallowed by the presence of guard regions - this is not
the case, as readahead does not 'pre-fault' memory in this fashion.

At any rate, any mechanism which would attempt to do so would hit the usual
page fault paths, which correctly handle PTE markers as with anonymous
mappings.
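
To illustrate the point, a small sketch (not part of the series' selftests)
showing that populating the page cache via sequential reads - the path
which drives readahead - does not map the guarded page. It assumes a kernel
with these patches applied and a hypothetical pre-existing file 'data.bin'
of at least 16 pages; MADV_GUARD_INSTALL (102) is defined manually in case
headers lack it:

#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif

static sigjmp_buf jb;

static void on_segv(int sig)
{
	(void)sig;
	siglongjmp(jb, 1);
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = open("data.bin", O_RDONLY);	/* hypothetical test file */
	char buf[4096], *p;

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	p = mmap(NULL, 16 * page, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Guard page 8 of the mapping. */
	madvise(p + 8 * page, page, MADV_GUARD_INSTALL);

	/*
	 * Sequential reads through the fd drive readahead and populate the
	 * page cache, including folios 'behind' the guard region...
	 */
	while (read(fd, buf, sizeof(buf)) > 0)
		;

	/* ...but the guarded virtual page remains inaccessible. */
	signal(SIGSEGV, on_segv);
	if (sigsetjmp(jb, 1) == 0) {
		(void)*(volatile char *)(p + 8 * page);
		puts("unexpected: guard page was readable");
	} else {
		puts("guard page still faults after readahead");
	}

	return EXIT_SUCCESS;
}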

Fault-Around
~~~~~~~~~~~~

The fault-around logic, in a similar vein to readahead, attempts to improve
efficiency with regard to file-backed memory mappings; however, it differs
in that it does not try to fetch folios into the page cache that are about
to be accessed, but rather pre-maps a range of folios around the faulting
address.

The fact that guard regions make use of PTE markers makes this relatively
trivial, as this case is already handled - see filemap_map_folio_range() and
filemap_map_order0_folio() - in both instances, the solution is simply to
keep the established page table mappings and let the fault handler take
care of PTE markers, as per the comment:

	/*
	 * NOTE: If there're PTE markers, we'll leave them to be
	 * handled in the specific fault path, and it'll prohibit
	 * the fault-around logic.
	 */

This works because establishing guard regions results in page table mappings
with PTE markers, and removing guard regions clears those markers.

Truncation
~~~~~~~~~~

File truncation will not eliminate existing guard regions, as the
truncation operation will ultimately zap the range via
unmap_mapping_range(), which specifically excludes PTE markers.

Zapping
~~~~~~~

Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
which specifically deals with guard entries, leaving them intact except in
instances such as process teardown or munmap() where they need to be
removed.

Reclaim
~~~~~~~

When reclaim is performed on file-backed folios, it ultimately invokes
try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
will ultimately abort the operation for the guard region mapping. If large,
then check_pte() will determine that this is a non-device private
entry/device-exclusive entry 'swap' PTE and thus abort the operation in
that instance.

Therefore, nothing unexpected happens when reclaim is attempted upon a
file-backed guard region.

Hole Punching
~~~~~~~~~~~~~

This updates the page cache and ultimately invokes unmap_mapping_range(),
which explicitly leaves PTE markers in place.

Because the establishment of guard regions zapped any existing mappings to
file-backed folios, once the guard regions are removed the hole-punched
region will be faulted in as usual and everything will behave as expected.
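
As an illustration of this interaction, a sketch (not part of the series)
in which a guard region survives a hole punch and, once removed, the
punched range faults in as zeroes. It assumes a kernel with these patches
applied and memfd_create() support, and again defines the MADV_GUARD_*
values manually in case headers lack them:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#define MADV_GUARD_REMOVE  103
#endif

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = memfd_create("guard-demo", 0);
	char *p;

	if (fd < 0 || ftruncate(fd, 4 * page)) {
		perror("setup");
		return EXIT_FAILURE;
	}

	p = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Populate the shmem file with non-zero data. */
	memset(p, 0xaa, 4 * page);

	/* Guard page 1, then punch a hole over the same range. */
	madvise(p + page, page, MADV_GUARD_INSTALL);
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      page, page))
		perror("fallocate");

	/*
	 * The guard region is still in place after the hole punch; removing
	 * it lets the range be faulted in again - now as zeroes.
	 */
	madvise(p + page, page, MADV_GUARD_REMOVE);
	printf("first byte of punched page: %#x\n", p[page] & 0xff);

	return EXIT_SUCCESS;
}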

Lorenzo Stoakes (4):
  mm: allow guard regions in file-backed and read-only mappings
  selftests/mm: rename guard-pages to guard-regions
  tools/selftests: expand all guard region tests to file-backed
  tools/selftests: add file/shmem-backed mapping guard region tests

 mm/madvise.c                                  |   8 +-
 tools/testing/selftests/mm/.gitignore         |   2 +-
 tools/testing/selftests/mm/Makefile           |   2 +-
 .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
 4 files changed, 821 insertions(+), 112 deletions(-)
 rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)

--
2.48.1

Comments

David Hildenbrand Feb. 18, 2025, 3:20 p.m. UTC | #1
> Right yeah that'd be super weird. And I don't want to add that logic.
> 
>> Also not sure what happens if one does an mlock()/mlockall() after
>> already installing PTE markers.
> 
> The existing logic already handles non-present cases by skipping them, in
> mlock_pte_range():
> 
> 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> 		ptent = ptep_get(pte);
> 		if (!pte_present(ptent))
> 			continue;
> 
> 		...
> 	}

I *think* that code only updates already-mapped folios, to properly call 
mlock_folio()/munlock_folio().

It is not the code that populates pages on mlock()/mlockall(). I think 
all that goes via mm_populate()/__mm_populate(), where "ordinary GUP" 
should apply.

See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.

> 
> Which covers off guard regions. Removing the guard regions after this will
> leave you in a weird situation where these entries will be zapped... maybe
> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> also populate?

Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.

> 
> Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
> if you since locked the range? The code currently allows this on the
> proviso that 'you aren't zapping locked mappings' but leaves the VMA in a
> state such that some entries would not be locked.
> 
> It'd be pretty weird to lock guard regions like this.
> 
> Having said all that, given what you say below, maybe it's not an issue
> after all?...
> 
>>
>> __mm_populate() would skip whole VMAs in case populate_vma_page_range()
>> fails. And I would assume populate_vma_page_range() fails on the first
>> guard when it triggers a page fault.
>>
>> OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
>> MADV_DONTNEED_LOCKED originates from:
>>
>> commit 9457056ac426e5ed0671356509c8dcce69f8dee0
>> Author: Johannes Weiner <hannes@cmpxchg.org>
>> Date:   Thu Mar 24 18:14:12 2022 -0700
>>
>>      mm: madvise: MADV_DONTNEED_LOCKED
>>      MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
>>      and MCL_ONFAULT allowing to mlock without populating, there are valid use
>>      cases for depopulating locked ranges as well.
> 
> ...Hm this seems to imply the current guard remove stuff isn't quite so
> bad, so maybe the assumption that VM_LOCKED implies 'everything is
> populated' isn't quite as stringent then.

Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is 
that everything is populated (unless, apparently one uses 
MADV_DONTNEED_LOCKED or population failed, maybe).

VM_LOCKONFAULT seems to be the sane case. I wonder why 
MADV_DONTNEED_LOCKED didn't explicitly check for that one ... maybe 
there is a history to that.

> 
> The restriction is as simple as:
> 
> 		if (behavior != MADV_DONTNEED_LOCKED)
> 			forbidden |= VM_LOCKED;
> 
>>
>>
>> Adding support for that would be indeed nice.
> 
> I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
> locked ranges, but I really don't understand why you'd want to add guard
> regions to mlock()'ed regions?

Some apps use mlockall(), and it might be nice to just be able to use
guard pages as if "Nothing happened".

E.g., QEMU has the option to use mlockall().

> 
> Then again we're currently asymmetric as you can add them _before_
> mlock()'ing...

Right.
David Hildenbrand Feb. 18, 2025, 5:14 p.m. UTC | #2
On 18.02.25 17:43, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
>>> Right yeah that'd be super weird. And I don't want to add that logic.
>>>
>>>> Also not sure what happens if one does an mlock()/mlockall() after
>>>> already installing PTE markers.
>>>
>>> The existing logic already handles non-present cases by skipping them, in
>>> mlock_pte_range():
>>>
>>> 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
>>> 		ptent = ptep_get(pte);
>>> 		if (!pte_present(ptent))
>>> 			continue;
>>>
>>> 		...
>>> 	}
>>
>> I *think* that code only updates already-mapped folios, to properly call
>> mlock_folio()/munlock_folio().
> 
> Guard regions _are_ 'already mapped' :) so it leaves them in place.

"mapped folios" -- there is no folio mapped. Yes, the VMA is in place.

> 
> do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
> implies this will be invoked.

Yes, to process any already mapped folios, to then continue population 
later.

> 
>>
>> It is not the code that populates pages on mlock()/mlockall(). I think all
>> that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
>> apply.
> 
> OK I want to correct what I said earlier.
> 
> Installing a guard region then attempting mlock() will result in an error. The
> populate will -EFAULT and stop at the guard region, which causes mlock() to
> error out.

Right, that's my expectation.

> 
> This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
> the populate halts at the guard region.
> 
> This is ok as per previous discussion on aggregate operation failure, there can
> be no expectation of 'unwinding' of partially successful operations that form
> part of a requested aggregate one.
> 
> However, given there's stuff to clean up, and on error a user _may_ wish to then
> remove guard regions and try again, I guess there's no harm in keeping the code
> as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.

Likely yes; it's all weird code.

> 
>>
>> See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
> 
> Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
> yeah wrongly, very wrongly...
> 
>>
>>>
>>> Which covers off guard regions. Removing the guard regions after this will
>>> leave you in a weird situation where these entries will be zapped... maybe
>>> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
>>> also populate?
>>
>> Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
> 
> See above, no we should not :P this is only good for cleanup after mlock()
> failure, although no sane program should really be trying to do this, a sane
> program would give up here (and it's a _programmatic error_ to try to mlock() a
> range with guard regions).
>> Some apps use mlockall(), and it might be nice to just be able to use
>> guard pages as if "Nothing happened".
> 
> Sadly I think not given above :P

QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); 
when requested to then exit(); if it fails.

Assume glibc or any lib uses it, QEMU would have no real way of figuring 
that out or instructing offending libraries to disable that, at least
for now  ...

... turning RT VMs less usable if any library uses guard regions. :(

There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).

[1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com

> 
>>
>> E.g., QEMU has the option to use mlockall().
>>
>>>
>>> Then again we're currently asymmetric as you can add them _before_
>>> mlock()'ing...
>>
>> Right.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> I think the _LOCKED idea is therefore kaput, because it just won't work
> properly because populating guard regions fails.

Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT 
might be interesting, because we are skipping the population stage.

> 
> It fails because it tries to 'touch' the memory, but 'touching' guard
> region memory causes a segfault. This kind of breaks the idea of
> mlock()'ing guard regions.
> 
> I think adding workarounds to make this possible in any way is not really
> worth it (and would probably be pretty gross).
> 
> We already document that 'mlock()ing lightweight guard regions will fail'
> as per man page so this is all in line with that.

Right, and I claim that supporting VM_LOCKONFAULT might likely be as 
easy as allowing install/remove of guard regions when that flag is set.
Lorenzo Stoakes Feb. 18, 2025, 5:20 p.m. UTC | #3
On Tue, Feb 18, 2025 at 06:14:00PM +0100, David Hildenbrand wrote:
> On 18.02.25 17:43, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
> > > > Right yeah that'd be super weird. And I don't want to add that logic.
> > > >
> > > > > Also not sure what happens if one does an mlock()/mlockall() after
> > > > > already installing PTE markers.
> > > >
> > > > The existing logic already handles non-present cases by skipping them, in
> > > > mlock_pte_range():
> > > >
> > > > 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> > > > 		ptent = ptep_get(pte);
> > > > 		if (!pte_present(ptent))
> > > > 			continue;
> > > >
> > > > 		...
> > > > 	}
> > >
> > > I *think* that code only updates already-mapped folios, to properly call
> > > mlock_folio()/munlock_folio().
> >
> > Guard regions _are_ 'already mapped' :) so it leaves them in place.
>
> "mapped folios" -- there is no folio mapped. Yes, the VMA is in place.

We're engaging in a moot discussion on this I think but I mean it appears
to operate by walking page tables if they are populated, which they will be
for guard regions, but when it finds it's non-present it will skip.

This amounts to the same thing as not doing anything, obviously.

>
> >
> > do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
> > implies this will be invoked.
>
> Yes, to process any already mapped folios, to then continue population
> later.
>
> >
> > >
> > > It is not the code that populates pages on mlock()/mlockall(). I think all
> > > that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
> > > apply.
> >
> > OK I want to correct what I said earlier.
> >
> > Installing a guard region then attempting mlock() will result in an error. The
> > populate will -EFAULT and stop at the guard region, which causes mlock() to
> > error out.
>
> Right, that's my expectation.

OK good!

>
> >
> > This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
> > the populate halts at the guard region.
> >
> > This is ok as per previous discussion on aggregate operation failure, there can
> > be no expectation of 'unwinding' of partially successful operations that form
> > part of a requested aggregate one.
> >
> > However, given there's stuff to clean up, and on error a user _may_ wish to then
> > remove guard regions and try again, I guess there's no harm in keeping the code
> > as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.
>
> Likely yes; it's all weird code.
>
> >
> > >
> > > See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
> >
> > Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
> > yeah wrongly, very wrongly...
> >
> > >
> > > >
> > > > Which covers off guard regions. Removing the guard regions after this will
> > > > leave you in a weird situation where these entries will be zapped... maybe
> > > > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> > > > also populate?
> > >
> > > Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
> >
> > See above, no we should not :P this is only good for cleanup after mlock()
> > failure, although no sane program should really be trying to do this, a sane
> > program would give up here (and it's a _programmatic error_ to try to mlock() a
> > range with guard regions).
> >> Some apps use mlockall(), and it might be nice to just be able to use
> >> guard pages as if "Nothing happened".
> >
> > Sadly I think not given above :P
>
> QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when
> requested to then exit(); if it fails.

Hm under what circumstances? I use qemu extensively to test this stuff with
no issues. Unless you mean it's using it in the 'host' code somehow.

>
> Assume glibc or any lib uses it, QEMU would have no real way of figuring
> that out or instructing offending libraries to disable that, at least for
> now  ...
>
> ... turning RT VMs less usable if any library uses guard regions. :(
>

This seems really stupid, to be honest. Unfortunately there's no way around
this, if software does stupid things then they get stupid prizes. There are
other ways mlock() and faulting in can fail too.

> There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).

Good.

>
> [1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com
>
> >
> > >
> > > E.g., QEMU has the option to use mlockall().
> > >
> > > >
> > > > Then again we're currently asymmetric as you can add them _before_
> > > > mlock()'ing...
> > >
> > > Right.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > I think the _LOCKED idea is therefore kaput, because it just won't work
> > properly because populating guard regions fails.
>
> Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT might
> be interesting, because we are skipping the population stage.
>
> >
> > It fails because it tries to 'touch' the memory, but 'touching' guard
> > region memory causes a segfault. This kind of breaks the idea of
> > mlock()'ing guard regions.
> >
> > I think adding workarounds to make this possible in any way is not really
> > worth it (and would probably be pretty gross).
> >
> > We already document that 'mlock()ing lightweight guard regions will fail'
> > as per man page so this is all in line with that.
>
> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
> allowing install/remove of guard regions when that flag is set.

We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
disallow.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Feb. 18, 2025, 5:25 p.m. UTC | #4
>>
>> QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when
>> requested to then exit(); if it fails.
> 
> Hm under what circumstances? I use qemu extensively to test this stuff with
> no issues. Unless you mean it's using it in the 'host' code somehow.


-overcommit mem-lock=on

or (legacy)

-realtime mlock=on

I think.

[...]

>>>
>>> It fails because it tries to 'touch' the memory, but 'touching' guard
>>> region memory causes a segfault. This kind of breaks the idea of
>>> mlock()'ing guard regions.
>>>
>>> I think adding workarounds to make this possible in any way is not really
>>> worth it (and would probably be pretty gross).
>>>
>>> We already document that 'mlock()ing lightweight guard regions will fail'
>>> as per man page so this is all in line with that.
>>
>> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
>> allowing install/remove of guard regions when that flag is set.
> 
> We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
> disallow.


See mlock2();

SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
	vm_flags_t vm_flags = VM_LOCKED;

	if (flags & ~MLOCK_ONFAULT)
		return -EINVAL;

	if (flags & MLOCK_ONFAULT)
		vm_flags |= VM_LOCKONFAULT;

	return do_mlock(start, len, vm_flags);
}


VM_LOCKONFAULT always has VM_LOCKED set as well.
David Hildenbrand Feb. 18, 2025, 5:31 p.m. UTC | #5
>> See mlock2();
>>
>> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
>> {
>> 	vm_flags_t vm_flags = VM_LOCKED;
>>
>> 	if (flags & ~MLOCK_ONFAULT)
>> 		return -EINVAL;
>>
>> 	if (flags & MLOCK_ONFAULT)
>> 		vm_flags |= VM_LOCKONFAULT;
>>
>> 	return do_mlock(start, len, vm_flags);
>> }
>>
>>
>> VM_LOCKONFAULT always has VM_LOCKED set as well.
> 
> OK cool, that makes sense.
> 
> As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
> again, then... :P if only somebody would write it down in a book...

And if there were such a book in the making, I would consider
hurrying up and preordering it ;)