[0/4] mm: permit guard regions for file-backed/shmem mappings

Message ID	cover.1739469950.git.lorenzo.stoakes@oracle.com
Headers	show Received: from mx0b-00069f02.pphosted.com (mx0b-00069f02.pphosted.com [205.220.177.32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 69BF72036FF; Thu, 13 Feb 2025 18:19:11 +0000 (UTC) From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> To: Andrew Morton <akpm@linux-foundation.org> Cc: Suren Baghdasaryan <surenb@google.com>, "Liam R . Howlett" <Liam.Howlett@oracle.com>, Matthew Wilcox <willy@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, "Paul E . McKenney" <paulmck@kernel.org>, Jann Horn <jannh@google.com>, David Hildenbrand <david@redhat.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Shuah Khan <shuah@kernel.org>, linux-kselftest@vger.kernel.org, linux-api@vger.kernel.org, John Hubbard <jhubbard@nvidia.com>, Juan Yescas <jyescas@google.com>, Kalesh Singh <kaleshsingh@google.com> Subject: [PATCH 0/4] mm: permit guard regions for file-backed/shmem mappings Date: Thu, 13 Feb 2025 18:16:59 +0000 Message-ID: <cover.1739469950.git.lorenzo.stoakes@oracle.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk MIME-Version: 1.0
Series	mm: permit guard regions for file-backed/shmem mappings \| expand [0/4] mm: permit guard regions for file-backed/shmem mappings [1/4] mm: allow guard regions in file-backed and read-only mappings [2/4] selftests/mm: rename guard-pages to guard-regions [3/4] tools/selftests: expand all guard region tests to file-backed [4/4] tools/selftests: add file/shmem-backed mapping guard region tests

Lorenzo Stoakes Feb. 13, 2025, 6:16 p.m. UTC

The guard regions feature was initially implemented to support anonymous
mappings only, excluding shmem.

This was done such as to introduce the feature carefully and incrementally
and to be conservative when considering the various caveats and corner
cases that are applicable to file-backed mappings but not to anonymous
ones.

Now this feature has landed in 6.13, it is time to revisit this and to
extend this functionality to file-backed and shmem mappings.

In order to make this maximally useful, and since one may map file-backed
mappings read-only (for instance ELF images), we also remove the
restriction on read-only mappings and permit the establishment of guard
regions in any non-hugetlb, non-mlock()'d mapping.

It is permissible to permit the establishment of guard regions in read-only
mappings because the guard regions only reduce access to the mapping, and
when removed simply reinstate the existing attributes of the underlying
VMA, meaning no access violations can occur.

While the change in kernel code introduced in this series is small, the
majority of the effort here is spent in extending the testing to assert
that the feature works correctly across numerous file-backed mapping
scenarios.

Every single guard region self-test performed against anonymous memory
(which is relevant and not anon-only) has now been updated to also be
performed against shmem and a mapping of a file in the working directory.

This confirms that all cases also function correctly for file-backed guard
regions.

In addition a number of other tests are added for specific file-backed
mapping scenarios.

There are a number of other concerns that one might have with regard to
guard regions, addressed below:

Readahead
~~~~~~~~~

Readahead is a process through which the page cache is populated on the
assumption that sequential reads will occur, thus amortising I/O and,
through a clever use of the PG_readahead folio flag establishing during
major fault and checked upon minor fault, provides for asynchronous I/O to
occur as dat is processed, reducing I/O stalls as data is faulted in.

Guard regions do not alter this mechanism which operations at the folio and
fault level, but do of course prevent the faulting of folios that would
otherwise be mapped.

In the instance of a major fault prior to a guard region, synchronous
readahead will occur including populating folios in the page cache which
the guard regions will, in the case of the mapping in question, prevent
access to.

In addition, if PG_readahead is placed in a folio that is now inaccessible,
this will prevent asynchronous readahead from occurring as it would
otherwise do.

However, there are mechanisms for heuristically resetting this within
readahead regardless, which will 'recover' correct readahead behaviour.

Readahead presumes sequential data access, the presence of a guard region
clearly indicates that, at least in the guard region, no such sequential
access will occur, as it cannot occur there.

So this should have very little impact on any real workload. The far more
important point is as to whether readahead causes incorrect or
inappropriate mapping of ranges disallowed by the presence of guard
regions - this is not the case, as readahead does not 'pre-fault' memory in
this fashion.

At any rate, any mechanism which would attempt to do so would hit the usual
page fault paths, which correctly handle PTE markers as with anonymous
mappings.

Fault-Around
~~~~~~~~~~~~

The fault-around logic, in a similar vein to readahead, attempts to improve
efficiency with regard to file-backed memory mappings, however it differs
in that it does not try to fetch folios into the page cache that are about
to be accessed, but rather pre-maps a range of folios around the faulting
address.

Guard regions making use of PTE markers makes this relatively trivial, as
this case is already handled - see filemap_map_folio_range() and
filemap_map_order0_folio() - in both instances, the solution is to simply
keep the established page table mappings and let the fault handler take
care of PTE markers, as per the comment:

	/*
	 * NOTE: If there're PTE markers, we'll leave them to be
	 * handled in the specific fault path, and it'll prohibit
	 * the fault-around logic.
	 */

This works, as establishing guard regions results in page table mappings
with PTE markers, and clearing them removes them.

Truncation
~~~~~~~~~~

File truncation will not eliminate existing guard regions, as the
truncation operation will ultimately zap the range via
unmap_mapping_range(), which specifically excludes PTE markers.

Zapping
~~~~~~~

Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
which specifically deals with guard entries, leaving them intact except in
instances such as process teardown or munmap() where they need to be
removed.

Reclaim
~~~~~~~

When reclaim is performed on file-backed folios, it ultimately invokes
try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
will ultimately abort the operation for the guard region mapping. If large,
then check_pte() will determine that this is a non-device private
entry/device-exclusive entry 'swap' PTE and thus abort the operation in
that instance.

Therefore, no odd things happen in the instance of reclaim being attempted
upon a file-backed guard region.

Hole Punching
~~~~~~~~~~~~~

This updates the page cache and ultimately invokes unmap_mapping_range(),
which explicitly leaves PTE markers in place.

Because the establishment of guard regions zapped any existing mappings to
file-backed folios, once the guard regions are removed then the
hole-punched region will be faulted in as usual and everything will behave
as expected.

Lorenzo Stoakes (4):
  mm: allow guard regions in file-backed and read-only mappings
  selftests/mm: rename guard-pages to guard-regions
  tools/selftests: expand all guard region tests to file-backed
  tools/selftests: add file/shmem-backed mapping guard region tests

 mm/madvise.c                                  |   8 +-
 tools/testing/selftests/mm/.gitignore         |   2 +-
 tools/testing/selftests/mm/Makefile           |   2 +-
 .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
 4 files changed, 821 insertions(+), 112 deletions(-)
 rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)

--
2.48.1

Vlastimil Babka Feb. 18, 2025, 12:12 p.m. UTC | #1

On 2/13/25 19:16, Lorenzo Stoakes wrote:
> The guard regions feature was initially implemented to support anonymous
> mappings only, excluding shmem.
> 
> This was done such as to introduce the feature carefully and incrementally
> and to be conservative when considering the various caveats and corner
> cases that are applicable to file-backed mappings but not to anonymous
> ones.
> 
> Now this feature has landed in 6.13, it is time to revisit this and to
> extend this functionality to file-backed and shmem mappings.
> 
> In order to make this maximally useful, and since one may map file-backed
> mappings read-only (for instance ELF images), we also remove the
> restriction on read-only mappings and permit the establishment of guard
> regions in any non-hugetlb, non-mlock()'d mapping.

Do you plan to address mlock later too? I guess somebody might find use for
those. Is there some fundamental issue or just that we need to define some
good semantics for corner cases? (i.e. if pages are already populated in the
mlocked area, discarding them by replacing with guard pte's goes against
that, so do we allow it or not?).

Otherwise nice discussion of all the potential issues here!

Lorenzo Stoakes Feb. 18, 2025, 1:05 p.m. UTC | #2

On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
> On 2/13/25 19:16, Lorenzo Stoakes wrote:
> > The guard regions feature was initially implemented to support anonymous
> > mappings only, excluding shmem.
> >
> > This was done such as to introduce the feature carefully and incrementally
> > and to be conservative when considering the various caveats and corner
> > cases that are applicable to file-backed mappings but not to anonymous
> > ones.
> >
> > Now this feature has landed in 6.13, it is time to revisit this and to
> > extend this functionality to file-backed and shmem mappings.
> >
> > In order to make this maximally useful, and since one may map file-backed
> > mappings read-only (for instance ELF images), we also remove the
> > restriction on read-only mappings and permit the establishment of guard
> > regions in any non-hugetlb, non-mlock()'d mapping.
>
> Do you plan to address mlock later too? I guess somebody might find use for
> those. Is there some fundamental issue or just that we need to define some
> good semantics for corner cases? (i.e. if pages are already populated in the
> mlocked area, discarding them by replacing with guard pte's goes against
> that, so do we allow it or not?).

Yeah that's the fundamental issue with mlock, it does not interact with the
zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so
for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked
_afterwards_ there will be no zapping).

We could potentially expose an equivalent, as there are for other flags, a
_LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make
this explicit.

That is probably the most sensible option, if there is a need for this!

>
> Otherwise nice discussion of all the potential issues here!
>

Thanks! :)

David Hildenbrand Feb. 18, 2025, 2:35 p.m. UTC | #3

On 18.02.25 14:05, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
>> On 2/13/25 19:16, Lorenzo Stoakes wrote:
>>> The guard regions feature was initially implemented to support anonymous
>>> mappings only, excluding shmem.
>>>
>>> This was done such as to introduce the feature carefully and incrementally
>>> and to be conservative when considering the various caveats and corner
>>> cases that are applicable to file-backed mappings but not to anonymous
>>> ones.
>>>
>>> Now this feature has landed in 6.13, it is time to revisit this and to
>>> extend this functionality to file-backed and shmem mappings.
>>>
>>> In order to make this maximally useful, and since one may map file-backed
>>> mappings read-only (for instance ELF images), we also remove the
>>> restriction on read-only mappings and permit the establishment of guard
>>> regions in any non-hugetlb, non-mlock()'d mapping.
>>
>> Do you plan to address mlock later too? I guess somebody might find use for
>> those. Is there some fundamental issue or just that we need to define some
>> good semantics for corner cases? (i.e. if pages are already populated in the
>> mlocked area, discarding them by replacing with guard pte's goes against
>> that, so do we allow it or not?).
> 
> Yeah that's the fundamental issue with mlock, it does not interact with the
> zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so
> for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked
> _afterwards_ there will be no zapping).
> 
> We could potentially expose an equivalent, as there are for other flags, a
> _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make
> this explicit.
> 
> That is probably the most sensible option, if there is a need for this!

mlock is weird, because it assumes that memory will be faulted in in the whole VMA.

You'd likely have to populate + mlock the page when removing the guard.
Also not sure what happens if one does an mlock()/mlockall() after
already installing PTE markers.

__mm_populate() would skip whole VMAs in case populate_vma_page_range()
fails. And I would assume populate_vma_page_range() fails on the first
guard when it triggers a page fault.

OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
MADV_DONTNEED_LOCKED originates from:

commit 9457056ac426e5ed0671356509c8dcce69f8dee0
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Thu Mar 24 18:14:12 2022 -0700

     mm: madvise: MADV_DONTNEED_LOCKED
     
     MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
     and MCL_ONFAULT allowing to mlock without populating, there are valid use
     cases for depopulating locked ranges as well.


Adding support for that would be indeed nice.

Lorenzo Stoakes Feb. 18, 2025, 2:53 p.m. UTC | #4

On Tue, Feb 18, 2025 at 03:35:19PM +0100, David Hildenbrand wrote:
> On 18.02.25 14:05, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote:
> > > On 2/13/25 19:16, Lorenzo Stoakes wrote:
> > > > The guard regions feature was initially implemented to support anonymous
> > > > mappings only, excluding shmem.
> > > >
> > > > This was done such as to introduce the feature carefully and incrementally
> > > > and to be conservative when considering the various caveats and corner
> > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > ones.
> > > >
> > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > extend this functionality to file-backed and shmem mappings.
> > > >
> > > > In order to make this maximally useful, and since one may map file-backed
> > > > mappings read-only (for instance ELF images), we also remove the
> > > > restriction on read-only mappings and permit the establishment of guard
> > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > >
> > > Do you plan to address mlock later too? I guess somebody might find use for
> > > those. Is there some fundamental issue or just that we need to define some
> > > good semantics for corner cases? (i.e. if pages are already populated in the
> > > mlocked area, discarding them by replacing with guard pte's goes against
> > > that, so do we allow it or not?).
> >
> > Yeah that's the fundamental issue with mlock, it does not interact with the
> > zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so
> > for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked
> > _afterwards_ there will be no zapping).
> >
> > We could potentially expose an equivalent, as there are for other flags, a
> > _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make
> > this explicit.
> >
> > That is probably the most sensible option, if there is a need for this!
>
> mlock is weird, because it assumes that memory will be faulted in in the whole VMA.
>
> You'd likely have to populate + mlock the page when removing the guard.

Right yeah that'd be super weird. And I don't want to add that logic.

> Also not sure what happens if one does an mlock()/mlockall() after
> already installing PTE markers.

The existing logic already handles non-present cases by skipping them, in
mlock_pte_range():

	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
		ptent = ptep_get(pte);
		if (!pte_present(ptent))
			continue;

		...
	}

Which covers off guard regions. Removing the guard regions after this will
leave you in a weird situation where these entries will be zapped... maybe
we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
also populate?

Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
if you since locked the range? The code currently allows this on the
proviso that 'you aren't zapping locked mappings' but leaves the VMA in a
state such that some entries would not be locked.

It'd be pretty weird to lock guard regions like this.

Having said all that, given what you say below, maybe it's not an issue
after all?...

>
> __mm_populate() would skip whole VMAs in case populate_vma_page_range()
> fails. And I would assume populate_vma_page_range() fails on the first
> guard when it triggers a page fault.
>
> OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
> MADV_DONTNEED_LOCKED originates from:
>
> commit 9457056ac426e5ed0671356509c8dcce69f8dee0
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Thu Mar 24 18:14:12 2022 -0700
>
>     mm: madvise: MADV_DONTNEED_LOCKED
>     MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
>     and MCL_ONFAULT allowing to mlock without populating, there are valid use
>     cases for depopulating locked ranges as well.

...Hm this seems to imply the current guard remove stuff isn't quite so
bad, so maybe the assumption that VM_LOCKED implies 'everything is
populated' isn't quite as stringent then.

The restriction is as simple as:

		if (behavior != MADV_DONTNEED_LOCKED)
			forbidden |= VM_LOCKED;

>
>
> Adding support for that would be indeed nice.

I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
locked ranges, but I really don't understand why you'd want to add guard
regions to mlock()'ed regions?

Then again we're currently asymmetric as you can add them _before_
mlock()'ing...

>
> --
> Cheers,
>
> David / dhildenb
>

David Hildenbrand Feb. 18, 2025, 3:20 p.m. UTC | #5

> Right yeah that'd be super weird. And I don't want to add that logic.
> 
>> Also not sure what happens if one does an mlock()/mlockall() after
>> already installing PTE markers.
> 
> The existing logic already handles non-present cases by skipping them, in
> mlock_pte_range():
> 
> 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> 		ptent = ptep_get(pte);
> 		if (!pte_present(ptent))
> 			continue;
> 
> 		...
> 	}

I *think* that code only updates already-mapped folios, to properly call 
mlock_folio()/munlock_folio().

It is not the code that populates pages on mlock()/mlockall(). I think 
all that goes via mm_populate()/__mm_populate(), where "ordinary GUP" 
should apply.

See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.

> 
> Which covers off guard regions. Removing the guard regions after this will
> leave you in a weird situation where these entries will be zapped... maybe
> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> also populate?

Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.

> 
> Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
> if you since locked the range? The code currently allows this on the
> proviso that 'you aren't zapping locked mappings' but leaves the VMA in a
> state such that some entries would not be locked.
> 
> It'd be pretty weird to lock guard regions like this.
> 
> Having said all that, given what you say below, maybe it's not an issue
> after all?...
> 
>>
>> __mm_populate() would skip whole VMAs in case populate_vma_page_range()
>> fails. And I would assume populate_vma_page_range() fails on the first
>> guard when it triggers a page fault.
>>
>> OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
>> MADV_DONTNEED_LOCKED originates from:
>>
>> commit 9457056ac426e5ed0671356509c8dcce69f8dee0
>> Author: Johannes Weiner <hannes@cmpxchg.org>
>> Date:   Thu Mar 24 18:14:12 2022 -0700
>>
>>      mm: madvise: MADV_DONTNEED_LOCKED
>>      MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
>>      and MCL_ONFAULT allowing to mlock without populating, there are valid use
>>      cases for depopulating locked ranges as well.
> 
> ...Hm this seems to imply the current guard remove stuff isn't quite so
> bad, so maybe the assumption that VM_LOCKED implies 'everything is
> populated' isn't quite as stringent then.

Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is 
that everything is populated (unless, apparently one uses 
MADV_DONTNEED_LOCKED or population failed, maybe).

VM_LOCKONFAULT seems to be the sane case. I wonder why 
MADV_DONTNEED_LOCKED didn't explicitly check for that one ... maybe 
there is a history to that.

> 
> The restriction is as simple as:
> 
> 		if (behavior != MADV_DONTNEED_LOCKED)
> 			forbidden |= VM_LOCKED;
> 
>>
>>
>> Adding support for that would be indeed nice.
> 
> I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
> locked ranges, but I really don't understand why you'd want to add guard
> regions to mlock()'ed regions?

Somme apps use mlockall(), and it might be nice to just be able to use 
guard pages as if "Nothing happened".

E.g., QEMU has the option to use mlockall().

> 
> Then again we're currently asymmetric as you can add them _before_
> mlock()'ing...

Right.

Lorenzo Stoakes Feb. 18, 2025, 4:43 p.m. UTC | #6

On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
> > Right yeah that'd be super weird. And I don't want to add that logic.
> >
> > > Also not sure what happens if one does an mlock()/mlockall() after
> > > already installing PTE markers.
> >
> > The existing logic already handles non-present cases by skipping them, in
> > mlock_pte_range():
> >
> > 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> > 		ptent = ptep_get(pte);
> > 		if (!pte_present(ptent))
> > 			continue;
> >
> > 		...
> > 	}
>
> I *think* that code only updates already-mapped folios, to properly call
> mlock_folio()/munlock_folio().

Guard regions _are_ 'already mapped' :) so it leaves them in place.

do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
implies this will be invoked.

>
> It is not the code that populates pages on mlock()/mlockall(). I think all
> that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
> apply.

OK I want to correct what I said earlier.

Installing a guard region then attempting mlock() will result in an error. The
populate will -EFAULT and stop at the guard region, which causes mlock() to
error out.

This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
the populate halts at the guard region.

This is ok as per previous discussion on aggregate operation failure, there can
be no expectation of 'unwinding' of partially successful operations that form
part of a requested aggregate one.

However, given there's stuff to clean up, and on error a user _may_ wish to then
remove guard regions and try again, I guess there's no harm in keeping the code
as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.

>
> See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.

Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
yeah wrongly, very wrongly...

>
> >
> > Which covers off guard regions. Removing the guard regions after this will
> > leave you in a weird situation where these entries will be zapped... maybe
> > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> > also populate?
>
> Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.

See above, no we should not :P this is only good for cleanup after mlock()
failure, although no sane program should really be trying to do this, a sane
program would give up here (and it's a _programmatic error_ to try to mlock() a
range with guard regions).

>
> >
> > Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE
> > if you since locked the range? The code currently allows this on the
> > proviso that 'you aren't zapping locked mappings' but leaves the VMA in a
> > state such that some entries would not be locked.
> >
> > It'd be pretty weird to lock guard regions like this.
> >
> > Having said all that, given what you say below, maybe it's not an issue
> > after all?...
> >
> > >
> > > __mm_populate() would skip whole VMAs in case populate_vma_page_range()
> > > fails. And I would assume populate_vma_page_range() fails on the first
> > > guard when it triggers a page fault.
> > >
> > > OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where
> > > MADV_DONTNEED_LOCKED originates from:
> > >
> > > commit 9457056ac426e5ed0671356509c8dcce69f8dee0
> > > Author: Johannes Weiner <hannes@cmpxchg.org>
> > > Date:   Thu Mar 24 18:14:12 2022 -0700
> > >
> > >      mm: madvise: MADV_DONTNEED_LOCKED
> > >      MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
> > >      and MCL_ONFAULT allowing to mlock without populating, there are valid use
> > >      cases for depopulating locked ranges as well.
> >
> > ...Hm this seems to imply the current guard remove stuff isn't quite so
> > bad, so maybe the assumption that VM_LOCKED implies 'everything is
> > populated' isn't quite as stringent then.
>
> Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is
> that everything is populated (unless, apparently one uses
> MADV_DONTNEED_LOCKED or population failed, maybe).
>
> VM_LOCKONFAULT seems to be the sane case. I wonder why MADV_DONTNEED_LOCKED
> didn't explicitly check for that one ... maybe there is a history to that.

Yeah weird.

>
> >
> > The restriction is as simple as:
> >
> > 		if (behavior != MADV_DONTNEED_LOCKED)
> > 			forbidden |= VM_LOCKED;
> >
> > >
> > >
> > > Adding support for that would be indeed nice.
> >
> > I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED
> > locked ranges, but I really don't understand why you'd want to add guard
> > regions to mlock()'ed regions?
>
> Somme apps use mlockall(), and it might be nice to just be able to use guard
> pages as if "Nothing happened".

Sadly I think not given above :P

>
> E.g., QEMU has the option to use mlockall().
>
> >
> > Then again we're currently asymmetric as you can add them _before_
> > mlock()'ing...
>
> Right.
>
> --
> Cheers,
>
> David / dhildenb
>

I think the _LOCKED idea is therefore kaput, because it just won't work
properly because populating guard regions fails.

It fails because it tries to 'touch' the memory, but 'touching' guard
region memory causes a segfault. This kind of breaks the idea of
mlock()'ing guard regions.

I think adding workarounds to make this possible in any way is not really
worth it (and would probably be pretty gross).

We already document that 'mlock()ing lightweight guard regions will fail'
as per man page so this is all in line with that.

David Hildenbrand Feb. 18, 2025, 5:14 p.m. UTC | #7

On 18.02.25 17:43, Lorenzo Stoakes wrote:
> On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
>>> Right yeah that'd be super weird. And I don't want to add that logic.
>>>
>>>> Also not sure what happens if one does an mlock()/mlockall() after
>>>> already installing PTE markers.
>>>
>>> The existing logic already handles non-present cases by skipping them, in
>>> mlock_pte_range():
>>>
>>> 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
>>> 		ptent = ptep_get(pte);
>>> 		if (!pte_present(ptent))
>>> 			continue;
>>>
>>> 		...
>>> 	}
>>
>> I *think* that code only updates already-mapped folios, to properly call
>> mlock_folio()/munlock_folio().
> 
> Guard regions _are_ 'already mapped' :) so it leaves them in place.

"mapped folios" -- there is no folio mapped. Yes, the VMA is in place.

> 
> do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
> implies this will be invoked.

Yes, to process any already mapped folios, to then continue population 
later.

> 
>>
>> It is not the code that populates pages on mlock()/mlockall(). I think all
>> that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
>> apply.
> 
> OK I want to correct what I said earlier.
> 
> Installing a guard region then attempting mlock() will result in an error. The
> populate will -EFAULT and stop at the guard region, which causes mlock() to
> error out.

Right, that's my expectation.

> 
> This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
> the populate halts at the guard region.
> 
> This is ok as per previous discussion on aggregate operation failure, there can
> be no expectation of 'unwinding' of partially successful operations that form
> part of a requested aggregate one.
> 
> However, given there's stuff to clean up, and on error a user _may_ wish to then
> remove guard regions and try again, I guess there's no harm in keeping the code
> as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.

Likely yes; it's all weird code.

> 
>>
>> See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
> 
> Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
> yeah wrongly, very wrongly...
> 
>>
>>>
>>> Which covers off guard regions. Removing the guard regions after this will
>>> leave you in a weird situation where these entries will be zapped... maybe
>>> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
>>> also populate?
>>
>> Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
> 
> See above, no we should not :P this is only good for cleanup after mlock()
> failure, although no sane program should really be trying to do this, a sane
> program would give up here (and it's a _programmatic error_ to try to mlock() a
> range with guard regions).
 >>>> Somme apps use mlockall(), and it might be nice to just be able to 
use guard
>> pages as if "Nothing happened".
> 
> Sadly I think not given above :P

QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); 
when requested to then exit(); if it fails.

Assume glibc or any lib uses it, QEMU would have no real way of figuring 
that out or instructing offending libraries to disabled that, at least 
for now  ...

... turning RT VMs less usable if any library uses guard regions. :(

There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).

[1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com

> 
>>
>> E.g., QEMU has the option to use mlockall().
>>
>>>
>>> Then again we're currently asymmetric as you can add them _before_
>>> mlock()'ing...
>>
>> Right.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> I think the _LOCKED idea is therefore kaput, because it just won't work
> properly because populating guard regions fails.

Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT 
might be interesting, because we are skipping the population stage.

> 
> It fails because it tries to 'touch' the memory, but 'touching' guard
> region memory causes a segfault. This kind of breaks the idea of
> mlock()'ing guard regions.
> 
> I think adding workarounds to make this possible in any way is not really
> worth it (and would probably be pretty gross).
> 
> We already document that 'mlock()ing lightweight guard regions will fail'
> as per man page so this is all in line with that.

Right, and I claim that supporting VM_LOCKONFAULT might likely be as 
easy as allowing install/remove of guard regions when that flag is set.

Lorenzo Stoakes Feb. 18, 2025, 5:20 p.m. UTC | #8

On Tue, Feb 18, 2025 at 06:14:00PM +0100, David Hildenbrand wrote:
> On 18.02.25 17:43, Lorenzo Stoakes wrote:
> > On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote:
> > > > Right yeah that'd be super weird. And I don't want to add that logic.
> > > >
> > > > > Also not sure what happens if one does an mlock()/mlockall() after
> > > > > already installing PTE markers.
> > > >
> > > > The existing logic already handles non-present cases by skipping them, in
> > > > mlock_pte_range():
> > > >
> > > > 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> > > > 		ptent = ptep_get(pte);
> > > > 		if (!pte_present(ptent))
> > > > 			continue;
> > > >
> > > > 		...
> > > > 	}
> > >
> > > I *think* that code only updates already-mapped folios, to properly call
> > > mlock_folio()/munlock_folio().
> >
> > Guard regions _are_ 'already mapped' :) so it leaves them in place.
>
> "mapped folios" -- there is no folio mapped. Yes, the VMA is in place.

We're engaging in a moot discussion on this I think but I mean it appears
to operate by walking page tables if they are populated, which they will be
for guard regions, but when it finds it's non-present it will skip.

This amounts to the same thing as not doing anything, obviously.

>
> >
> > do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range()
> > implies this will be invoked.
>
> Yes, to process any already mapped folios, to then continue population
> later.
>
> >
> > >
> > > It is not the code that populates pages on mlock()/mlockall(). I think all
> > > that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should
> > > apply.
> >
> > OK I want to correct what I said earlier.
> >
> > Installing a guard region then attempting mlock() will result in an error. The
> > populate will -EFAULT and stop at the guard region, which causes mlock() to
> > error out.
>
> Right, that's my expectation.

OK good!

>
> >
> > This is a partial failure, so the VMA is split and has VM_LOCKED applied, but
> > the populate halts at the guard region.
> >
> > This is ok as per previous discussion on aggregate operation failure, there can
> > be no expectation of 'unwinding' of partially successful operations that form
> > part of a requested aggregate one.
> >
> > However, given there's stuff to clean up, and on error a user _may_ wish to then
> > remove guard regions and try again, I guess there's no harm in keeping the code
> > as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place.
>
> Likely yes; it's all weird code.
>
> >
> > >
> > > See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling.
> >
> > Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly'
> > yeah wrongly, very wrongly...
> >
> > >
> > > >
> > > > Which covers off guard regions. Removing the guard regions after this will
> > > > leave you in a weird situation where these entries will be zapped... maybe
> > > > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case
> > > > also populate?
> > >
> > > Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED.
> >
> > See above, no we should not :P this is only good for cleanup after mlock()
> > failure, although no sane program should really be trying to do this, a sane
> > program would give up here (and it's a _programmatic error_ to try to mlock() a
> > range with guard regions).
> >>>> Somme apps use mlockall(), and it might be nice to just be able to use
> guard
> > > pages as if "Nothing happened".
> >
> > Sadly I think not given above :P
>
> QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when
> requested to then exit(); if it fails.

Hm under what circumstances? I use qemu extensively to test this stuff with
no issues. Unless you mean it's using it in the 'host' code somehow.

>
> Assume glibc or any lib uses it, QEMU would have no real way of figuring
> that out or instructing offending libraries to disabled that, at least for
> now  ...
>
> ... turning RT VMs less usable if any library uses guard regions. :(
>

This seems really stupid, to be honest. Unfortunately there's no way around
this, if software does stupid things then they get stupid prizes. There are
other ways mlock() and faulting in can fail too.

> There is upcoming support for MCL_ONFAULT in QEMU [1] (see below).

Good.

>
> [1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com
>
> >
> > >
> > > E.g., QEMU has the option to use mlockall().
> > >
> > > >
> > > > Then again we're currently asymmetric as you can add them _before_
> > > > mlock()'ing...
> > >
> > > Right.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > I think the _LOCKED idea is therefore kaput, because it just won't work
> > properly because populating guard regions fails.
>
> Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT might
> be interesting, because we are skipping the population stage.
>
> >
> > It fails because it tries to 'touch' the memory, but 'touching' guard
> > region memory causes a segfault. This kind of breaks the idea of
> > mlock()'ing guard regions.
> >
> > I think adding workarounds to make this possible in any way is not really
> > worth it (and would probably be pretty gross).
> >
> > We already document that 'mlock()ing lightweight guard regions will fail'
> > as per man page so this is all in line with that.
>
> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
> allowing install/remove of guard regions when that flag is set.

We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
disallow.

>
> --
> Cheers,
>
> David / dhildenb
>

David Hildenbrand Feb. 18, 2025, 5:25 p.m. UTC | #9

>>
>> QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when
>> requested to then exit(); if it fails.
> 
> Hm under what circumstances? I use qemu extensively to test this stuff with
> no issues. Unless you mean it's using it in the 'host' code somehow.


-overcommit mem-lock=on

or (legacy)

-realtime mlock=on

I think.

[...]

>>>
>>> It fails because it tries to 'touch' the memory, but 'touching' guard
>>> region memory causes a segfault. This kind of breaks the idea of
>>> mlock()'ing guard regions.
>>>
>>> I think adding workarounds to make this possible in any way is not really
>>> worth it (and would probably be pretty gross).
>>>
>>> We already document that 'mlock()ing lightweight guard regions will fail'
>>> as per man page so this is all in line with that.
>>
>> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
>> allowing install/remove of guard regions when that flag is set.
> 
> We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
> disallow.


See mlock2();

SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
{
	vm_flags_t vm_flags = VM_LOCKED;

	if (flags & ~MLOCK_ONFAULT)
		return -EINVAL;

	if (flags & MLOCK_ONFAULT)
		vm_flags |= VM_LOCKONFAULT;

	return do_mlock(start, len, vm_flags);
}


VM_LOCKONFAULT always as VM_LOCKED set as well.

Lorenzo Stoakes Feb. 18, 2025, 5:28 p.m. UTC | #10

On Tue, Feb 18, 2025 at 06:25:35PM +0100, David Hildenbrand wrote:
> > >
> > > QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when
> > > requested to then exit(); if it fails.
> >
> > Hm under what circumstances? I use qemu extensively to test this stuff with
> > no issues. Unless you mean it's using it in the 'host' code somehow.
>
>
> -overcommit mem-lock=on
>
> or (legacy)
>
> -realtime mlock=on
>
> I think.
>
> [...]

Thanks

>
> > > >
> > > > It fails because it tries to 'touch' the memory, but 'touching' guard
> > > > region memory causes a segfault. This kind of breaks the idea of
> > > > mlock()'ing guard regions.
> > > >
> > > > I think adding workarounds to make this possible in any way is not really
> > > > worth it (and would probably be pretty gross).
> > > >
> > > > We already document that 'mlock()ing lightweight guard regions will fail'
> > > > as per man page so this is all in line with that.
> > >
> > > Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as
> > > allowing install/remove of guard regions when that flag is set.
> >
> > We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we
> > disallow.
>
>
> See mlock2();
>
> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
> {
> 	vm_flags_t vm_flags = VM_LOCKED;
>
> 	if (flags & ~MLOCK_ONFAULT)
> 		return -EINVAL;
>
> 	if (flags & MLOCK_ONFAULT)
> 		vm_flags |= VM_LOCKONFAULT;
>
> 	return do_mlock(start, len, vm_flags);
> }
>
>
> VM_LOCKONFAULT always as VM_LOCKED set as well.

OK cool, that makes sense.

As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
again, then... :P if only somebody would write it down in a book...

Yeah then that makes sense to check explicitly for (VM_LOCKED | VM_LOCKONFAULT)
in any MADV_GUARD_INSTALL_LOCKED variant as obviously this would be passively
excluded right now.

>
> --
> Cheers,
>
> David / dhildenb
>

David Hildenbrand Feb. 18, 2025, 5:31 p.m. UTC | #11

>> See mlock2();
>>
>> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
>> {
>> 	vm_flags_t vm_flags = VM_LOCKED;
>>
>> 	if (flags & ~MLOCK_ONFAULT)
>> 		return -EINVAL;
>>
>> 	if (flags & MLOCK_ONFAULT)
>> 		vm_flags |= VM_LOCKONFAULT;
>>
>> 	return do_mlock(start, len, vm_flags);
>> }
>>
>>
>> VM_LOCKONFAULT always as VM_LOCKED set as well.
> 
> OK cool, that makes sense.
> 
> As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew
> again, then... :P if only somebody would write it down in a book...

And if there would be such a book in the makings, I would consider 
hurrying up and preordering it ;)

Kalesh Singh Feb. 19, 2025, 8:25 a.m. UTC | #12

On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The guard regions feature was initially implemented to support anonymous
> mappings only, excluding shmem.
>
> This was done such as to introduce the feature carefully and incrementally
> and to be conservative when considering the various caveats and corner
> cases that are applicable to file-backed mappings but not to anonymous
> ones.
>
> Now this feature has landed in 6.13, it is time to revisit this and to
> extend this functionality to file-backed and shmem mappings.
>
> In order to make this maximally useful, and since one may map file-backed
> mappings read-only (for instance ELF images), we also remove the
> restriction on read-only mappings and permit the establishment of guard
> regions in any non-hugetlb, non-mlock()'d mapping.

Hi Lorenzo,

Thank you for your work on this.

Have we thought about how guard regions are represented in /proc/*/[s]maps?

In the field, I've found that many applications read the ranges from
/proc/self/[s]maps to determine what they can access (usually related
to obfuscation techniques). If they don't know of the guard regions it
would cause them to crash; I think that we'll need similar entries to
PROT_NONE (---p) for these, and generally to maintain consistency
between the behavior and what is being said from /proc/*/[s]maps.

-- Kalesh

>
> It is permissible to permit the establishment of guard regions in read-only
> mappings because the guard regions only reduce access to the mapping, and
> when removed simply reinstate the existing attributes of the underlying
> VMA, meaning no access violations can occur.
>
> While the change in kernel code introduced in this series is small, the
> majority of the effort here is spent in extending the testing to assert
> that the feature works correctly across numerous file-backed mapping
> scenarios.
>
> Every single guard region self-test performed against anonymous memory
> (which is relevant and not anon-only) has now been updated to also be
> performed against shmem and a mapping of a file in the working directory.
>
> This confirms that all cases also function correctly for file-backed guard
> regions.
>
> In addition a number of other tests are added for specific file-backed
> mapping scenarios.
>
> There are a number of other concerns that one might have with regard to
> guard regions, addressed below:
>
> Readahead
> ~~~~~~~~~
>
> Readahead is a process through which the page cache is populated on the
> assumption that sequential reads will occur, thus amortising I/O and,
> through a clever use of the PG_readahead folio flag establishing during
> major fault and checked upon minor fault, provides for asynchronous I/O to
> occur as dat is processed, reducing I/O stalls as data is faulted in.
>
> Guard regions do not alter this mechanism which operations at the folio and
> fault level, but do of course prevent the faulting of folios that would
> otherwise be mapped.
>
> In the instance of a major fault prior to a guard region, synchronous
> readahead will occur including populating folios in the page cache which
> the guard regions will, in the case of the mapping in question, prevent
> access to.
>
> In addition, if PG_readahead is placed in a folio that is now inaccessible,
> this will prevent asynchronous readahead from occurring as it would
> otherwise do.
>
> However, there are mechanisms for heuristically resetting this within
> readahead regardless, which will 'recover' correct readahead behaviour.
>
> Readahead presumes sequential data access, the presence of a guard region
> clearly indicates that, at least in the guard region, no such sequential
> access will occur, as it cannot occur there.
>
> So this should have very little impact on any real workload. The far more
> important point is as to whether readahead causes incorrect or
> inappropriate mapping of ranges disallowed by the presence of guard
> regions - this is not the case, as readahead does not 'pre-fault' memory in
> this fashion.
>
> At any rate, any mechanism which would attempt to do so would hit the usual
> page fault paths, which correctly handle PTE markers as with anonymous
> mappings.
>
> Fault-Around
> ~~~~~~~~~~~~
>
> The fault-around logic, in a similar vein to readahead, attempts to improve
> efficiency with regard to file-backed memory mappings, however it differs
> in that it does not try to fetch folios into the page cache that are about
> to be accessed, but rather pre-maps a range of folios around the faulting
> address.
>
> Guard regions making use of PTE markers makes this relatively trivial, as
> this case is already handled - see filemap_map_folio_range() and
> filemap_map_order0_folio() - in both instances, the solution is to simply
> keep the established page table mappings and let the fault handler take
> care of PTE markers, as per the comment:
>
>         /*
>          * NOTE: If there're PTE markers, we'll leave them to be
>          * handled in the specific fault path, and it'll prohibit
>          * the fault-around logic.
>          */
>
> This works, as establishing guard regions results in page table mappings
> with PTE markers, and clearing them removes them.
>
> Truncation
> ~~~~~~~~~~
>
> File truncation will not eliminate existing guard regions, as the
> truncation operation will ultimately zap the range via
> unmap_mapping_range(), which specifically excludes PTE markers.
>
> Zapping
> ~~~~~~~
>
> Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> which specifically deals with guard entries, leaving them intact except in
> instances such as process teardown or munmap() where they need to be
> removed.
>
> Reclaim
> ~~~~~~~
>
> When reclaim is performed on file-backed folios, it ultimately invokes
> try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> will ultimately abort the operation for the guard region mapping. If large,
> then check_pte() will determine that this is a non-device private
> entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> that instance.
>
> Therefore, no odd things happen in the instance of reclaim being attempted
> upon a file-backed guard region.
>
> Hole Punching
> ~~~~~~~~~~~~~
>
> This updates the page cache and ultimately invokes unmap_mapping_range(),
> which explicitly leaves PTE markers in place.
>
> Because the establishment of guard regions zapped any existing mappings to
> file-backed folios, once the guard regions are removed then the
> hole-punched region will be faulted in as usual and everything will behave
> as expected.
>
> Lorenzo Stoakes (4):
>   mm: allow guard regions in file-backed and read-only mappings
>   selftests/mm: rename guard-pages to guard-regions
>   tools/selftests: expand all guard region tests to file-backed
>   tools/selftests: add file/shmem-backed mapping guard region tests
>
>  mm/madvise.c                                  |   8 +-
>  tools/testing/selftests/mm/.gitignore         |   2 +-
>  tools/testing/selftests/mm/Makefile           |   2 +-
>  .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
>  4 files changed, 821 insertions(+), 112 deletions(-)
>  rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
>
> --
> 2.48.1

Kalesh Singh Feb. 19, 2025, 8:35 a.m. UTC | #13

On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > The guard regions feature was initially implemented to support anonymous
> > mappings only, excluding shmem.
> >
> > This was done such as to introduce the feature carefully and incrementally
> > and to be conservative when considering the various caveats and corner
> > cases that are applicable to file-backed mappings but not to anonymous
> > ones.
> >
> > Now this feature has landed in 6.13, it is time to revisit this and to
> > extend this functionality to file-backed and shmem mappings.
> >
> > In order to make this maximally useful, and since one may map file-backed
> > mappings read-only (for instance ELF images), we also remove the
> > restriction on read-only mappings and permit the establishment of guard
> > regions in any non-hugetlb, non-mlock()'d mapping.
>
> Hi Lorenzo,
>
> Thank you for your work on this.
>
> Have we thought about how guard regions are represented in /proc/*/[s]maps?
>
> In the field, I've found that many applications read the ranges from
> /proc/self/[s]maps to determine what they can access (usually related
> to obfuscation techniques). If they don't know of the guard regions it
> would cause them to crash; I think that we'll need similar entries to
> PROT_NONE (---p) for these, and generally to maintain consistency
> between the behavior and what is being said from /proc/*/[s]maps.

To clarify why the applications may not be aware of their guard
regions -- in the case of the ELF mappings these PROT_NONE (guard
regions) would be installed by the dynamic loader; or may be inherited
from the parent (zygote in Android's case).

>
> -- Kalesh
>
> >
> > It is permissible to permit the establishment of guard regions in read-only
> > mappings because the guard regions only reduce access to the mapping, and
> > when removed simply reinstate the existing attributes of the underlying
> > VMA, meaning no access violations can occur.
> >
> > While the change in kernel code introduced in this series is small, the
> > majority of the effort here is spent in extending the testing to assert
> > that the feature works correctly across numerous file-backed mapping
> > scenarios.
> >
> > Every single guard region self-test performed against anonymous memory
> > (which is relevant and not anon-only) has now been updated to also be
> > performed against shmem and a mapping of a file in the working directory.
> >
> > This confirms that all cases also function correctly for file-backed guard
> > regions.
> >
> > In addition a number of other tests are added for specific file-backed
> > mapping scenarios.
> >
> > There are a number of other concerns that one might have with regard to
> > guard regions, addressed below:
> >
> > Readahead
> > ~~~~~~~~~
> >
> > Readahead is a process through which the page cache is populated on the
> > assumption that sequential reads will occur, thus amortising I/O and,
> > through a clever use of the PG_readahead folio flag establishing during
> > major fault and checked upon minor fault, provides for asynchronous I/O to
> > occur as dat is processed, reducing I/O stalls as data is faulted in.
> >
> > Guard regions do not alter this mechanism which operations at the folio and
> > fault level, but do of course prevent the faulting of folios that would
> > otherwise be mapped.
> >
> > In the instance of a major fault prior to a guard region, synchronous
> > readahead will occur including populating folios in the page cache which
> > the guard regions will, in the case of the mapping in question, prevent
> > access to.
> >
> > In addition, if PG_readahead is placed in a folio that is now inaccessible,
> > this will prevent asynchronous readahead from occurring as it would
> > otherwise do.
> >
> > However, there are mechanisms for heuristically resetting this within
> > readahead regardless, which will 'recover' correct readahead behaviour.
> >
> > Readahead presumes sequential data access, the presence of a guard region
> > clearly indicates that, at least in the guard region, no such sequential
> > access will occur, as it cannot occur there.
> >
> > So this should have very little impact on any real workload. The far more
> > important point is as to whether readahead causes incorrect or
> > inappropriate mapping of ranges disallowed by the presence of guard
> > regions - this is not the case, as readahead does not 'pre-fault' memory in
> > this fashion.
> >
> > At any rate, any mechanism which would attempt to do so would hit the usual
> > page fault paths, which correctly handle PTE markers as with anonymous
> > mappings.
> >
> > Fault-Around
> > ~~~~~~~~~~~~
> >
> > The fault-around logic, in a similar vein to readahead, attempts to improve
> > efficiency with regard to file-backed memory mappings, however it differs
> > in that it does not try to fetch folios into the page cache that are about
> > to be accessed, but rather pre-maps a range of folios around the faulting
> > address.
> >
> > Guard regions making use of PTE markers makes this relatively trivial, as
> > this case is already handled - see filemap_map_folio_range() and
> > filemap_map_order0_folio() - in both instances, the solution is to simply
> > keep the established page table mappings and let the fault handler take
> > care of PTE markers, as per the comment:
> >
> >         /*
> >          * NOTE: If there're PTE markers, we'll leave them to be
> >          * handled in the specific fault path, and it'll prohibit
> >          * the fault-around logic.
> >          */
> >
> > This works, as establishing guard regions results in page table mappings
> > with PTE markers, and clearing them removes them.
> >
> > Truncation
> > ~~~~~~~~~~
> >
> > File truncation will not eliminate existing guard regions, as the
> > truncation operation will ultimately zap the range via
> > unmap_mapping_range(), which specifically excludes PTE markers.
> >
> > Zapping
> > ~~~~~~~
> >
> > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> > which specifically deals with guard entries, leaving them intact except in
> > instances such as process teardown or munmap() where they need to be
> > removed.
> >
> > Reclaim
> > ~~~~~~~
> >
> > When reclaim is performed on file-backed folios, it ultimately invokes
> > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> > will ultimately abort the operation for the guard region mapping. If large,
> > then check_pte() will determine that this is a non-device private
> > entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> > that instance.
> >
> > Therefore, no odd things happen in the instance of reclaim being attempted
> > upon a file-backed guard region.
> >
> > Hole Punching
> > ~~~~~~~~~~~~~
> >
> > This updates the page cache and ultimately invokes unmap_mapping_range(),
> > which explicitly leaves PTE markers in place.
> >
> > Because the establishment of guard regions zapped any existing mappings to
> > file-backed folios, once the guard regions are removed then the
> > hole-punched region will be faulted in as usual and everything will behave
> > as expected.
> >
> > Lorenzo Stoakes (4):
> >   mm: allow guard regions in file-backed and read-only mappings
> >   selftests/mm: rename guard-pages to guard-regions
> >   tools/selftests: expand all guard region tests to file-backed
> >   tools/selftests: add file/shmem-backed mapping guard region tests
> >
> >  mm/madvise.c                                  |   8 +-
> >  tools/testing/selftests/mm/.gitignore         |   2 +-
> >  tools/testing/selftests/mm/Makefile           |   2 +-
> >  .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
> >  4 files changed, 821 insertions(+), 112 deletions(-)
> >  rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
> >
> > --
> > 2.48.1

Lorenzo Stoakes Feb. 19, 2025, 9:15 a.m. UTC | #14

On Wed, Feb 19, 2025 at 12:35:12AM -0800, Kalesh Singh wrote:
> On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote:
> >
> > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > The guard regions feature was initially implemented to support anonymous
> > > mappings only, excluding shmem.
> > >
> > > This was done such as to introduce the feature carefully and incrementally
> > > and to be conservative when considering the various caveats and corner
> > > cases that are applicable to file-backed mappings but not to anonymous
> > > ones.
> > >
> > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > extend this functionality to file-backed and shmem mappings.
> > >
> > > In order to make this maximally useful, and since one may map file-backed
> > > mappings read-only (for instance ELF images), we also remove the
> > > restriction on read-only mappings and permit the establishment of guard
> > > regions in any non-hugetlb, non-mlock()'d mapping.
> >
> > Hi Lorenzo,
> >
> > Thank you for your work on this.
> >
> > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > In the field, I've found that many applications read the ranges from
> > /proc/self/[s]maps to determine what they can access (usually related
> > to obfuscation techniques). If they don't know of the guard regions it
> > would cause them to crash; I think that we'll need similar entries to
> > PROT_NONE (---p) for these, and generally to maintain consistency
> > between the behavior and what is being said from /proc/*/[s]maps.
>
> To clarify why the applications may not be aware of their guard
> regions -- in the case of the ELF mappings these PROT_NONE (guard
> regions) would be installed by the dynamic loader; or may be inherited
> from the parent (zygote in Android's case).

I'm still really confused as to how you propose to use guard regions in
this scenario. If it's trying to 'clone' some set of mappings somehow then
guard regions are just not compatible with your use case at all.

I'm not sure what would be honestly... you maybe could find a way to modify
the loader to interpret PROT_NONE mappings as a cue to use guard regions
instead if that worked?

You can't both not have a VMA (something that describes mapping with common
attributes) and have page table state along with somehow having something
that describes a mapping with said page table state :)

It's a sort of non-such-beast.

Everything is a trade-off, guard regions provide the option to retain state
regarding faulting behaviour in page tables rather the VMAs, entirely
opt-in and at the behest of the user.

Now extended to file-backed and shmem!

While I hoped it would help your scenario based on your talk at LPC, this
is very clearly an important change for guard regions (for shmem most
notably) and removes an important limitation and exposes the possibility of
using it for things such as ELF mappings for those who don't also require
specific [s]maps state (which explicitly NEEDS VMA metadata state).

I suspect you can find a way of making this work in your scenario
though.

An alternative if you're willing to wait for page table state to be
acquired is to expose a new /proc/$pid/... endpoint that explicitly outputs
guard regions.

Would this solve your problem...? I had always considered implementing this
if there was a demand.

Maybe /proc/$pid/gmaps? :P 'g' could mean guard regions or... ;)

>
> >
> > -- Kalesh
> >
> > >
> > > It is permissible to permit the establishment of guard regions in read-only
> > > mappings because the guard regions only reduce access to the mapping, and
> > > when removed simply reinstate the existing attributes of the underlying
> > > VMA, meaning no access violations can occur.
> > >
> > > While the change in kernel code introduced in this series is small, the
> > > majority of the effort here is spent in extending the testing to assert
> > > that the feature works correctly across numerous file-backed mapping
> > > scenarios.
> > >
> > > Every single guard region self-test performed against anonymous memory
> > > (which is relevant and not anon-only) has now been updated to also be
> > > performed against shmem and a mapping of a file in the working directory.
> > >
> > > This confirms that all cases also function correctly for file-backed guard
> > > regions.
> > >
> > > In addition a number of other tests are added for specific file-backed
> > > mapping scenarios.
> > >
> > > There are a number of other concerns that one might have with regard to
> > > guard regions, addressed below:
> > >
> > > Readahead
> > > ~~~~~~~~~
> > >
> > > Readahead is a process through which the page cache is populated on the
> > > assumption that sequential reads will occur, thus amortising I/O and,
> > > through a clever use of the PG_readahead folio flag establishing during
> > > major fault and checked upon minor fault, provides for asynchronous I/O to
> > > occur as dat is processed, reducing I/O stalls as data is faulted in.
> > >
> > > Guard regions do not alter this mechanism which operations at the folio and
> > > fault level, but do of course prevent the faulting of folios that would
> > > otherwise be mapped.
> > >
> > > In the instance of a major fault prior to a guard region, synchronous
> > > readahead will occur including populating folios in the page cache which
> > > the guard regions will, in the case of the mapping in question, prevent
> > > access to.
> > >
> > > In addition, if PG_readahead is placed in a folio that is now inaccessible,
> > > this will prevent asynchronous readahead from occurring as it would
> > > otherwise do.
> > >
> > > However, there are mechanisms for heuristically resetting this within
> > > readahead regardless, which will 'recover' correct readahead behaviour.
> > >
> > > Readahead presumes sequential data access, the presence of a guard region
> > > clearly indicates that, at least in the guard region, no such sequential
> > > access will occur, as it cannot occur there.
> > >
> > > So this should have very little impact on any real workload. The far more
> > > important point is as to whether readahead causes incorrect or
> > > inappropriate mapping of ranges disallowed by the presence of guard
> > > regions - this is not the case, as readahead does not 'pre-fault' memory in
> > > this fashion.
> > >
> > > At any rate, any mechanism which would attempt to do so would hit the usual
> > > page fault paths, which correctly handle PTE markers as with anonymous
> > > mappings.
> > >
> > > Fault-Around
> > > ~~~~~~~~~~~~
> > >
> > > The fault-around logic, in a similar vein to readahead, attempts to improve
> > > efficiency with regard to file-backed memory mappings, however it differs
> > > in that it does not try to fetch folios into the page cache that are about
> > > to be accessed, but rather pre-maps a range of folios around the faulting
> > > address.
> > >
> > > Guard regions making use of PTE markers makes this relatively trivial, as
> > > this case is already handled - see filemap_map_folio_range() and
> > > filemap_map_order0_folio() - in both instances, the solution is to simply
> > > keep the established page table mappings and let the fault handler take
> > > care of PTE markers, as per the comment:
> > >
> > >         /*
> > >          * NOTE: If there're PTE markers, we'll leave them to be
> > >          * handled in the specific fault path, and it'll prohibit
> > >          * the fault-around logic.
> > >          */
> > >
> > > This works, as establishing guard regions results in page table mappings
> > > with PTE markers, and clearing them removes them.
> > >
> > > Truncation
> > > ~~~~~~~~~~
> > >
> > > File truncation will not eliminate existing guard regions, as the
> > > truncation operation will ultimately zap the range via
> > > unmap_mapping_range(), which specifically excludes PTE markers.
> > >
> > > Zapping
> > > ~~~~~~~
> > >
> > > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
> > > which specifically deals with guard entries, leaving them intact except in
> > > instances such as process teardown or munmap() where they need to be
> > > removed.
> > >
> > > Reclaim
> > > ~~~~~~~
> > >
> > > When reclaim is performed on file-backed folios, it ultimately invokes
> > > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
> > > will ultimately abort the operation for the guard region mapping. If large,
> > > then check_pte() will determine that this is a non-device private
> > > entry/device-exclusive entry 'swap' PTE and thus abort the operation in
> > > that instance.
> > >
> > > Therefore, no odd things happen in the instance of reclaim being attempted
> > > upon a file-backed guard region.
> > >
> > > Hole Punching
> > > ~~~~~~~~~~~~~
> > >
> > > This updates the page cache and ultimately invokes unmap_mapping_range(),
> > > which explicitly leaves PTE markers in place.
> > >
> > > Because the establishment of guard regions zapped any existing mappings to
> > > file-backed folios, once the guard regions are removed then the
> > > hole-punched region will be faulted in as usual and everything will behave
> > > as expected.
> > >
> > > Lorenzo Stoakes (4):
> > >   mm: allow guard regions in file-backed and read-only mappings
> > >   selftests/mm: rename guard-pages to guard-regions
> > >   tools/selftests: expand all guard region tests to file-backed
> > >   tools/selftests: add file/shmem-backed mapping guard region tests
> > >
> > >  mm/madvise.c                                  |   8 +-
> > >  tools/testing/selftests/mm/.gitignore         |   2 +-
> > >  tools/testing/selftests/mm/Makefile           |   2 +-
> > >  .../mm/{guard-pages.c => guard-regions.c}     | 921 ++++++++++++++++--
> > >  4 files changed, 821 insertions(+), 112 deletions(-)
> > >  rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
> > >
> > > --
> > > 2.48.1

Lorenzo Stoakes Feb. 19, 2025, 9:17 a.m. UTC | #15

On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > The guard regions feature was initially implemented to support anonymous
> > > > mappings only, excluding shmem.
> > > >
> > > > This was done such as to introduce the feature carefully and incrementally
> > > > and to be conservative when considering the various caveats and corner
> > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > ones.
> > > >
> > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > extend this functionality to file-backed and shmem mappings.
> > > >
> > > > In order to make this maximally useful, and since one may map file-backed
> > > > mappings read-only (for instance ELF images), we also remove the
> > > > restriction on read-only mappings and permit the establishment of guard
> > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > >
> > > Hi Lorenzo,
> > >
> > > Thank you for your work on this.
> >
> > You're welcome.
> >
> > >
> > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > This is off-topic here but... Yes, extensively. No they do not appear
> > there.
> >
> > I thought you had attended LPC and my talk where I mentioned this
> > purposefully as a drawback?
> >
> > I went out of my way to advertise this limitation at the LPC talk, in the
> > original series, etc. so it's a little disappointing that this is being
> > brought up so late, but nobody else has raised objections to this issue so
> > I think in general it's not a limitation that matters in practice.
> >
> > >
> > > In the field, I've found that many applications read the ranges from
> > > /proc/self/[s]maps to determine what they can access (usually related
> > > to obfuscation techniques). If they don't know of the guard regions it
> > > would cause them to crash; I think that we'll need similar entries to
> > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > between the behavior and what is being said from /proc/*/[s]maps.
> >
> > No, we cannot have these, sorry.
> >
> > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > feature is to avoid having to accumulate VMAs for regions which are not
> > intended to be accessible.
> >
> > Secondly, there is no practical means for this to be accomplished in
> > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > they have guard regions.
> >
> > This is intentional, because setting such metadata is simply not practical
> > - why? Because when you try to split the VMA, how do you know which bit
> > gets the metadata and which doesn't? You can't without _reading page
> > tables_.
> >
> > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > VMAs exist when they don't, this would be completely inaccurate, would
> > break assumptions for things like mremap (which require a single VMA) and
> > would be unworkable.
> >
> > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > saying 'hey this region has guard regions somewhere'.
>
> And then simply expose it in /proc/$pid/pagemap, which is a better interface
> for this pte-level information inside of VMAs. We should still have a spare
> bit for that purpose in the pagemap entries.

Ah yeah thanks David forgot about that!

This is also a possibility if that'd solve your problems Kalesh?

This bit will be fought over haha

>
> --
> Cheers,
>
> David / dhildenb
>

Liam R. Howlett Feb. 19, 2025, 5:32 p.m. UTC | #16

* Kalesh Singh <kaleshsingh@google.com> [250219 03:35]:
> On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote:
> >
> > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > The guard regions feature was initially implemented to support anonymous
> > > mappings only, excluding shmem.
> > >
> > > This was done such as to introduce the feature carefully and incrementally
> > > and to be conservative when considering the various caveats and corner
> > > cases that are applicable to file-backed mappings but not to anonymous
> > > ones.
> > >
> > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > extend this functionality to file-backed and shmem mappings.
> > >
> > > In order to make this maximally useful, and since one may map file-backed
> > > mappings read-only (for instance ELF images), we also remove the
> > > restriction on read-only mappings and permit the establishment of guard
> > > regions in any non-hugetlb, non-mlock()'d mapping.
> >
> > Hi Lorenzo,
> >
> > Thank you for your work on this.
> >
> > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> >
> > In the field, I've found that many applications read the ranges from
> > /proc/self/[s]maps to determine what they can access (usually related
> > to obfuscation techniques). If they don't know of the guard regions it
> > would cause them to crash; I think that we'll need similar entries to
> > PROT_NONE (---p) for these, and generally to maintain consistency
> > between the behavior and what is being said from /proc/*/[s]maps.
> 
> To clarify why the applications may not be aware of their guard
> regions -- in the case of the ELF mappings these PROT_NONE (guard
> regions) would be installed by the dynamic loader; or may be inherited
> from the parent (zygote in Android's case).

The guard regions are a method to reduce vma count and speed up guard
installation and removal.  This very much will change the proc interface
by its design, since it's removing information from the maps for the
advantage of speed and less resources.  I thought that was pretty clear
from the start.

The proc interface is also not a very good way to programmatically get
this information, as I'm sure you are aware.  I'm also sure you know
that reading that file takes the mmap read lock, and tearing can occur.

I think this implies you are taking a lot of time to get the information
you want in the way you are getting it (parsing a text file, and not
doing any mmap write work)?

If this is a common pattern, I think we need a better interface to query
this type of information.  We have an ioctl going in for getting vma
information, but that will lack these new guard regions as well.

David mentioned /proc/self/pagemap - that's certainly worth exploring,
and doesn't involve text parsing.

If it's not that common, then maybe your zygote/loader can communicate
the information in a way that does not involve read locking the programs
vm area?

Are you really parsing the same library information for each application
launched to find the guards instead of asking or being told what they
are?  I must be missing something..

Thanks,
Liam

Kalesh Singh Feb. 19, 2025, 6:52 p.m. UTC | #17

On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> > On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > The guard regions feature was initially implemented to support anonymous
> > > > > mappings only, excluding shmem.
> > > > >
> > > > > This was done such as to introduce the feature carefully and incrementally
> > > > > and to be conservative when considering the various caveats and corner
> > > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > > ones.
> > > > >
> > > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > > extend this functionality to file-backed and shmem mappings.
> > > > >
> > > > > In order to make this maximally useful, and since one may map file-backed
> > > > > mappings read-only (for instance ELF images), we also remove the
> > > > > restriction on read-only mappings and permit the establishment of guard
> > > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > > >
> > > > Hi Lorenzo,
> > > >
> > > > Thank you for your work on this.
> > >
> > > You're welcome.
> > >
> > > >
> > > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> > >
> > > This is off-topic here but... Yes, extensively. No they do not appear
> > > there.
> > >
> > > I thought you had attended LPC and my talk where I mentioned this
> > > purposefully as a drawback?
> > >
> > > I went out of my way to advertise this limitation at the LPC talk, in the
> > > original series, etc. so it's a little disappointing that this is being
> > > brought up so late, but nobody else has raised objections to this issue so
> > > I think in general it's not a limitation that matters in practice.
> > >

Sorry for raising this now, yes at the time I believe we discussed
reducing the vma slab memory usage for the PROT_NONE mappings. I
didn't imagine that apps could have dependencies on the mapped ELF
ranges in /proc/self/[s]maps until recent breakages from a similar
feature. Android itself doesn't depend on this but what I've seen is
banking apps and apps that have obfuscation to prevent reverse
engineering (the particulars of such obfuscation are a black box).

> > > >
> > > > In the field, I've found that many applications read the ranges from
> > > > /proc/self/[s]maps to determine what they can access (usually related
> > > > to obfuscation techniques). If they don't know of the guard regions it
> > > > would cause them to crash; I think that we'll need similar entries to
> > > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > > between the behavior and what is being said from /proc/*/[s]maps.
> > >
> > > No, we cannot have these, sorry.
> > >
> > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > > feature is to avoid having to accumulate VMAs for regions which are not
> > > intended to be accessible.
> > >
> > > Secondly, there is no practical means for this to be accomplished in
> > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > > they have guard regions.
> > >
> > > This is intentional, because setting such metadata is simply not practical
> > > - why? Because when you try to split the VMA, how do you know which bit
> > > gets the metadata and which doesn't? You can't without _reading page
> > > tables_.

Yeah the splitting becomes complicated with any vm flags for this...
meaning any attempt to expose this in /proc/*/maps have to
unconditionally walk the page tables :(

> > >
> > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > > VMAs exist when they don't, this would be completely inaccurate, would
> > > break assumptions for things like mremap (which require a single VMA) and
> > > would be unworkable.
> > >
> > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > > saying 'hey this region has guard regions somewhere'.
> >
> > And then simply expose it in /proc/$pid/pagemap, which is a better interface
> > for this pte-level information inside of VMAs. We should still have a spare
> > bit for that purpose in the pagemap entries.
>
> Ah yeah thanks David forgot about that!
>
> This is also a possibility if that'd solve your problems Kalesh?

I'm not sure what is the correct interface to advertise these. Maybe
smaps as you suggested since we already walk the page tables there?
and pagemap bit for the exact pages as well? It won't solve this
particular issue, as 1000s of in field apps do look at this through
/proc/*/maps. But maybe we have to live with that...

We can argue that such apps are broken since they may trip on the
SIGBUS off the end of the file -- usually this isn't the case for the
ELF segment mappings.

This is still useful for other cases, I just wanted to get some ideas
if this can be extended to further use cases.

Thanks,
Kalesh


>
> This bit will be fought over haha
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

Kalesh Singh Feb. 19, 2025, 8:56 p.m. UTC | #18

On Wed, Feb 19, 2025 at 11:20 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Feb 19, 2025 at 10:52:04AM -0800, Kalesh Singh wrote:
> > On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote:
> > > > On 19.02.25 10:03, Lorenzo Stoakes wrote:
> > > > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote:
> > > > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes
> > > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > > >
> > > > > > > The guard regions feature was initially implemented to support anonymous
> > > > > > > mappings only, excluding shmem.
> > > > > > >
> > > > > > > This was done such as to introduce the feature carefully and incrementally
> > > > > > > and to be conservative when considering the various caveats and corner
> > > > > > > cases that are applicable to file-backed mappings but not to anonymous
> > > > > > > ones.
> > > > > > >
> > > > > > > Now this feature has landed in 6.13, it is time to revisit this and to
> > > > > > > extend this functionality to file-backed and shmem mappings.
> > > > > > >
> > > > > > > In order to make this maximally useful, and since one may map file-backed
> > > > > > > mappings read-only (for instance ELF images), we also remove the
> > > > > > > restriction on read-only mappings and permit the establishment of guard
> > > > > > > regions in any non-hugetlb, non-mlock()'d mapping.
> > > > > >
> > > > > > Hi Lorenzo,
> > > > > >
> > > > > > Thank you for your work on this.
> > > > >
> > > > > You're welcome.
> > > > >
> > > > > >
> > > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps?
> > > > >
> > > > > This is off-topic here but... Yes, extensively. No they do not appear
> > > > > there.
> > > > >
> > > > > I thought you had attended LPC and my talk where I mentioned this
> > > > > purposefully as a drawback?
> > > > >
> > > > > I went out of my way to advertise this limitation at the LPC talk, in the
> > > > > original series, etc. so it's a little disappointing that this is being
> > > > > brought up so late, but nobody else has raised objections to this issue so
> > > > > I think in general it's not a limitation that matters in practice.
> > > > >
> >
> > Sorry for raising this now, yes at the time I believe we discussed
> > reducing the vma slab memory usage for the PROT_NONE mappings. I
> > didn't imagine that apps could have dependencies on the mapped ELF
> > ranges in /proc/self/[s]maps until recent breakages from a similar
> > feature. Android itself doesn't depend on this but what I've seen is
> > banking apps and apps that have obfuscation to prevent reverse
> > engineering (the particulars of such obfuscation are a black box).
>
> Ack ok fair enough, sorry, but obviously you can understand it's
> frustrating when I went to great lengths to advertise this not only at the
> talk but in the original series.
>
> Really important to have these discussions early. Not that really we can do
> much about this, as inherently this feature cannot give you what you need.
>
> Is it _only_ banking apps that do this? And do they exclusively read
> /proc/$pid/maps? I mean there's nothing we can do about that, sorry.

Not only banking apps but that's a common category.

> If that's immutable, then unless you do your own very, very, very slow custom
> android maps implementation (that will absolutely break the /proc/$pid/maps
> scalability efforts atm) this is just a no-go.
>

Yeah unfortunately that's immutable as app versions are mostly
independent from the OS version.

We do have something that handles this by encoding the guard regions
in the vm_flags, but as you can imagine it's not generic enough for
upstream.

> >
> > > > > >
> > > > > > In the field, I've found that many applications read the ranges from
> > > > > > /proc/self/[s]maps to determine what they can access (usually related
> > > > > > to obfuscation techniques). If they don't know of the guard regions it
> > > > > > would cause them to crash; I think that we'll need similar entries to
> > > > > > PROT_NONE (---p) for these, and generally to maintain consistency
> > > > > > between the behavior and what is being said from /proc/*/[s]maps.
> > > > >
> > > > > No, we cannot have these, sorry.
> > > > >
> > > > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this
> > > > > feature is to avoid having to accumulate VMAs for regions which are not
> > > > > intended to be accessible.
> > > > >
> > > > > Secondly, there is no practical means for this to be accomplished in
> > > > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates
> > > > > they have guard regions.
> > > > >
> > > > > This is intentional, because setting such metadata is simply not practical
> > > > > - why? Because when you try to split the VMA, how do you know which bit
> > > > > gets the metadata and which doesn't? You can't without _reading page
> > > > > tables_.
> >
> > Yeah the splitting becomes complicated with any vm flags for this...
> > meaning any attempt to expose this in /proc/*/maps have to
> > unconditionally walk the page tables :(
>
> It's not really complicated, it's _impossible_ unless you made literally
> all VMA code walk page tables for every single operation. Which we are
> emphatically not going to do :)
>
> And no, /proc/$pid/maps is _never_ going to walk page tables. For obvious
> performance reasons.
>
> >
> > > > >
> > > > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending
> > > > > VMAs exist when they don't, this would be completely inaccurate, would
> > > > > break assumptions for things like mremap (which require a single VMA) and
> > > > > would be unworkable.
> > > > >
> > > > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps
> > > > > saying 'hey this region has guard regions somewhere'.
> > > >
> > > > And then simply expose it in /proc/$pid/pagemap, which is a better interface
> > > > for this pte-level information inside of VMAs. We should still have a spare
> > > > bit for that purpose in the pagemap entries.
> > >
> > > Ah yeah thanks David forgot about that!
> > >
> > > This is also a possibility if that'd solve your problems Kalesh?
> >
> > I'm not sure what is the correct interface to advertise these. Maybe
> > smaps as you suggested since we already walk the page tables there?
> > and pagemap bit for the exact pages as well? It won't solve this
> > particular issue, as 1000s of in field apps do look at this through
> > /proc/*/maps. But maybe we have to live with that...
>
> I mean why are we even considering this if you can't change this anywhere?
> Confused by that.
>
> I'm afraid upstream can't radically change interfaces to suit this
> scenario.
>
> We also can't change smaps in the way you want, it _has_ to still give
> output per VMA information.

Sorry I wasn't suggesting to change the entries in smaps, rather
agreeing to your marker suggestion. Maybe a set of ranges for each
smaps entry that has guards? It doesn't solve the use case, but does
make these regions visible to userspace.

>
> The proposed change that would be there would be a flag or something
> indicating that the VMA has guard regions _SOMEWHERE_ in it.
>
> Since this doesn't solve your problem, adds complexity, and nobody else
> seems to need it, I would suggest this is not worthwhile and I'd rather not
> do this.
>
> Therefore for your needs there are literally only two choices here:
>
> 1. Add a bit to /proc/$pid/pagemap OR
> 2. a new interface.
>
> I am not in favour of a new interface here, if we can just extend pagemap.
>
> What you'd have to do is:
>
> 1. Find virtual ranges via /proc/$pid/maps
> 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges.
>

Could we also consider an smaps field like:

VmGuards: [AAA, BBB), [CCC, DDD), ...

or something of that sort?


> Since anything that would retrieve guard region state would need to walk
> page tables, any approach would be slow and I don't think this would be any
> less slow than any other interface.
>
> This way you'd be able to find all guard regions all the time.
>
> This is just the trade-off for this feature unfortunately - its whole
> design ethos is to allow modification of -faulting- behaviour without
> having to modify -VMA- behaviour.
>
> But if it's banking apps whose code you can't control (surprised you don't
> lock down these interfaces), I mean is this even useful to you?
>
> If your requirement is 'you have to change /proc/$pid/maps to show guard
> regions' I mean the answer is that we can't.
>
> >
> > We can argue that such apps are broken since they may trip on the
> > SIGBUS off the end of the file -- usually this isn't the case for the
> > ELF segment mappings.
>
> Or tearing of the maps interface, or things getting unmapped or or
> or... It's really not a sane thing to do.
>
> >
> > This is still useful for other cases, I just wanted to get some ideas
> > if this can be extended to further use cases.
>
> Well I'm glad that you guys find it useful for _something_ ;)
>
> Again this wasn't written only for you (it is broadly a good feature for
> upstream), but I did have your use case in mind, so I'm a little
> disappointed that it doesn't help, as I like to solve problems.
>
> But I'm glad it solves at least some for you...

I recall Liam had a proposal to store the guard ranges in the maple tree?

I wonder if that can be used in combination with this approach to have
a better representation of this?

>
> >
> > Thanks,
> > Kalesh
> >
> >
> > >
> > > This bit will be fought over haha
> > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David / dhildenb
> > > >

[0/4] mm: permit guard regions for file-backed/shmem mappings

Message

Comments