Message ID | cover.1739469950.git.lorenzo.stoakes@oracle.com |
---|---|
Headers | show |
Series | mm: permit guard regions for file-backed/shmem mappings | expand |
On 2/13/25 19:16, Lorenzo Stoakes wrote: > The guard regions feature was initially implemented to support anonymous > mappings only, excluding shmem. > > This was done such as to introduce the feature carefully and incrementally > and to be conservative when considering the various caveats and corner > cases that are applicable to file-backed mappings but not to anonymous > ones. > > Now this feature has landed in 6.13, it is time to revisit this and to > extend this functionality to file-backed and shmem mappings. > > In order to make this maximally useful, and since one may map file-backed > mappings read-only (for instance ELF images), we also remove the > restriction on read-only mappings and permit the establishment of guard > regions in any non-hugetlb, non-mlock()'d mapping. Do you plan to address mlock later too? I guess somebody might find use for those. Is there some fundamental issue or just that we need to define some good semantics for corner cases? (i.e. if pages are already populated in the mlocked area, discarding them by replacing with guard pte's goes against that, so do we allow it or not?). Otherwise nice discussion of all the potential issues here!
On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote: > On 2/13/25 19:16, Lorenzo Stoakes wrote: > > The guard regions feature was initially implemented to support anonymous > > mappings only, excluding shmem. > > > > This was done such as to introduce the feature carefully and incrementally > > and to be conservative when considering the various caveats and corner > > cases that are applicable to file-backed mappings but not to anonymous > > ones. > > > > Now this feature has landed in 6.13, it is time to revisit this and to > > extend this functionality to file-backed and shmem mappings. > > > > In order to make this maximally useful, and since one may map file-backed > > mappings read-only (for instance ELF images), we also remove the > > restriction on read-only mappings and permit the establishment of guard > > regions in any non-hugetlb, non-mlock()'d mapping. > > Do you plan to address mlock later too? I guess somebody might find use for > those. Is there some fundamental issue or just that we need to define some > good semantics for corner cases? (i.e. if pages are already populated in the > mlocked area, discarding them by replacing with guard pte's goes against > that, so do we allow it or not?). Yeah that's the fundamental issue with mlock, it does not interact with the zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked _afterwards_ there will be no zapping). We could potentially expose an equivalent, as there are for other flags, a _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make this explicit. That is probably the most sensible option, if there is a need for this! > > Otherwise nice discussion of all the potential issues here! > Thanks! :)
On 18.02.25 14:05, Lorenzo Stoakes wrote: > On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote: >> On 2/13/25 19:16, Lorenzo Stoakes wrote: >>> The guard regions feature was initially implemented to support anonymous >>> mappings only, excluding shmem. >>> >>> This was done such as to introduce the feature carefully and incrementally >>> and to be conservative when considering the various caveats and corner >>> cases that are applicable to file-backed mappings but not to anonymous >>> ones. >>> >>> Now this feature has landed in 6.13, it is time to revisit this and to >>> extend this functionality to file-backed and shmem mappings. >>> >>> In order to make this maximally useful, and since one may map file-backed >>> mappings read-only (for instance ELF images), we also remove the >>> restriction on read-only mappings and permit the establishment of guard >>> regions in any non-hugetlb, non-mlock()'d mapping. >> >> Do you plan to address mlock later too? I guess somebody might find use for >> those. Is there some fundamental issue or just that we need to define some >> good semantics for corner cases? (i.e. if pages are already populated in the >> mlocked area, discarding them by replacing with guard pte's goes against >> that, so do we allow it or not?). > > Yeah that's the fundamental issue with mlock, it does not interact with the > zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so > for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked > _afterwards_ there will be no zapping). > > We could potentially expose an equivalent, as there are for other flags, a > _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make > this explicit. > > That is probably the most sensible option, if there is a need for this! mlock is weird, because it assumes that memory will be faulted in in the whole VMA. You'd likely have to populate + mlock the page when removing the guard. Also not sure what happens if one does an mlock()/mlockall() after already installing PTE markers. __mm_populate() would skip whole VMAs in case populate_vma_page_range() fails. And I would assume populate_vma_page_range() fails on the first guard when it triggers a page fault. OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where MADV_DONTNEED_LOCKED originates from: commit 9457056ac426e5ed0671356509c8dcce69f8dee0 Author: Johannes Weiner <hannes@cmpxchg.org> Date: Thu Mar 24 18:14:12 2022 -0700 mm: madvise: MADV_DONTNEED_LOCKED MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT and MCL_ONFAULT allowing to mlock without populating, there are valid use cases for depopulating locked ranges as well. Adding support for that would be indeed nice.
On Tue, Feb 18, 2025 at 03:35:19PM +0100, David Hildenbrand wrote: > On 18.02.25 14:05, Lorenzo Stoakes wrote: > > On Tue, Feb 18, 2025 at 01:12:05PM +0100, Vlastimil Babka wrote: > > > On 2/13/25 19:16, Lorenzo Stoakes wrote: > > > > The guard regions feature was initially implemented to support anonymous > > > > mappings only, excluding shmem. > > > > > > > > This was done such as to introduce the feature carefully and incrementally > > > > and to be conservative when considering the various caveats and corner > > > > cases that are applicable to file-backed mappings but not to anonymous > > > > ones. > > > > > > > > Now this feature has landed in 6.13, it is time to revisit this and to > > > > extend this functionality to file-backed and shmem mappings. > > > > > > > > In order to make this maximally useful, and since one may map file-backed > > > > mappings read-only (for instance ELF images), we also remove the > > > > restriction on read-only mappings and permit the establishment of guard > > > > regions in any non-hugetlb, non-mlock()'d mapping. > > > > > > Do you plan to address mlock later too? I guess somebody might find use for > > > those. Is there some fundamental issue or just that we need to define some > > > good semantics for corner cases? (i.e. if pages are already populated in the > > > mlocked area, discarding them by replacing with guard pte's goes against > > > that, so do we allow it or not?). > > > > Yeah that's the fundamental issue with mlock, it does not interact with the > > zapping part of MADV_GUARD_INSTALL, and that is why we disallow it (but not so > > for MADV_GUARD_REMOVE, as if a VMA that contains guard regions is locked > > _afterwards_ there will be no zapping). > > > > We could potentially expose an equivalent, as there are for other flags, a > > _LOCKED variant of the madvise() flag, like MADV_GUARD_INSTALL_LOCKED to make > > this explicit. > > > > That is probably the most sensible option, if there is a need for this! > > mlock is weird, because it assumes that memory will be faulted in in the whole VMA. > > You'd likely have to populate + mlock the page when removing the guard. Right yeah that'd be super weird. And I don't want to add that logic. > Also not sure what happens if one does an mlock()/mlockall() after > already installing PTE markers. The existing logic already handles non-present cases by skipping them, in mlock_pte_range(): for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) { ptent = ptep_get(pte); if (!pte_present(ptent)) continue; ... } Which covers off guard regions. Removing the guard regions after this will leave you in a weird situation where these entries will be zapped... maybe we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case also populate? Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE if you since locked the range? The code currently allows this on the proviso that 'you aren't zapping locked mappings' but leaves the VMA in a state such that some entries would not be locked. It'd be pretty weird to lock guard regions like this. Having said all that, given what you say below, maybe it's not an issue after all?... > > __mm_populate() would skip whole VMAs in case populate_vma_page_range() > fails. And I would assume populate_vma_page_range() fails on the first > guard when it triggers a page fault. > > OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where > MADV_DONTNEED_LOCKED originates from: > > commit 9457056ac426e5ed0671356509c8dcce69f8dee0 > Author: Johannes Weiner <hannes@cmpxchg.org> > Date: Thu Mar 24 18:14:12 2022 -0700 > > mm: madvise: MADV_DONTNEED_LOCKED > MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT > and MCL_ONFAULT allowing to mlock without populating, there are valid use > cases for depopulating locked ranges as well. ...Hm this seems to imply the current guard remove stuff isn't quite so bad, so maybe the assumption that VM_LOCKED implies 'everything is populated' isn't quite as stringent then. The restriction is as simple as: if (behavior != MADV_DONTNEED_LOCKED) forbidden |= VM_LOCKED; > > > Adding support for that would be indeed nice. I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED locked ranges, but I really don't understand why you'd want to add guard regions to mlock()'ed regions? Then again we're currently asymmetric as you can add them _before_ mlock()'ing... > > -- > Cheers, > > David / dhildenb >
> Right yeah that'd be super weird. And I don't want to add that logic. > >> Also not sure what happens if one does an mlock()/mlockall() after >> already installing PTE markers. > > The existing logic already handles non-present cases by skipping them, in > mlock_pte_range(): > > for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) { > ptent = ptep_get(pte); > if (!pte_present(ptent)) > continue; > > ... > } I *think* that code only updates already-mapped folios, to properly call mlock_folio()/munlock_folio(). It is not the code that populates pages on mlock()/mlockall(). I think all that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should apply. See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling. > > Which covers off guard regions. Removing the guard regions after this will > leave you in a weird situation where these entries will be zapped... maybe > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case > also populate? Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED. > > Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE > if you since locked the range? The code currently allows this on the > proviso that 'you aren't zapping locked mappings' but leaves the VMA in a > state such that some entries would not be locked. > > It'd be pretty weird to lock guard regions like this. > > Having said all that, given what you say below, maybe it's not an issue > after all?... > >> >> __mm_populate() would skip whole VMAs in case populate_vma_page_range() >> fails. And I would assume populate_vma_page_range() fails on the first >> guard when it triggers a page fault. >> >> OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where >> MADV_DONTNEED_LOCKED originates from: >> >> commit 9457056ac426e5ed0671356509c8dcce69f8dee0 >> Author: Johannes Weiner <hannes@cmpxchg.org> >> Date: Thu Mar 24 18:14:12 2022 -0700 >> >> mm: madvise: MADV_DONTNEED_LOCKED >> MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT >> and MCL_ONFAULT allowing to mlock without populating, there are valid use >> cases for depopulating locked ranges as well. > > ...Hm this seems to imply the current guard remove stuff isn't quite so > bad, so maybe the assumption that VM_LOCKED implies 'everything is > populated' isn't quite as stringent then. Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is that everything is populated (unless, apparently one uses MADV_DONTNEED_LOCKED or population failed, maybe). VM_LOCKONFAULT seems to be the sane case. I wonder why MADV_DONTNEED_LOCKED didn't explicitly check for that one ... maybe there is a history to that. > > The restriction is as simple as: > > if (behavior != MADV_DONTNEED_LOCKED) > forbidden |= VM_LOCKED; > >> >> >> Adding support for that would be indeed nice. > > I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED > locked ranges, but I really don't understand why you'd want to add guard > regions to mlock()'ed regions? Somme apps use mlockall(), and it might be nice to just be able to use guard pages as if "Nothing happened". E.g., QEMU has the option to use mlockall(). > > Then again we're currently asymmetric as you can add them _before_ > mlock()'ing... Right.
On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote: > > Right yeah that'd be super weird. And I don't want to add that logic. > > > > > Also not sure what happens if one does an mlock()/mlockall() after > > > already installing PTE markers. > > > > The existing logic already handles non-present cases by skipping them, in > > mlock_pte_range(): > > > > for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) { > > ptent = ptep_get(pte); > > if (!pte_present(ptent)) > > continue; > > > > ... > > } > > I *think* that code only updates already-mapped folios, to properly call > mlock_folio()/munlock_folio(). Guard regions _are_ 'already mapped' :) so it leaves them in place. do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range() implies this will be invoked. > > It is not the code that populates pages on mlock()/mlockall(). I think all > that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should > apply. OK I want to correct what I said earlier. Installing a guard region then attempting mlock() will result in an error. The populate will -EFAULT and stop at the guard region, which causes mlock() to error out. This is a partial failure, so the VMA is split and has VM_LOCKED applied, but the populate halts at the guard region. This is ok as per previous discussion on aggregate operation failure, there can be no expectation of 'unwinding' of partially successful operations that form part of a requested aggregate one. However, given there's stuff to clean up, and on error a user _may_ wish to then remove guard regions and try again, I guess there's no harm in keeping the code as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place. > > See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling. Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly' yeah wrongly, very wrongly... > > > > > Which covers off guard regions. Removing the guard regions after this will > > leave you in a weird situation where these entries will be zapped... maybe > > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case > > also populate? > > Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED. See above, no we should not :P this is only good for cleanup after mlock() failure, although no sane program should really be trying to do this, a sane program would give up here (and it's a _programmatic error_ to try to mlock() a range with guard regions). > > > > > Actually I think the simpler option is to just disallow MADV_GUARD_REMOVE > > if you since locked the range? The code currently allows this on the > > proviso that 'you aren't zapping locked mappings' but leaves the VMA in a > > state such that some entries would not be locked. > > > > It'd be pretty weird to lock guard regions like this. > > > > Having said all that, given what you say below, maybe it's not an issue > > after all?... > > > > > > > > __mm_populate() would skip whole VMAs in case populate_vma_page_range() > > > fails. And I would assume populate_vma_page_range() fails on the first > > > guard when it triggers a page fault. > > > > > > OTOH, supporting the mlock-on-fault thingy should be easy. That's precisely where > > > MADV_DONTNEED_LOCKED originates from: > > > > > > commit 9457056ac426e5ed0671356509c8dcce69f8dee0 > > > Author: Johannes Weiner <hannes@cmpxchg.org> > > > Date: Thu Mar 24 18:14:12 2022 -0700 > > > > > > mm: madvise: MADV_DONTNEED_LOCKED > > > MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT > > > and MCL_ONFAULT allowing to mlock without populating, there are valid use > > > cases for depopulating locked ranges as well. > > > > ...Hm this seems to imply the current guard remove stuff isn't quite so > > bad, so maybe the assumption that VM_LOCKED implies 'everything is > > populated' isn't quite as stringent then. > > Right, with MCL_ONFAULT at least. Without MCL_ONFAULT, the assumption is > that everything is populated (unless, apparently one uses > MADV_DONTNEED_LOCKED or population failed, maybe). > > VM_LOCKONFAULT seems to be the sane case. I wonder why MADV_DONTNEED_LOCKED > didn't explicitly check for that one ... maybe there is a history to that. Yeah weird. > > > > > The restriction is as simple as: > > > > if (behavior != MADV_DONTNEED_LOCKED) > > forbidden |= VM_LOCKED; > > > > > > > > > > > Adding support for that would be indeed nice. > > > > I mean it's sort of maybe understandable why you'd want to MADV_DONTNEED > > locked ranges, but I really don't understand why you'd want to add guard > > regions to mlock()'ed regions? > > Somme apps use mlockall(), and it might be nice to just be able to use guard > pages as if "Nothing happened". Sadly I think not given above :P > > E.g., QEMU has the option to use mlockall(). > > > > > Then again we're currently asymmetric as you can add them _before_ > > mlock()'ing... > > Right. > > -- > Cheers, > > David / dhildenb > I think the _LOCKED idea is therefore kaput, because it just won't work properly because populating guard regions fails. It fails because it tries to 'touch' the memory, but 'touching' guard region memory causes a segfault. This kind of breaks the idea of mlock()'ing guard regions. I think adding workarounds to make this possible in any way is not really worth it (and would probably be pretty gross). We already document that 'mlock()ing lightweight guard regions will fail' as per man page so this is all in line with that.
On 18.02.25 17:43, Lorenzo Stoakes wrote: > On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote: >>> Right yeah that'd be super weird. And I don't want to add that logic. >>> >>>> Also not sure what happens if one does an mlock()/mlockall() after >>>> already installing PTE markers. >>> >>> The existing logic already handles non-present cases by skipping them, in >>> mlock_pte_range(): >>> >>> for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) { >>> ptent = ptep_get(pte); >>> if (!pte_present(ptent)) >>> continue; >>> >>> ... >>> } >> >> I *think* that code only updates already-mapped folios, to properly call >> mlock_folio()/munlock_folio(). > > Guard regions _are_ 'already mapped' :) so it leaves them in place. "mapped folios" -- there is no folio mapped. Yes, the VMA is in place. > > do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range() > implies this will be invoked. Yes, to process any already mapped folios, to then continue population later. > >> >> It is not the code that populates pages on mlock()/mlockall(). I think all >> that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should >> apply. > > OK I want to correct what I said earlier. > > Installing a guard region then attempting mlock() will result in an error. The > populate will -EFAULT and stop at the guard region, which causes mlock() to > error out. Right, that's my expectation. > > This is a partial failure, so the VMA is split and has VM_LOCKED applied, but > the populate halts at the guard region. > > This is ok as per previous discussion on aggregate operation failure, there can > be no expectation of 'unwinding' of partially successful operations that form > part of a requested aggregate one. > > However, given there's stuff to clean up, and on error a user _may_ wish to then > remove guard regions and try again, I guess there's no harm in keeping the code > as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place. Likely yes; it's all weird code. > >> >> See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling. > > Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly' > yeah wrongly, very wrongly... > >> >>> >>> Which covers off guard regions. Removing the guard regions after this will >>> leave you in a weird situation where these entries will be zapped... maybe >>> we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case >>> also populate? >> >> Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED. > > See above, no we should not :P this is only good for cleanup after mlock() > failure, although no sane program should really be trying to do this, a sane > program would give up here (and it's a _programmatic error_ to try to mlock() a > range with guard regions). >>>> Somme apps use mlockall(), and it might be nice to just be able to use guard >> pages as if "Nothing happened". > > Sadly I think not given above :P QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when requested to then exit(); if it fails. Assume glibc or any lib uses it, QEMU would have no real way of figuring that out or instructing offending libraries to disabled that, at least for now ... ... turning RT VMs less usable if any library uses guard regions. :( There is upcoming support for MCL_ONFAULT in QEMU [1] (see below). [1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com > >> >> E.g., QEMU has the option to use mlockall(). >> >>> >>> Then again we're currently asymmetric as you can add them _before_ >>> mlock()'ing... >> >> Right. >> >> -- >> Cheers, >> >> David / dhildenb >> > > I think the _LOCKED idea is therefore kaput, because it just won't work > properly because populating guard regions fails. Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT might be interesting, because we are skipping the population stage. > > It fails because it tries to 'touch' the memory, but 'touching' guard > region memory causes a segfault. This kind of breaks the idea of > mlock()'ing guard regions. > > I think adding workarounds to make this possible in any way is not really > worth it (and would probably be pretty gross). > > We already document that 'mlock()ing lightweight guard regions will fail' > as per man page so this is all in line with that. Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as allowing install/remove of guard regions when that flag is set.
On Tue, Feb 18, 2025 at 06:14:00PM +0100, David Hildenbrand wrote: > On 18.02.25 17:43, Lorenzo Stoakes wrote: > > On Tue, Feb 18, 2025 at 04:20:18PM +0100, David Hildenbrand wrote: > > > > Right yeah that'd be super weird. And I don't want to add that logic. > > > > > > > > > Also not sure what happens if one does an mlock()/mlockall() after > > > > > already installing PTE markers. > > > > > > > > The existing logic already handles non-present cases by skipping them, in > > > > mlock_pte_range(): > > > > > > > > for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) { > > > > ptent = ptep_get(pte); > > > > if (!pte_present(ptent)) > > > > continue; > > > > > > > > ... > > > > } > > > > > > I *think* that code only updates already-mapped folios, to properly call > > > mlock_folio()/munlock_folio(). > > > > Guard regions _are_ 'already mapped' :) so it leaves them in place. > > "mapped folios" -- there is no folio mapped. Yes, the VMA is in place. We're engaging in a moot discussion on this I think but I mean it appears to operate by walking page tables if they are populated, which they will be for guard regions, but when it finds it's non-present it will skip. This amounts to the same thing as not doing anything, obviously. > > > > > do_mlock() -> apply_vma_lock_flags() -> mlock_fixup() -> mlock_vma_pages_range() > > implies this will be invoked. > > Yes, to process any already mapped folios, to then continue population > later. > > > > > > > > > It is not the code that populates pages on mlock()/mlockall(). I think all > > > that goes via mm_populate()/__mm_populate(), where "ordinary GUP" should > > > apply. > > > > OK I want to correct what I said earlier. > > > > Installing a guard region then attempting mlock() will result in an error. The > > populate will -EFAULT and stop at the guard region, which causes mlock() to > > error out. > > Right, that's my expectation. OK good! > > > > > This is a partial failure, so the VMA is split and has VM_LOCKED applied, but > > the populate halts at the guard region. > > > > This is ok as per previous discussion on aggregate operation failure, there can > > be no expectation of 'unwinding' of partially successful operations that form > > part of a requested aggregate one. > > > > However, given there's stuff to clean up, and on error a user _may_ wish to then > > remove guard regions and try again, I guess there's no harm in keeping the code > > as it is where we allow MADV_GUARD_REMOVE even if VM_LOCKED is in place. > > Likely yes; it's all weird code. > > > > > > > > > See populate_vma_page_range(), especially also the VM_LOCKONFAULT handling. > > > > Yeah that code is horrible, you just reminded me of it... 'rightly or wrongly' > > yeah wrongly, very wrongly... > > > > > > > > > > > > > Which covers off guard regions. Removing the guard regions after this will > > > > leave you in a weird situation where these entries will be zapped... maybe > > > > we need a patch to make MADV_GUARD_REMOVE check VM_LOCKED and in this case > > > > also populate? > > > > > > Maybe? Or we say that it behaves like MADV_DONTNEED_LOCKED. > > > > See above, no we should not :P this is only good for cleanup after mlock() > > failure, although no sane program should really be trying to do this, a sane > > program would give up here (and it's a _programmatic error_ to try to mlock() a > > range with guard regions). > >>>> Somme apps use mlockall(), and it might be nice to just be able to use > guard > > > pages as if "Nothing happened". > > > > Sadly I think not given above :P > > QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when > requested to then exit(); if it fails. Hm under what circumstances? I use qemu extensively to test this stuff with no issues. Unless you mean it's using it in the 'host' code somehow. > > Assume glibc or any lib uses it, QEMU would have no real way of figuring > that out or instructing offending libraries to disabled that, at least for > now ... > > ... turning RT VMs less usable if any library uses guard regions. :( > This seems really stupid, to be honest. Unfortunately there's no way around this, if software does stupid things then they get stupid prizes. There are other ways mlock() and faulting in can fail too. > There is upcoming support for MCL_ONFAULT in QEMU [1] (see below). Good. > > [1] https://lkml.kernel.org/r/20250212173823.214429-3-peterx@redhat.com > > > > > > > > > E.g., QEMU has the option to use mlockall(). > > > > > > > > > > > Then again we're currently asymmetric as you can add them _before_ > > > > mlock()'ing... > > > > > > Right. > > > > > > -- > > > Cheers, > > > > > > David / dhildenb > > > > > > > I think the _LOCKED idea is therefore kaput, because it just won't work > > properly because populating guard regions fails. > > Right, I think basic VM_LOCKED is out of the picture. VM_LOCKONFAULT might > be interesting, because we are skipping the population stage. > > > > > It fails because it tries to 'touch' the memory, but 'touching' guard > > region memory causes a segfault. This kind of breaks the idea of > > mlock()'ing guard regions. > > > > I think adding workarounds to make this possible in any way is not really > > worth it (and would probably be pretty gross). > > > > We already document that 'mlock()ing lightweight guard regions will fail' > > as per man page so this is all in line with that. > > Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as > allowing install/remove of guard regions when that flag is set. We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we disallow. > > -- > Cheers, > > David / dhildenb >
>> >> QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when >> requested to then exit(); if it fails. > > Hm under what circumstances? I use qemu extensively to test this stuff with > no issues. Unless you mean it's using it in the 'host' code somehow. -overcommit mem-lock=on or (legacy) -realtime mlock=on I think. [...] >>> >>> It fails because it tries to 'touch' the memory, but 'touching' guard >>> region memory causes a segfault. This kind of breaks the idea of >>> mlock()'ing guard regions. >>> >>> I think adding workarounds to make this possible in any way is not really >>> worth it (and would probably be pretty gross). >>> >>> We already document that 'mlock()ing lightweight guard regions will fail' >>> as per man page so this is all in line with that. >> >> Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as >> allowing install/remove of guard regions when that flag is set. > > We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we > disallow. See mlock2(); SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags) { vm_flags_t vm_flags = VM_LOCKED; if (flags & ~MLOCK_ONFAULT) return -EINVAL; if (flags & MLOCK_ONFAULT) vm_flags |= VM_LOCKONFAULT; return do_mlock(start, len, vm_flags); } VM_LOCKONFAULT always as VM_LOCKED set as well.
On Tue, Feb 18, 2025 at 06:25:35PM +0100, David Hildenbrand wrote: > > > > > > QEMU, for example, will issue an mlockall(MCL_CURRENT | MCL_FUTURE); when > > > requested to then exit(); if it fails. > > > > Hm under what circumstances? I use qemu extensively to test this stuff with > > no issues. Unless you mean it's using it in the 'host' code somehow. > > > -overcommit mem-lock=on > > or (legacy) > > -realtime mlock=on > > I think. > > [...] Thanks > > > > > > > > > It fails because it tries to 'touch' the memory, but 'touching' guard > > > > region memory causes a segfault. This kind of breaks the idea of > > > > mlock()'ing guard regions. > > > > > > > > I think adding workarounds to make this possible in any way is not really > > > > worth it (and would probably be pretty gross). > > > > > > > > We already document that 'mlock()ing lightweight guard regions will fail' > > > > as per man page so this is all in line with that. > > > > > > Right, and I claim that supporting VM_LOCKONFAULT might likely be as easy as > > > allowing install/remove of guard regions when that flag is set. > > > > We already allow this flag! VM_LOCKED and VM_HUGETLB are the only flags we > > disallow. > > > See mlock2(); > > SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags) > { > vm_flags_t vm_flags = VM_LOCKED; > > if (flags & ~MLOCK_ONFAULT) > return -EINVAL; > > if (flags & MLOCK_ONFAULT) > vm_flags |= VM_LOCKONFAULT; > > return do_mlock(start, len, vm_flags); > } > > > VM_LOCKONFAULT always as VM_LOCKED set as well. OK cool, that makes sense. As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew again, then... :P if only somebody would write it down in a book... Yeah then that makes sense to check explicitly for (VM_LOCKED | VM_LOCKONFAULT) in any MADV_GUARD_INSTALL_LOCKED variant as obviously this would be passively excluded right now. > > -- > Cheers, > > David / dhildenb >
>> See mlock2(); >> >> SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags) >> { >> vm_flags_t vm_flags = VM_LOCKED; >> >> if (flags & ~MLOCK_ONFAULT) >> return -EINVAL; >> >> if (flags & MLOCK_ONFAULT) >> vm_flags |= VM_LOCKONFAULT; >> >> return do_mlock(start, len, vm_flags); >> } >> >> >> VM_LOCKONFAULT always as VM_LOCKED set as well. > > OK cool, that makes sense. > > As with much kernel stuff, I knew this in the past. Then I forgot. Then I knew > again, then... :P if only somebody would write it down in a book... And if there would be such a book in the makings, I would consider hurrying up and preordering it ;)
On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > The guard regions feature was initially implemented to support anonymous > mappings only, excluding shmem. > > This was done such as to introduce the feature carefully and incrementally > and to be conservative when considering the various caveats and corner > cases that are applicable to file-backed mappings but not to anonymous > ones. > > Now this feature has landed in 6.13, it is time to revisit this and to > extend this functionality to file-backed and shmem mappings. > > In order to make this maximally useful, and since one may map file-backed > mappings read-only (for instance ELF images), we also remove the > restriction on read-only mappings and permit the establishment of guard > regions in any non-hugetlb, non-mlock()'d mapping. Hi Lorenzo, Thank you for your work on this. Have we thought about how guard regions are represented in /proc/*/[s]maps? In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps. -- Kalesh > > It is permissible to permit the establishment of guard regions in read-only > mappings because the guard regions only reduce access to the mapping, and > when removed simply reinstate the existing attributes of the underlying > VMA, meaning no access violations can occur. > > While the change in kernel code introduced in this series is small, the > majority of the effort here is spent in extending the testing to assert > that the feature works correctly across numerous file-backed mapping > scenarios. > > Every single guard region self-test performed against anonymous memory > (which is relevant and not anon-only) has now been updated to also be > performed against shmem and a mapping of a file in the working directory. > > This confirms that all cases also function correctly for file-backed guard > regions. > > In addition a number of other tests are added for specific file-backed > mapping scenarios. > > There are a number of other concerns that one might have with regard to > guard regions, addressed below: > > Readahead > ~~~~~~~~~ > > Readahead is a process through which the page cache is populated on the > assumption that sequential reads will occur, thus amortising I/O and, > through a clever use of the PG_readahead folio flag establishing during > major fault and checked upon minor fault, provides for asynchronous I/O to > occur as dat is processed, reducing I/O stalls as data is faulted in. > > Guard regions do not alter this mechanism which operations at the folio and > fault level, but do of course prevent the faulting of folios that would > otherwise be mapped. > > In the instance of a major fault prior to a guard region, synchronous > readahead will occur including populating folios in the page cache which > the guard regions will, in the case of the mapping in question, prevent > access to. > > In addition, if PG_readahead is placed in a folio that is now inaccessible, > this will prevent asynchronous readahead from occurring as it would > otherwise do. > > However, there are mechanisms for heuristically resetting this within > readahead regardless, which will 'recover' correct readahead behaviour. > > Readahead presumes sequential data access, the presence of a guard region > clearly indicates that, at least in the guard region, no such sequential > access will occur, as it cannot occur there. > > So this should have very little impact on any real workload. The far more > important point is as to whether readahead causes incorrect or > inappropriate mapping of ranges disallowed by the presence of guard > regions - this is not the case, as readahead does not 'pre-fault' memory in > this fashion. > > At any rate, any mechanism which would attempt to do so would hit the usual > page fault paths, which correctly handle PTE markers as with anonymous > mappings. > > Fault-Around > ~~~~~~~~~~~~ > > The fault-around logic, in a similar vein to readahead, attempts to improve > efficiency with regard to file-backed memory mappings, however it differs > in that it does not try to fetch folios into the page cache that are about > to be accessed, but rather pre-maps a range of folios around the faulting > address. > > Guard regions making use of PTE markers makes this relatively trivial, as > this case is already handled - see filemap_map_folio_range() and > filemap_map_order0_folio() - in both instances, the solution is to simply > keep the established page table mappings and let the fault handler take > care of PTE markers, as per the comment: > > /* > * NOTE: If there're PTE markers, we'll leave them to be > * handled in the specific fault path, and it'll prohibit > * the fault-around logic. > */ > > This works, as establishing guard regions results in page table mappings > with PTE markers, and clearing them removes them. > > Truncation > ~~~~~~~~~~ > > File truncation will not eliminate existing guard regions, as the > truncation operation will ultimately zap the range via > unmap_mapping_range(), which specifically excludes PTE markers. > > Zapping > ~~~~~~~ > > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(), > which specifically deals with guard entries, leaving them intact except in > instances such as process teardown or munmap() where they need to be > removed. > > Reclaim > ~~~~~~~ > > When reclaim is performed on file-backed folios, it ultimately invokes > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte() > will ultimately abort the operation for the guard region mapping. If large, > then check_pte() will determine that this is a non-device private > entry/device-exclusive entry 'swap' PTE and thus abort the operation in > that instance. > > Therefore, no odd things happen in the instance of reclaim being attempted > upon a file-backed guard region. > > Hole Punching > ~~~~~~~~~~~~~ > > This updates the page cache and ultimately invokes unmap_mapping_range(), > which explicitly leaves PTE markers in place. > > Because the establishment of guard regions zapped any existing mappings to > file-backed folios, once the guard regions are removed then the > hole-punched region will be faulted in as usual and everything will behave > as expected. > > Lorenzo Stoakes (4): > mm: allow guard regions in file-backed and read-only mappings > selftests/mm: rename guard-pages to guard-regions > tools/selftests: expand all guard region tests to file-backed > tools/selftests: add file/shmem-backed mapping guard region tests > > mm/madvise.c | 8 +- > tools/testing/selftests/mm/.gitignore | 2 +- > tools/testing/selftests/mm/Makefile | 2 +- > .../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++-- > 4 files changed, 821 insertions(+), 112 deletions(-) > rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%) > > -- > 2.48.1
On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote: > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > > > The guard regions feature was initially implemented to support anonymous > > mappings only, excluding shmem. > > > > This was done such as to introduce the feature carefully and incrementally > > and to be conservative when considering the various caveats and corner > > cases that are applicable to file-backed mappings but not to anonymous > > ones. > > > > Now this feature has landed in 6.13, it is time to revisit this and to > > extend this functionality to file-backed and shmem mappings. > > > > In order to make this maximally useful, and since one may map file-backed > > mappings read-only (for instance ELF images), we also remove the > > restriction on read-only mappings and permit the establishment of guard > > regions in any non-hugetlb, non-mlock()'d mapping. > > Hi Lorenzo, > > Thank you for your work on this. > > Have we thought about how guard regions are represented in /proc/*/[s]maps? > > In the field, I've found that many applications read the ranges from > /proc/self/[s]maps to determine what they can access (usually related > to obfuscation techniques). If they don't know of the guard regions it > would cause them to crash; I think that we'll need similar entries to > PROT_NONE (---p) for these, and generally to maintain consistency > between the behavior and what is being said from /proc/*/[s]maps. To clarify why the applications may not be aware of their guard regions -- in the case of the ELF mappings these PROT_NONE (guard regions) would be installed by the dynamic loader; or may be inherited from the parent (zygote in Android's case). > > -- Kalesh > > > > > It is permissible to permit the establishment of guard regions in read-only > > mappings because the guard regions only reduce access to the mapping, and > > when removed simply reinstate the existing attributes of the underlying > > VMA, meaning no access violations can occur. > > > > While the change in kernel code introduced in this series is small, the > > majority of the effort here is spent in extending the testing to assert > > that the feature works correctly across numerous file-backed mapping > > scenarios. > > > > Every single guard region self-test performed against anonymous memory > > (which is relevant and not anon-only) has now been updated to also be > > performed against shmem and a mapping of a file in the working directory. > > > > This confirms that all cases also function correctly for file-backed guard > > regions. > > > > In addition a number of other tests are added for specific file-backed > > mapping scenarios. > > > > There are a number of other concerns that one might have with regard to > > guard regions, addressed below: > > > > Readahead > > ~~~~~~~~~ > > > > Readahead is a process through which the page cache is populated on the > > assumption that sequential reads will occur, thus amortising I/O and, > > through a clever use of the PG_readahead folio flag establishing during > > major fault and checked upon minor fault, provides for asynchronous I/O to > > occur as dat is processed, reducing I/O stalls as data is faulted in. > > > > Guard regions do not alter this mechanism which operations at the folio and > > fault level, but do of course prevent the faulting of folios that would > > otherwise be mapped. > > > > In the instance of a major fault prior to a guard region, synchronous > > readahead will occur including populating folios in the page cache which > > the guard regions will, in the case of the mapping in question, prevent > > access to. > > > > In addition, if PG_readahead is placed in a folio that is now inaccessible, > > this will prevent asynchronous readahead from occurring as it would > > otherwise do. > > > > However, there are mechanisms for heuristically resetting this within > > readahead regardless, which will 'recover' correct readahead behaviour. > > > > Readahead presumes sequential data access, the presence of a guard region > > clearly indicates that, at least in the guard region, no such sequential > > access will occur, as it cannot occur there. > > > > So this should have very little impact on any real workload. The far more > > important point is as to whether readahead causes incorrect or > > inappropriate mapping of ranges disallowed by the presence of guard > > regions - this is not the case, as readahead does not 'pre-fault' memory in > > this fashion. > > > > At any rate, any mechanism which would attempt to do so would hit the usual > > page fault paths, which correctly handle PTE markers as with anonymous > > mappings. > > > > Fault-Around > > ~~~~~~~~~~~~ > > > > The fault-around logic, in a similar vein to readahead, attempts to improve > > efficiency with regard to file-backed memory mappings, however it differs > > in that it does not try to fetch folios into the page cache that are about > > to be accessed, but rather pre-maps a range of folios around the faulting > > address. > > > > Guard regions making use of PTE markers makes this relatively trivial, as > > this case is already handled - see filemap_map_folio_range() and > > filemap_map_order0_folio() - in both instances, the solution is to simply > > keep the established page table mappings and let the fault handler take > > care of PTE markers, as per the comment: > > > > /* > > * NOTE: If there're PTE markers, we'll leave them to be > > * handled in the specific fault path, and it'll prohibit > > * the fault-around logic. > > */ > > > > This works, as establishing guard regions results in page table mappings > > with PTE markers, and clearing them removes them. > > > > Truncation > > ~~~~~~~~~~ > > > > File truncation will not eliminate existing guard regions, as the > > truncation operation will ultimately zap the range via > > unmap_mapping_range(), which specifically excludes PTE markers. > > > > Zapping > > ~~~~~~~ > > > > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(), > > which specifically deals with guard entries, leaving them intact except in > > instances such as process teardown or munmap() where they need to be > > removed. > > > > Reclaim > > ~~~~~~~ > > > > When reclaim is performed on file-backed folios, it ultimately invokes > > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte() > > will ultimately abort the operation for the guard region mapping. If large, > > then check_pte() will determine that this is a non-device private > > entry/device-exclusive entry 'swap' PTE and thus abort the operation in > > that instance. > > > > Therefore, no odd things happen in the instance of reclaim being attempted > > upon a file-backed guard region. > > > > Hole Punching > > ~~~~~~~~~~~~~ > > > > This updates the page cache and ultimately invokes unmap_mapping_range(), > > which explicitly leaves PTE markers in place. > > > > Because the establishment of guard regions zapped any existing mappings to > > file-backed folios, once the guard regions are removed then the > > hole-punched region will be faulted in as usual and everything will behave > > as expected. > > > > Lorenzo Stoakes (4): > > mm: allow guard regions in file-backed and read-only mappings > > selftests/mm: rename guard-pages to guard-regions > > tools/selftests: expand all guard region tests to file-backed > > tools/selftests: add file/shmem-backed mapping guard region tests > > > > mm/madvise.c | 8 +- > > tools/testing/selftests/mm/.gitignore | 2 +- > > tools/testing/selftests/mm/Makefile | 2 +- > > .../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++-- > > 4 files changed, 821 insertions(+), 112 deletions(-) > > rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%) > > > > -- > > 2.48.1
On Wed, Feb 19, 2025 at 12:35:12AM -0800, Kalesh Singh wrote: > On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote: > > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > The guard regions feature was initially implemented to support anonymous > > > mappings only, excluding shmem. > > > > > > This was done such as to introduce the feature carefully and incrementally > > > and to be conservative when considering the various caveats and corner > > > cases that are applicable to file-backed mappings but not to anonymous > > > ones. > > > > > > Now this feature has landed in 6.13, it is time to revisit this and to > > > extend this functionality to file-backed and shmem mappings. > > > > > > In order to make this maximally useful, and since one may map file-backed > > > mappings read-only (for instance ELF images), we also remove the > > > restriction on read-only mappings and permit the establishment of guard > > > regions in any non-hugetlb, non-mlock()'d mapping. > > > > Hi Lorenzo, > > > > Thank you for your work on this. > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps? > > > > In the field, I've found that many applications read the ranges from > > /proc/self/[s]maps to determine what they can access (usually related > > to obfuscation techniques). If they don't know of the guard regions it > > would cause them to crash; I think that we'll need similar entries to > > PROT_NONE (---p) for these, and generally to maintain consistency > > between the behavior and what is being said from /proc/*/[s]maps. > > To clarify why the applications may not be aware of their guard > regions -- in the case of the ELF mappings these PROT_NONE (guard > regions) would be installed by the dynamic loader; or may be inherited > from the parent (zygote in Android's case). I'm still really confused as to how you propose to use guard regions in this scenario. If it's trying to 'clone' some set of mappings somehow then guard regions are just not compatible with your use case at all. I'm not sure what would be honestly... you maybe could find a way to modify the loader to interpret PROT_NONE mappings as a cue to use guard regions instead if that worked? You can't both not have a VMA (something that describes mapping with common attributes) and have page table state along with somehow having something that describes a mapping with said page table state :) It's a sort of non-such-beast. Everything is a trade-off, guard regions provide the option to retain state regarding faulting behaviour in page tables rather the VMAs, entirely opt-in and at the behest of the user. Now extended to file-backed and shmem! While I hoped it would help your scenario based on your talk at LPC, this is very clearly an important change for guard regions (for shmem most notably) and removes an important limitation and exposes the possibility of using it for things such as ELF mappings for those who don't also require specific [s]maps state (which explicitly NEEDS VMA metadata state). I suspect you can find a way of making this work in your scenario though. An alternative if you're willing to wait for page table state to be acquired is to expose a new /proc/$pid/... endpoint that explicitly outputs guard regions. Would this solve your problem...? I had always considered implementing this if there was a demand. Maybe /proc/$pid/gmaps? :P 'g' could mean guard regions or... ;) > > > > > -- Kalesh > > > > > > > > It is permissible to permit the establishment of guard regions in read-only > > > mappings because the guard regions only reduce access to the mapping, and > > > when removed simply reinstate the existing attributes of the underlying > > > VMA, meaning no access violations can occur. > > > > > > While the change in kernel code introduced in this series is small, the > > > majority of the effort here is spent in extending the testing to assert > > > that the feature works correctly across numerous file-backed mapping > > > scenarios. > > > > > > Every single guard region self-test performed against anonymous memory > > > (which is relevant and not anon-only) has now been updated to also be > > > performed against shmem and a mapping of a file in the working directory. > > > > > > This confirms that all cases also function correctly for file-backed guard > > > regions. > > > > > > In addition a number of other tests are added for specific file-backed > > > mapping scenarios. > > > > > > There are a number of other concerns that one might have with regard to > > > guard regions, addressed below: > > > > > > Readahead > > > ~~~~~~~~~ > > > > > > Readahead is a process through which the page cache is populated on the > > > assumption that sequential reads will occur, thus amortising I/O and, > > > through a clever use of the PG_readahead folio flag establishing during > > > major fault and checked upon minor fault, provides for asynchronous I/O to > > > occur as dat is processed, reducing I/O stalls as data is faulted in. > > > > > > Guard regions do not alter this mechanism which operations at the folio and > > > fault level, but do of course prevent the faulting of folios that would > > > otherwise be mapped. > > > > > > In the instance of a major fault prior to a guard region, synchronous > > > readahead will occur including populating folios in the page cache which > > > the guard regions will, in the case of the mapping in question, prevent > > > access to. > > > > > > In addition, if PG_readahead is placed in a folio that is now inaccessible, > > > this will prevent asynchronous readahead from occurring as it would > > > otherwise do. > > > > > > However, there are mechanisms for heuristically resetting this within > > > readahead regardless, which will 'recover' correct readahead behaviour. > > > > > > Readahead presumes sequential data access, the presence of a guard region > > > clearly indicates that, at least in the guard region, no such sequential > > > access will occur, as it cannot occur there. > > > > > > So this should have very little impact on any real workload. The far more > > > important point is as to whether readahead causes incorrect or > > > inappropriate mapping of ranges disallowed by the presence of guard > > > regions - this is not the case, as readahead does not 'pre-fault' memory in > > > this fashion. > > > > > > At any rate, any mechanism which would attempt to do so would hit the usual > > > page fault paths, which correctly handle PTE markers as with anonymous > > > mappings. > > > > > > Fault-Around > > > ~~~~~~~~~~~~ > > > > > > The fault-around logic, in a similar vein to readahead, attempts to improve > > > efficiency with regard to file-backed memory mappings, however it differs > > > in that it does not try to fetch folios into the page cache that are about > > > to be accessed, but rather pre-maps a range of folios around the faulting > > > address. > > > > > > Guard regions making use of PTE markers makes this relatively trivial, as > > > this case is already handled - see filemap_map_folio_range() and > > > filemap_map_order0_folio() - in both instances, the solution is to simply > > > keep the established page table mappings and let the fault handler take > > > care of PTE markers, as per the comment: > > > > > > /* > > > * NOTE: If there're PTE markers, we'll leave them to be > > > * handled in the specific fault path, and it'll prohibit > > > * the fault-around logic. > > > */ > > > > > > This works, as establishing guard regions results in page table mappings > > > with PTE markers, and clearing them removes them. > > > > > > Truncation > > > ~~~~~~~~~~ > > > > > > File truncation will not eliminate existing guard regions, as the > > > truncation operation will ultimately zap the range via > > > unmap_mapping_range(), which specifically excludes PTE markers. > > > > > > Zapping > > > ~~~~~~~ > > > > > > Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(), > > > which specifically deals with guard entries, leaving them intact except in > > > instances such as process teardown or munmap() where they need to be > > > removed. > > > > > > Reclaim > > > ~~~~~~~ > > > > > > When reclaim is performed on file-backed folios, it ultimately invokes > > > try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte() > > > will ultimately abort the operation for the guard region mapping. If large, > > > then check_pte() will determine that this is a non-device private > > > entry/device-exclusive entry 'swap' PTE and thus abort the operation in > > > that instance. > > > > > > Therefore, no odd things happen in the instance of reclaim being attempted > > > upon a file-backed guard region. > > > > > > Hole Punching > > > ~~~~~~~~~~~~~ > > > > > > This updates the page cache and ultimately invokes unmap_mapping_range(), > > > which explicitly leaves PTE markers in place. > > > > > > Because the establishment of guard regions zapped any existing mappings to > > > file-backed folios, once the guard regions are removed then the > > > hole-punched region will be faulted in as usual and everything will behave > > > as expected. > > > > > > Lorenzo Stoakes (4): > > > mm: allow guard regions in file-backed and read-only mappings > > > selftests/mm: rename guard-pages to guard-regions > > > tools/selftests: expand all guard region tests to file-backed > > > tools/selftests: add file/shmem-backed mapping guard region tests > > > > > > mm/madvise.c | 8 +- > > > tools/testing/selftests/mm/.gitignore | 2 +- > > > tools/testing/selftests/mm/Makefile | 2 +- > > > .../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++-- > > > 4 files changed, 821 insertions(+), 112 deletions(-) > > > rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%) > > > > > > -- > > > 2.48.1
On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote: > On 19.02.25 10:03, Lorenzo Stoakes wrote: > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote: > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes > > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > > > The guard regions feature was initially implemented to support anonymous > > > > mappings only, excluding shmem. > > > > > > > > This was done such as to introduce the feature carefully and incrementally > > > > and to be conservative when considering the various caveats and corner > > > > cases that are applicable to file-backed mappings but not to anonymous > > > > ones. > > > > > > > > Now this feature has landed in 6.13, it is time to revisit this and to > > > > extend this functionality to file-backed and shmem mappings. > > > > > > > > In order to make this maximally useful, and since one may map file-backed > > > > mappings read-only (for instance ELF images), we also remove the > > > > restriction on read-only mappings and permit the establishment of guard > > > > regions in any non-hugetlb, non-mlock()'d mapping. > > > > > > Hi Lorenzo, > > > > > > Thank you for your work on this. > > > > You're welcome. > > > > > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps? > > > > This is off-topic here but... Yes, extensively. No they do not appear > > there. > > > > I thought you had attended LPC and my talk where I mentioned this > > purposefully as a drawback? > > > > I went out of my way to advertise this limitation at the LPC talk, in the > > original series, etc. so it's a little disappointing that this is being > > brought up so late, but nobody else has raised objections to this issue so > > I think in general it's not a limitation that matters in practice. > > > > > > > > In the field, I've found that many applications read the ranges from > > > /proc/self/[s]maps to determine what they can access (usually related > > > to obfuscation techniques). If they don't know of the guard regions it > > > would cause them to crash; I think that we'll need similar entries to > > > PROT_NONE (---p) for these, and generally to maintain consistency > > > between the behavior and what is being said from /proc/*/[s]maps. > > > > No, we cannot have these, sorry. > > > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this > > feature is to avoid having to accumulate VMAs for regions which are not > > intended to be accessible. > > > > Secondly, there is no practical means for this to be accomplished in > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates > > they have guard regions. > > > > This is intentional, because setting such metadata is simply not practical > > - why? Because when you try to split the VMA, how do you know which bit > > gets the metadata and which doesn't? You can't without _reading page > > tables_. > > > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending > > VMAs exist when they don't, this would be completely inaccurate, would > > break assumptions for things like mremap (which require a single VMA) and > > would be unworkable. > > > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps > > saying 'hey this region has guard regions somewhere'. > > And then simply expose it in /proc/$pid/pagemap, which is a better interface > for this pte-level information inside of VMAs. We should still have a spare > bit for that purpose in the pagemap entries. Ah yeah thanks David forgot about that! This is also a possibility if that'd solve your problems Kalesh? This bit will be fought over haha > > -- > Cheers, > > David / dhildenb >
* Kalesh Singh <kaleshsingh@google.com> [250219 03:35]: > On Wed, Feb 19, 2025 at 12:25 AM Kalesh Singh <kaleshsingh@google.com> wrote: > > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > The guard regions feature was initially implemented to support anonymous > > > mappings only, excluding shmem. > > > > > > This was done such as to introduce the feature carefully and incrementally > > > and to be conservative when considering the various caveats and corner > > > cases that are applicable to file-backed mappings but not to anonymous > > > ones. > > > > > > Now this feature has landed in 6.13, it is time to revisit this and to > > > extend this functionality to file-backed and shmem mappings. > > > > > > In order to make this maximally useful, and since one may map file-backed > > > mappings read-only (for instance ELF images), we also remove the > > > restriction on read-only mappings and permit the establishment of guard > > > regions in any non-hugetlb, non-mlock()'d mapping. > > > > Hi Lorenzo, > > > > Thank you for your work on this. > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps? > > > > In the field, I've found that many applications read the ranges from > > /proc/self/[s]maps to determine what they can access (usually related > > to obfuscation techniques). If they don't know of the guard regions it > > would cause them to crash; I think that we'll need similar entries to > > PROT_NONE (---p) for these, and generally to maintain consistency > > between the behavior and what is being said from /proc/*/[s]maps. > > To clarify why the applications may not be aware of their guard > regions -- in the case of the ELF mappings these PROT_NONE (guard > regions) would be installed by the dynamic loader; or may be inherited > from the parent (zygote in Android's case). The guard regions are a method to reduce vma count and speed up guard installation and removal. This very much will change the proc interface by its design, since it's removing information from the maps for the advantage of speed and less resources. I thought that was pretty clear from the start. The proc interface is also not a very good way to programmatically get this information, as I'm sure you are aware. I'm also sure you know that reading that file takes the mmap read lock, and tearing can occur. I think this implies you are taking a lot of time to get the information you want in the way you are getting it (parsing a text file, and not doing any mmap write work)? If this is a common pattern, I think we need a better interface to query this type of information. We have an ioctl going in for getting vma information, but that will lack these new guard regions as well. David mentioned /proc/self/pagemap - that's certainly worth exploring, and doesn't involve text parsing. If it's not that common, then maybe your zygote/loader can communicate the information in a way that does not involve read locking the programs vm area? Are you really parsing the same library information for each application launched to find the guards instead of asking or being told what they are? I must be missing something.. Thanks, Liam
On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote: > > On 19.02.25 10:03, Lorenzo Stoakes wrote: > > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote: > > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes > > > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > > > > > The guard regions feature was initially implemented to support anonymous > > > > > mappings only, excluding shmem. > > > > > > > > > > This was done such as to introduce the feature carefully and incrementally > > > > > and to be conservative when considering the various caveats and corner > > > > > cases that are applicable to file-backed mappings but not to anonymous > > > > > ones. > > > > > > > > > > Now this feature has landed in 6.13, it is time to revisit this and to > > > > > extend this functionality to file-backed and shmem mappings. > > > > > > > > > > In order to make this maximally useful, and since one may map file-backed > > > > > mappings read-only (for instance ELF images), we also remove the > > > > > restriction on read-only mappings and permit the establishment of guard > > > > > regions in any non-hugetlb, non-mlock()'d mapping. > > > > > > > > Hi Lorenzo, > > > > > > > > Thank you for your work on this. > > > > > > You're welcome. > > > > > > > > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps? > > > > > > This is off-topic here but... Yes, extensively. No they do not appear > > > there. > > > > > > I thought you had attended LPC and my talk where I mentioned this > > > purposefully as a drawback? > > > > > > I went out of my way to advertise this limitation at the LPC talk, in the > > > original series, etc. so it's a little disappointing that this is being > > > brought up so late, but nobody else has raised objections to this issue so > > > I think in general it's not a limitation that matters in practice. > > > Sorry for raising this now, yes at the time I believe we discussed reducing the vma slab memory usage for the PROT_NONE mappings. I didn't imagine that apps could have dependencies on the mapped ELF ranges in /proc/self/[s]maps until recent breakages from a similar feature. Android itself doesn't depend on this but what I've seen is banking apps and apps that have obfuscation to prevent reverse engineering (the particulars of such obfuscation are a black box). > > > > > > > > In the field, I've found that many applications read the ranges from > > > > /proc/self/[s]maps to determine what they can access (usually related > > > > to obfuscation techniques). If they don't know of the guard regions it > > > > would cause them to crash; I think that we'll need similar entries to > > > > PROT_NONE (---p) for these, and generally to maintain consistency > > > > between the behavior and what is being said from /proc/*/[s]maps. > > > > > > No, we cannot have these, sorry. > > > > > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this > > > feature is to avoid having to accumulate VMAs for regions which are not > > > intended to be accessible. > > > > > > Secondly, there is no practical means for this to be accomplished in > > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates > > > they have guard regions. > > > > > > This is intentional, because setting such metadata is simply not practical > > > - why? Because when you try to split the VMA, how do you know which bit > > > gets the metadata and which doesn't? You can't without _reading page > > > tables_. Yeah the splitting becomes complicated with any vm flags for this... meaning any attempt to expose this in /proc/*/maps have to unconditionally walk the page tables :( > > > > > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending > > > VMAs exist when they don't, this would be completely inaccurate, would > > > break assumptions for things like mremap (which require a single VMA) and > > > would be unworkable. > > > > > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps > > > saying 'hey this region has guard regions somewhere'. > > > > And then simply expose it in /proc/$pid/pagemap, which is a better interface > > for this pte-level information inside of VMAs. We should still have a spare > > bit for that purpose in the pagemap entries. > > Ah yeah thanks David forgot about that! > > This is also a possibility if that'd solve your problems Kalesh? I'm not sure what is the correct interface to advertise these. Maybe smaps as you suggested since we already walk the page tables there? and pagemap bit for the exact pages as well? It won't solve this particular issue, as 1000s of in field apps do look at this through /proc/*/maps. But maybe we have to live with that... We can argue that such apps are broken since they may trip on the SIGBUS off the end of the file -- usually this isn't the case for the ELF segment mappings. This is still useful for other cases, I just wanted to get some ideas if this can be extended to further use cases. Thanks, Kalesh > > This bit will be fought over haha > > > > > -- > > Cheers, > > > > David / dhildenb > >
On Wed, Feb 19, 2025 at 11:20 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Wed, Feb 19, 2025 at 10:52:04AM -0800, Kalesh Singh wrote: > > On Wed, Feb 19, 2025 at 1:17 AM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > On Wed, Feb 19, 2025 at 10:15:47AM +0100, David Hildenbrand wrote: > > > > On 19.02.25 10:03, Lorenzo Stoakes wrote: > > > > > On Wed, Feb 19, 2025 at 12:25:51AM -0800, Kalesh Singh wrote: > > > > > > On Thu, Feb 13, 2025 at 10:18 AM Lorenzo Stoakes > > > > > > <lorenzo.stoakes@oracle.com> wrote: > > > > > > > > > > > > > > The guard regions feature was initially implemented to support anonymous > > > > > > > mappings only, excluding shmem. > > > > > > > > > > > > > > This was done such as to introduce the feature carefully and incrementally > > > > > > > and to be conservative when considering the various caveats and corner > > > > > > > cases that are applicable to file-backed mappings but not to anonymous > > > > > > > ones. > > > > > > > > > > > > > > Now this feature has landed in 6.13, it is time to revisit this and to > > > > > > > extend this functionality to file-backed and shmem mappings. > > > > > > > > > > > > > > In order to make this maximally useful, and since one may map file-backed > > > > > > > mappings read-only (for instance ELF images), we also remove the > > > > > > > restriction on read-only mappings and permit the establishment of guard > > > > > > > regions in any non-hugetlb, non-mlock()'d mapping. > > > > > > > > > > > > Hi Lorenzo, > > > > > > > > > > > > Thank you for your work on this. > > > > > > > > > > You're welcome. > > > > > > > > > > > > > > > > > Have we thought about how guard regions are represented in /proc/*/[s]maps? > > > > > > > > > > This is off-topic here but... Yes, extensively. No they do not appear > > > > > there. > > > > > > > > > > I thought you had attended LPC and my talk where I mentioned this > > > > > purposefully as a drawback? > > > > > > > > > > I went out of my way to advertise this limitation at the LPC talk, in the > > > > > original series, etc. so it's a little disappointing that this is being > > > > > brought up so late, but nobody else has raised objections to this issue so > > > > > I think in general it's not a limitation that matters in practice. > > > > > > > > > Sorry for raising this now, yes at the time I believe we discussed > > reducing the vma slab memory usage for the PROT_NONE mappings. I > > didn't imagine that apps could have dependencies on the mapped ELF > > ranges in /proc/self/[s]maps until recent breakages from a similar > > feature. Android itself doesn't depend on this but what I've seen is > > banking apps and apps that have obfuscation to prevent reverse > > engineering (the particulars of such obfuscation are a black box). > > Ack ok fair enough, sorry, but obviously you can understand it's > frustrating when I went to great lengths to advertise this not only at the > talk but in the original series. > > Really important to have these discussions early. Not that really we can do > much about this, as inherently this feature cannot give you what you need. > > Is it _only_ banking apps that do this? And do they exclusively read > /proc/$pid/maps? I mean there's nothing we can do about that, sorry. Not only banking apps but that's a common category. > If that's immutable, then unless you do your own very, very, very slow custom > android maps implementation (that will absolutely break the /proc/$pid/maps > scalability efforts atm) this is just a no-go. > Yeah unfortunately that's immutable as app versions are mostly independent from the OS version. We do have something that handles this by encoding the guard regions in the vm_flags, but as you can imagine it's not generic enough for upstream. > > > > > > > > > > > > > > In the field, I've found that many applications read the ranges from > > > > > > /proc/self/[s]maps to determine what they can access (usually related > > > > > > to obfuscation techniques). If they don't know of the guard regions it > > > > > > would cause them to crash; I think that we'll need similar entries to > > > > > > PROT_NONE (---p) for these, and generally to maintain consistency > > > > > > between the behavior and what is being said from /proc/*/[s]maps. > > > > > > > > > > No, we cannot have these, sorry. > > > > > > > > > > Firstly /proc/$pid/[s]maps describes VMAs. The entire purpose of this > > > > > feature is to avoid having to accumulate VMAs for regions which are not > > > > > intended to be accessible. > > > > > > > > > > Secondly, there is no practical means for this to be accomplished in > > > > > /proc/$pid/maps in _any_ way - as no metadata relating to a VMA indicates > > > > > they have guard regions. > > > > > > > > > > This is intentional, because setting such metadata is simply not practical > > > > > - why? Because when you try to split the VMA, how do you know which bit > > > > > gets the metadata and which doesn't? You can't without _reading page > > > > > tables_. > > > > Yeah the splitting becomes complicated with any vm flags for this... > > meaning any attempt to expose this in /proc/*/maps have to > > unconditionally walk the page tables :( > > It's not really complicated, it's _impossible_ unless you made literally > all VMA code walk page tables for every single operation. Which we are > emphatically not going to do :) > > And no, /proc/$pid/maps is _never_ going to walk page tables. For obvious > performance reasons. > > > > > > > > > > > > > /proc/$pid/smaps _does_ read page tables, but we can't start pretending > > > > > VMAs exist when they don't, this would be completely inaccurate, would > > > > > break assumptions for things like mremap (which require a single VMA) and > > > > > would be unworkable. > > > > > > > > > > The best that _could_ be achieved is to have a marker in /proc/$pid/smaps > > > > > saying 'hey this region has guard regions somewhere'. > > > > > > > > And then simply expose it in /proc/$pid/pagemap, which is a better interface > > > > for this pte-level information inside of VMAs. We should still have a spare > > > > bit for that purpose in the pagemap entries. > > > > > > Ah yeah thanks David forgot about that! > > > > > > This is also a possibility if that'd solve your problems Kalesh? > > > > I'm not sure what is the correct interface to advertise these. Maybe > > smaps as you suggested since we already walk the page tables there? > > and pagemap bit for the exact pages as well? It won't solve this > > particular issue, as 1000s of in field apps do look at this through > > /proc/*/maps. But maybe we have to live with that... > > I mean why are we even considering this if you can't change this anywhere? > Confused by that. > > I'm afraid upstream can't radically change interfaces to suit this > scenario. > > We also can't change smaps in the way you want, it _has_ to still give > output per VMA information. Sorry I wasn't suggesting to change the entries in smaps, rather agreeing to your marker suggestion. Maybe a set of ranges for each smaps entry that has guards? It doesn't solve the use case, but does make these regions visible to userspace. > > The proposed change that would be there would be a flag or something > indicating that the VMA has guard regions _SOMEWHERE_ in it. > > Since this doesn't solve your problem, adds complexity, and nobody else > seems to need it, I would suggest this is not worthwhile and I'd rather not > do this. > > Therefore for your needs there are literally only two choices here: > > 1. Add a bit to /proc/$pid/pagemap OR > 2. a new interface. > > I am not in favour of a new interface here, if we can just extend pagemap. > > What you'd have to do is: > > 1. Find virtual ranges via /proc/$pid/maps > 2. iterate through /proc/$pid/pagemaps to retrieve state for all ranges. > Could we also consider an smaps field like: VmGuards: [AAA, BBB), [CCC, DDD), ... or something of that sort? > Since anything that would retrieve guard region state would need to walk > page tables, any approach would be slow and I don't think this would be any > less slow than any other interface. > > This way you'd be able to find all guard regions all the time. > > This is just the trade-off for this feature unfortunately - its whole > design ethos is to allow modification of -faulting- behaviour without > having to modify -VMA- behaviour. > > But if it's banking apps whose code you can't control (surprised you don't > lock down these interfaces), I mean is this even useful to you? > > If your requirement is 'you have to change /proc/$pid/maps to show guard > regions' I mean the answer is that we can't. > > > > > We can argue that such apps are broken since they may trip on the > > SIGBUS off the end of the file -- usually this isn't the case for the > > ELF segment mappings. > > Or tearing of the maps interface, or things getting unmapped or or > or... It's really not a sane thing to do. > > > > > This is still useful for other cases, I just wanted to get some ideas > > if this can be extended to further use cases. > > Well I'm glad that you guys find it useful for _something_ ;) > > Again this wasn't written only for you (it is broadly a good feature for > upstream), but I did have your use case in mind, so I'm a little > disappointed that it doesn't help, as I like to solve problems. > > But I'm glad it solves at least some for you... I recall Liam had a proposal to store the guard ranges in the maple tree? I wonder if that can be used in combination with this approach to have a better representation of this? > > > > > Thanks, > > Kalesh > > > > > > > > > > This bit will be fought over haha > > > > > > > > > > > -- > > > > Cheers, > > > > > > > > David / dhildenb > > > >