Message ID | 20240707002658.1917440-2-Jason@zx2c4.com |
---|---|
State | New |
Headers | show |
Series | implement getrandom() in vDSO | expand |
On 07.07.24 02:26, Jason A. Donenfeld wrote: Hi, having more generic support for VM_DROPPABLE sounds great, I was myself at some point looking for something like that. > The vDSO getrandom() implementation works with a buffer allocated with a > new system call that has certain requirements: > > - It shouldn't be written to core dumps. > * Easy: VM_DONTDUMP. > - It should be zeroed on fork. > * Easy: VM_WIPEONFORK. > > - It shouldn't be written to swap. > * Uh-oh: mlock is rlimited. > * Uh-oh: mlock isn't inherited by forks. > > It turns out that the vDSO getrandom() function has three really nice > characteristics that we can exploit to solve this problem: > > 1) Due to being wiped during fork(), the vDSO code is already robust to > having the contents of the pages it reads zeroed out midway through > the function's execution. > > 2) In the absolute worst case of whatever contingency we're coding for, > we have the option to fallback to the getrandom() syscall, and > everything is fine. > > 3) The buffers the function uses are only ever useful for a maximum of > 60 seconds -- a sort of cache, rather than a long term allocation. > > These characteristics mean that we can introduce VM_DROPPABLE, which > has the following semantics: > > a) It never is written out to swap. > b) Under memory pressure, mm can just drop the pages (so that they're > zero when read back again). > c) It is inherited by fork. > d) It doesn't count against the mlock budget, since nothing is locked. > > This is fairly simple to implement, with the one snag that we have to > use 64-bit VM_* flags, but this shouldn't be a problem, since the only > consumers will probably be 64-bit anyway. > > This way, allocations used by vDSO getrandom() can use: > > VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE > > And there will be no problem with using memory when not in use, not > wiping on fork(), coredumps, or writing out to swap. > > In order to let vDSO getrandom() use this, expose these via mmap(2) as > well, giving MAP_WIPEONFORK, MAP_DONTDUMP, and MAP_DROPPABLE. Patch subject would be better to talk about MAP_DROPPABLE now. But I don't immediately see why MAP_WIPEONFORK and MAP_DONTDUMP have to be mmap() flags. Using mmap(MAP_NORESERVE|MAP_DROPPABLE) with madvise() to configure these (for users that require that) should be good enough, just like they are for existing users. Thinking out loud, also MAP_DROPPABLE only sets a VMA flag (and does snot affect memory commitiing like MAP_NORESERVE), right? So MAP_DROPPABLE could easily become a madvise() option as well? (as you know, we only have limited mmap bits but plenty of madvise numbers available) Interestingly, when looking into something comparable in the past I stumbled over "vrange" [1], which would have had a slightly different semantic (signal on reaccess). And that did turn out to be more sutibale for madvise() flags [2], whereby vrange evolved into MADV_VOLATILE/MADV_NONVOLATILE A sticky MADV_VOLATILE vs. MADV_NONVOLATILE would actually sound pretty handy. (again, with your semantics, not the signal-on-reaccess kind of thing) ([2] is in general a good read; hey, it's been 10 years since that was brought up the last time!) There needs to be better reasoning why we have to consume three mmap bits for something that can likely be achieved without any. Maybe that was discussed with Linus and there is a pretty good reason for that. I'll also mention that I am unsure how MAP_DROPPABLE is supposed to interact with mlock. Maybe just like MADV_FREE currently does (no idea if that will work as intended ;) ). [1] https://lwn.net/Articles/590991/ [2] https://lwn.net/Articles/602650/ > > Finally, the provided self test ensures that this is working as desired. > > Cc: linux-mm@kvack.org > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > --- [...] > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 8c6cd8825273..57b8dad9adcc 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -623,7 +623,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, > may_expand_vm(mm, oldflags, nrpages)) > return -ENOMEM; > if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB| > - VM_SHARED|VM_NORESERVE))) { > + VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) { > charged = nrpages; > if (security_vm_enough_memory_mm(mm, charged)) > return -ENOMEM; I don't quite understand this change here. If MAP_DROPPABLE does not affect memory accounting during mmap(), it should not affect the same during mprotect(). VM_NORESERVE / MAP_NORESERVE is responsible for that. Did I missing something where MAP_DROPPABLE changes the memory accounting during mmap()? > diff --git a/mm/rmap.c b/mm/rmap.c > index e8fc5ecb59b2..56d7535d5cf6 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1397,7 +1397,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, > VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); > VM_BUG_ON_VMA(address < vma->vm_start || > address + (nr << PAGE_SHIFT) > vma->vm_end, vma); > - __folio_set_swapbacked(folio); > + /* VM_DROPPABLE mappings don't swap; instead they're just dropped when > + * under memory pressure. */ > + if (!(vma->vm_flags & VM_DROPPABLE)) > + __folio_set_swapbacked(folio); > __folio_set_anon(folio, vma, address, true); > > if (likely(!folio_test_large(folio))) { > @@ -1841,7 +1844,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, > * plus the rmap(s) (dropped by discard:). > */ > if (ref_count == 1 + map_count && > - !folio_test_dirty(folio)) { > + (!folio_test_dirty(folio) || > + /* Unlike MADV_FREE mappings, VM_DROPPABLE > + * ones can be dropped even if they've > + * been dirtied. */ We use /* * Comment start * Comment end */ styled comments in MM. > + (vma->vm_flags & VM_DROPPABLE))) { > dec_mm_counter(mm, MM_ANONPAGES); > goto discard; > } > @@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, > * discarded. Remap the page to page table. > */ > set_pte_at(mm, address, pvmw.pte, pteval); > - folio_set_swapbacked(folio); > + /* Unlike MADV_FREE mappings, VM_DROPPABLE ones > + * never get swap backed on failure to drop. */ > + if (!(vma->vm_flags & VM_DROPPABLE)) > + folio_set_swapbacked(folio); > ret = false; > page_vma_mapped_walk_done(&pvmw); > break; A note that in mm/mm-stable, "madvise_free_huge_pmd" exists to optimize MADV_FREE on PMDs. I suspect we'd want to extend that one as well for dropping support, but likely it would also only be a performance improvmeent and not affect functonality if not handled.
On Sun, 7 Jul 2024 at 00:42, David Hildenbrand <david@redhat.com> wrote: > > But I don't immediately see why MAP_WIPEONFORK and MAP_DONTDUMP have to > be mmap() flags. I don't think they have to be mmap() flags, but that said, I think it's technically the better alternative than saying "you have to madvise things later". I very much understand the "we don't have a lot of MAP_xyz flags and we don't want to waste them" argument, but at the same time (a) we _do_ have those flags (b) picking a worse interface seems bad (c) we could actually use the PROT_xyz bits, which we have a ton of And yes, (c) is ugly, but is it uglier than "use two system calls to do one thing"? I mean, "flags" and "prot" are just two sides of the same coin in the end, the split is kind of arbitrary, and "prot" only has four bits right now, and one of them is historical and useless, and actually happens to be *exactly* this kind of MAP_xyz bit. (In case it's not clear, I'm talking about PROT_SEM, which is very much a behavioral bit for broken architectures that we've actually never implemented). We also have PROT_GROSDOWN and PROT_GROWSUP , which is basically a "match MAP_GROWSxyz and change the mprotect() limits appropriately" So I actually think we could use the PROT_xyz bits, and anybody who says "those are for PROT_READ and PROT_WRITE is already very very wrong. Again - not pretty, but mappens to match reality. > Interestingly, when looking into something comparable in the past I > stumbled over "vrange" [1], which would have had a slightly different > semantic (signal on reaccess). We literally talked about exactly this with Jason, except unlike you I couldn't find the historical archive (I tried in vain to find something from lore). https://lore.kernel.org/lkml/CAHk-=whRpLyY+U9mkKo8O=2_BXNk=7sjYeObzFr3fGi0KLjLJw@mail.gmail.com/ I do think that a "explicit populate and get a signal on access" is a very valid model, but I think the "zero on access" is a more immediately real model. And we actually have had the "get signal on access" before: that's what VM_DONTCOPY is. And it was the *less* useful model, which is why we added VM_WIPEONCOPY, because that's the semantics people actually wanted. So I think the "signal on thrown out data access" is interesting, but not necessarily the *more* interesting case. And I think if we do want that case, I think having MAP_DROPPABLE have those semantics for MAP_SHARED would be the way to go. IOW, starting off with the "zero on next access after drop" case doesn't make it any harder to then later add a "fault on next access after drop" version. > There needs to be better reasoning why we have to consume three mmap > bits for something that can likely be achieved without any. I think it goes the other way: why are MAP_xyz bits so precious to make this harder to actually use? Together with that whole "maybe use PROT_xyz bits instead" discussion? Linus
On 07.07.24 20:19, Linus Torvalds wrote: > On Sun, 7 Jul 2024 at 00:42, David Hildenbrand <david@redhat.com> wrote: >> >> But I don't immediately see why MAP_WIPEONFORK and MAP_DONTDUMP have to >> be mmap() flags. Hi Linus, > > I don't think they have to be mmap() flags, but that said, I think > it's technically the better alternative than saying "you have to > madvise things later". Having the option to madvise usually means that you can toggle it on/off (e.g., MADV_DONTFORK vs. MADV_DOFORK). Not sure if having that option could be valuable here (droppable) as well; maybe not. > > I very much understand the "we don't have a lot of MAP_xyz flags and > we don't want to waste them" argument, but at the same time > > (a) we _do_ have those flags > > (b) picking a worse interface seems bad > > (c) we could actually use the PROT_xyz bits, which we have a ton of > I recall that introducing things like MAP_SHARED_VALIDATE received a lot of pushback in the past. But that was before my MM days, and I only had people tell me stories about it. (and at LSF/MM it's been a recurring theme that if you want to propose new MMAP flag, you're going to have a hard time) > And yes, (c) is ugly, but is it uglier than "use two system calls to > do one thing"? I mean, "flags" and "prot" are just two sides of the > same coin in the end, the split is kind of arbitrary, and "prot" only > has four bits right now, and one of them is historical and useless, > and actually happens to be *exactly* this kind of MAP_xyz bit. Yeah, I always had the same feeling about prot vs. flags. My understanding so far was that we should have madvise() ways to toggle stuff and add mmap bits if not avoidable; at least that's what I learned from the community. Good to hear that this is changing. (or it's just been an urban myth) I'll use your mail as reference in the future when that topic pops up ;) Maybe, historically we used madvise options so it's easier to sense which options the current kernel actually supports. (e.g., let mmap() succeed but let a separate madvise(MADV_HUGEPAGE) etc. fail if not supported by the kernel; no need to fail the whole operation). > > (In case it's not clear, I'm talking about PROT_SEM, which is very > much a behavioral bit for broken architectures that we've actually > never implemented). Yeah. > > We also have PROT_GROSDOWN and PROT_GROWSUP , which is basically a > "match MAP_GROWSxyz and change the mprotect() limits appropriately" It's the first time I hear about these two mprotect() options, thanks for mentioning that :) > > So I actually think we could use the PROT_xyz bits, and anybody who > says "those are for PROT_READ and PROT_WRITE is already very very > wrong. > > Again - not pretty, but mappens to match reality. > >> Interestingly, when looking into something comparable in the past I >> stumbled over "vrange" [1], which would have had a slightly different >> semantic (signal on reaccess). > > We literally talked about exactly this with Jason, except unlike you I > couldn't find the historical archive (I tried in vain to find > something from lore). Good that you discussed it, I primarily scanned this patch set here only. I took notes back when I was looking for something like VM_DROPPABLE (also, being more interested in the non-signal version for a VM cache use case). > > https://lore.kernel.org/lkml/CAHk-=whRpLyY+U9mkKo8O=2_BXNk=7sjYeObzFr3fGi0KLjLJw@mail.gmail.com/ > > I do think that a "explicit populate and get a signal on access" is a > very valid model, but I think the "zero on access" is a more > immediately real model. > > And we actually have had the "get signal on access" before: that's > what VM_DONTCOPY is. > > And it was the *less* useful model, which is why we added > VM_WIPEONCOPY, because that's the semantics people actually wanted. > > So I think the "signal on thrown out data access" is interesting, but > not necessarily the *more* interesting case. Absolutely agreed. > > And I think if we do want that case, I think having MAP_DROPPABLE have > those semantics for MAP_SHARED would be the way to go. IOW, starting > off with the "zero on next access after drop" case doesn't make it any > harder to then later add a "fault on next access after drop" version. > >> There needs to be better reasoning why we have to consume three mmap >> bits for something that can likely be achieved without any. > > I think it goes the other way: why are MAP_xyz bits so precious to > make this harder to actually use? If things changed and we can have as many as we want, good! Things like MADV_HUGEPAGE/MADV_MERGEABLE might benefit from a mmap flag as well.
On Sun, 7 Jul 2024 at 11:52, David Hildenbrand <david@redhat.com> wrote: > > I recall that introducing things like MAP_SHARED_VALIDATE received a lot > of pushback in the past. But that was before my MM days, and I only had > people tell me stories about it. I think MAP_SHARED_VALIDATE was mostly about worrying about the API impact. And I think it worked out so well that this is probably the first time it has been brought up ever since ;) That said, the *reason* for MAP_SHARED_VALIDATE is actually very valid: we have historically just ignored any random flags in the mmap() interfaces, and with shared mappings, that can be dangerous. IOW, the real issue wasn't MAP_SHARED_VALIDATE itself, but introducing *other* flags that affected maps that old kernels would ignore, and then the worry was "now old kernels and new kernels work very differently for this binary". That's technically obviously true of any MAP_DROPPABLE thing too - old kernels would happily just ignore it. I suspect that's more of a feature than a mis-feature, but.. > My understanding so far was that we should have madvise() ways to toggle > stuff and add mmap bits if not avoidable; at least that's what I learned > from the community. It doesn't sound like a bad model in general. I'm not entirely sure it makes sense for something like "droppable", since that is a fairly fundamental behavioral thing. Does it make sense to make something undroppable when it can drop pages concurrently with that operation? I mean, you can't switch MAP_SHARED around either. The other bits already _do_ have madvise() things, and Jason added a way to just do it all in one go. > Good to hear that this is changing. (or it's just been an urban myth) I don't know if that's an urban myth. Some people are a *lot* more risk-averse than I personally am. I want things to make sense, but I also consider "this is fixable if it causes issues" to be a valid argument. So for example, who knows *what* garbage people pass off to mmap() as an argument. That worry was why MAP_SHARED_VALIDATE happened. But at the same time, does it make sense to complicate things because of some theoretical worry? Giving random bits to mmap() sounds unlikely to be a real issue to me, but maybe I'm being naive. I do generally think that user mode programs can pretty much be expected to do random things, but how do you even *create* a mmap MAP_xyz flags field that has random high bits set? > > We also have PROT_GROSDOWN and PROT_GROWSUP , which is basically a > > "match MAP_GROWSxyz and change the mprotect() limits appropriately" > > It's the first time I hear about these two mprotect() options, thanks > for mentioning that :) Don't thank me. They actually do make sense in a "what if I want to mprotect() the stack, but I don't know what the stack range is since it's dynamic" kind of sense, so I certainly don't hate them. So they are not bad bits, but at the same time they are examples of how there is a fuzzy line between MAP_xyz and PROT_xyz. And sometimes the line is literally just "mprotect() only gets one of them, but we want to pass in the other one, so we duplicate them as a very very special case". Linus
On 07.07.24 21:22, Linus Torvalds wrote: > On Sun, 7 Jul 2024 at 11:52, David Hildenbrand <david@redhat.com> wrote: >> >> I recall that introducing things like MAP_SHARED_VALIDATE received a lot >> of pushback in the past. But that was before my MM days, and I only had >> people tell me stories about it. > > I think MAP_SHARED_VALIDATE was mostly about worrying about the API impact. > > And I think it worked out so well that this is probably the first time > it has been brought up ever since ;) > > That said, the *reason* for MAP_SHARED_VALIDATE is actually very > valid: we have historically just ignored any random flags in the > mmap() interfaces, and with shared mappings, that can be dangerous. > > IOW, the real issue wasn't MAP_SHARED_VALIDATE itself, but introducing > *other* flags that affected maps that old kernels would ignore, and > then the worry was "now old kernels and new kernels work very > differently for this binary". > > That's technically obviously true of any MAP_DROPPABLE thing too - old > kernels would happily just ignore it. I suspect that's more of a > feature than a mis-feature, but.. > >> My understanding so far was that we should have madvise() ways to toggle >> stuff and add mmap bits if not avoidable; at least that's what I learned >> from the community. > > It doesn't sound like a bad model in general. I'm not entirely sure it > makes sense for something like "droppable", since that is a fairly > fundamental behavioral thing. Does it make sense to make something > undroppable when it can drop pages concurrently with that operation? > > I mean, you can't switch MAP_SHARED around either. > > The other bits already _do_ have madvise() things, and Jason added a > way to just do it all in one go. I just recalled that with MAP_HUGETLB, bits [26:31] encode a hugetlb size (see include/uapi/asm-generic/hugetlb_encode.h). hugetlb, the gift that keeps on giving. We're using: +#define MAP_WIPEONFORK 0x08000000 /* Zero memory in child forks. */ +#define MAP_DONTDUMP 0x10000000 /* Do not write to coredumps. */ +#define MAP_DROPPABLE 0x20000000 /* Zero memory under memory pressure. */ Which should be bit 27-29. So using these flags with MAP_HUGETLB will result in surprises. At least MAP_DROPPABLE doesn't quite make sense with hugetlb, but at least the other ones do have semantics with hugetlb? It's late Sunday here in Germany, so I might just have messed something up. Just raising that there might be a "bit" conflict.
On Sun, 7 Jul 2024 at 14:01, David Hildenbrand <david@redhat.com> wrote: > > At least MAP_DROPPABLE doesn't quite make sense with hugetlb, but at least > the other ones do have semantics with hugetlb? Hmm. How about we just say that VM_DROPPABLE really is something separate from MAP_PRIVATE or MAP_SHARED.. And then we make the rule be that VM_DROPPABLE is never dumped and always dropped on fork, just to make things simpler. It not only avoids a flag, but it actually makes sense: the pages aren't stable for dumping anyway, and not copying them on fork() not only avoids some overhead, but makes it much more reliable and testable. IOW, how about taking this approach: --- a/include/uapi/linux/mman.h +++ b/include/uapi/linux/mman.h @@ -17,5 +17,6 @@ #define MAP_SHARED 0x01 /* Share changes */ #define MAP_PRIVATE 0x02 /* Changes are private */ #define MAP_SHARED_VALIDATE 0x03 /* share + validate extension flags */ +#define MAP_DROPPABLE 0x08 /* 4 is not in MAP_TYPE on parisc? */ /* with do_mmap() doing: --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1369,6 +1369,23 @@ unsigned long do_mmap(struct file *file, pgoff = 0; vm_flags |= VM_SHARED | VM_MAYSHARE; break; + case MAP_DROPPABLE: + /* + * A locked or stack area makes no sense to + * be droppable. + * + * Also, since droppable pages can just go + * away at any time, it makes no sense to + * copy them on fork or dump them. + */ + if (flags & MAP_LOCKED) + return -EINVAL; + if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) + return -EINVAL; + + vm_flags |= VM_DROPPABLE; + vm_flags |= VM_WIPEONFORK | VM_DONTDUMP; + fallthrough; case MAP_PRIVATE: /* * Set pgoff according to addr for anon_vma. which looks rather simple. The only oddity is that parisc thing - every other archiecture has the MAP_TYPE bits being 0xf, but parisc uses 0x2b (also four bits, but instead of the low four bits it's 00101011 - strange). So using 8 as a MAP_TYPE bit for MAP_DROPPABLE works everywhere, and if we eventually want to do a "signaling" MAP_DROPPABLE we could use 9. This has the added advantage that if somebody does this on an old kernel,. they *will* get an error. Because unlike the 'flag' bits in general, the MAP_TYPE bit space has always been tested. Hmm? Linus
Hi David, Thanks a lot for the review of the code. Am very glad to have somebody who knows this code take a careful look at it. On Sun, Jul 07, 2024 at 09:42:38AM +0200, David Hildenbrand wrote: > Patch subject would be better to talk about MAP_DROPPABLE now. Will do. Or, well, in light of the conversation downthread, MAP_DROPPABLE. > But I don't immediately see why MAP_WIPEONFORK and MAP_DONTDUMP have to > be mmap() flags. Using mmap(MAP_NORESERVE|MAP_DROPPABLE) with madvise() > to configure these (for users that require that) should be good enough, > just like they are for existing users. I looked into that too, and coming up with some clunky mechanism for automating several calls to madvise() for each thing. I could make it work need be, but it's really not nice. And it sort of then leads in the direction, "this interface isn't great; why don't you just make a dedicated syscall that does everything you need in one fell swoop," which is explicitly what Linus doesn't want. Making it accessible to mmap() instead makes it more of a direct thing that isn't a whole new syscall. Anyway, it indeed looks like there are more PROT_ bits available, and also that PROT_ has been used this way before. In addition to PROT_SEM, there are a few arch-specific PROT_ bits that seem similar enough. The distinction is pretty blurry between MAP_ and PROT_. So I'll just move this to PROT_ for v+1. > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > index 8c6cd8825273..57b8dad9adcc 100644 > > --- a/mm/mprotect.c > > +++ b/mm/mprotect.c > > @@ -623,7 +623,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, > > may_expand_vm(mm, oldflags, nrpages)) > > return -ENOMEM; > > if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB| > > - VM_SHARED|VM_NORESERVE))) { > > + VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) { > > charged = nrpages; > > if (security_vm_enough_memory_mm(mm, charged)) > > return -ENOMEM; > > I don't quite understand this change here. If MAP_DROPPABLE does not > affect memory accounting during mmap(), it should not affect the same > during mprotect(). VM_NORESERVE / MAP_NORESERVE is responsible for that. > > Did I missing something where MAP_DROPPABLE changes the memory > accounting during mmap()? Actually, I think I errored by not adding it to mmap() (via the check in accountable_mapping(), I believe), and I should add it there. That also might be another reason why this is better as a MAP_ (or, rather PROT_) bit, rather than an madvise call. Tell me if you disagree, as I might be way off here. But I was thinking that because the system can just "drop" this memory, it's not sensible to account for it, because it can be taken right back. > > diff --git a/mm/rmap.c b/mm/rmap.c > We use > > /* > * Comment start > * Comment end > */ > > styled comments in MM. Fixed. > > > + (vma->vm_flags & VM_DROPPABLE))) { > > dec_mm_counter(mm, MM_ANONPAGES); > > goto discard; > > } > > @@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, > > * discarded. Remap the page to page table. > > */ > > set_pte_at(mm, address, pvmw.pte, pteval); > > - folio_set_swapbacked(folio); > > + /* Unlike MADV_FREE mappings, VM_DROPPABLE ones > > + * never get swap backed on failure to drop. */ > > + if (!(vma->vm_flags & VM_DROPPABLE)) > > + folio_set_swapbacked(folio); > > ret = false; > > page_vma_mapped_walk_done(&pvmw); > > break; > > A note that in mm/mm-stable, "madvise_free_huge_pmd" exists to optimize > MADV_FREE on PMDs. I suspect we'd want to extend that one as well for > dropping support, but likely it would also only be a performance > improvmeent and not affect functonality if not handled. That's for doing the freeing of PTEs after the fact, right? If the mapping was created, got filled with some data, and then sometime later it got MADV_FREE'd, which is the pattern people follow typically with MADV_FREE. If we do this as PROT_/MAP_, then that's not a case we need to worry about, if I understand this code correctly. Jason
Hi Linus, On Sun, Jul 07, 2024 at 11:19:46AM -0700, Linus Torvalds wrote: > (c) we could actually use the PROT_xyz bits, which we have a ton of As I just wrote to David, I'll move this to PROT_xyz. By the way, in addition to the PROT_SEM historical artifact, there's also architecture specific ones like PROT_ADI on SPARC, PROT_SAO on PowerPC, and PROT_BTI and PROT_MTE on arm64. So the MAP_ vs PROT_ distinction seems kind of blurred anyway. > And we actually have had the "get signal on access" before: that's > what VM_DONTCOPY is. > > And it was the *less* useful model, which is why we added > VM_WIPEONCOPY, because that's the semantics people actually wanted. > > So I think the "signal on thrown out data access" is interesting, but > not necessarily the *more* interesting case. FYI, I looked into using VM_DONTCOPY/MADV_DONTFORK for my purposes, because it could possibly make another problem easier, but I couldn't figure out how to make it smoothly work. Specifically, a program has a bunch of threads, and some of them have a vgetrandom state in use, carved out of the same page. One of the threads forks. In the VM_WIPEONFORK case, the fork child has to reclaim the states that were in use by other threads at the time of the fork and return them to the pool of available slices. In the VM_DONTCOPY case, that's not necessary, which is kind of nice. But if the program forked in the signal handler and then returned to an in progress vgetrandom operation, now there's a signal that needs to be handled internally, identified as belonging to the internal state areas, and not bubbled up to other code. This seems difficult and fraught. It's far easier to just have the memory be zeroed and have the code unconditionally check for that at the same time it's doing other consistency checks. So yea it just seems a lot more desirable to have the behavior be zeroing rather than an asynchronous signal, because code can straightforwardly deal with that inline. Jason
On 08.07.24 02:08, Linus Torvalds wrote: > On Sun, 7 Jul 2024 at 14:01, David Hildenbrand <david@redhat.com> wrote: >> >> At least MAP_DROPPABLE doesn't quite make sense with hugetlb, but at least >> the other ones do have semantics with hugetlb? > > Hmm. > > How about we just say that VM_DROPPABLE really is something separate > from MAP_PRIVATE or MAP_SHARED.. So it would essentially currently imply MAP_ANON|MAP_PRIVATE, without COW (not shared with a child process). Then, we should ignore any fd+offset that is passed (or bail out); I assume that's what your proposal below does automatically without diving into the code. > > And then we make the rule be that VM_DROPPABLE is never dumped and > always dropped on fork, just to make things simpler. The semantics are much more intuitive. No need for separate mmap flags. > > It not only avoids a flag, but it actually makes sense: the pages > aren't stable for dumping anyway, and not copying them on fork() not > only avoids some overhead, but makes it much more reliable and > testable. > > IOW, how about taking this approach: > > --- a/include/uapi/linux/mman.h > +++ b/include/uapi/linux/mman.h > @@ -17,5 +17,6 @@ > #define MAP_SHARED 0x01 /* Share changes */ > #define MAP_PRIVATE 0x02 /* Changes are private */ > #define MAP_SHARED_VALIDATE 0x03 /* share + validate extension flags */ > +#define MAP_DROPPABLE 0x08 /* 4 is not in MAP_TYPE on parisc? */ > > /* > > with do_mmap() doing: > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1369,6 +1369,23 @@ unsigned long do_mmap(struct file *file, > pgoff = 0; > vm_flags |= VM_SHARED | VM_MAYSHARE; > break; > + case MAP_DROPPABLE: > + /* > + * A locked or stack area makes no sense to > + * be droppable. > + * > + * Also, since droppable pages can just go > + * away at any time, it makes no sense to > + * copy them on fork or dump them. > + */ > + if (flags & MAP_LOCKED) > + return -EINVAL; Likely we'll have to adjust mlock() as well. Also, I think we should just bail out with hugetlb as well. > + if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) > + return -EINVAL; > + > + vm_flags |= VM_DROPPABLE; > + vm_flags |= VM_WIPEONFORK | VM_DONTDUMP; Further, maybe we want to disallow madvise() clearing these flags here, just to be consistent. > + fallthrough; > case MAP_PRIVATE: > /* > * Set pgoff according to addr for anon_vma. > > which looks rather simple. > > The only oddity is that parisc thing - every other archiecture has the > MAP_TYPE bits being 0xf, but parisc uses 0x2b (also four bits, but > instead of the low four bits it's 00101011 - strange). I assume, changing that would have the risk of breaking stupid user space, right? (that sets a bit without any semantics) > > So using 8 as a MAP_TYPE bit for MAP_DROPPABLE works everywhere, and > if we eventually want to do a "signaling" MAP_DROPPABLE we could use > 9. Sounds good enough. > > This has the added advantage that if somebody does this on an old > kernel,. they *will* get an error. Because unlike the 'flag' bits in > general, the MAP_TYPE bit space has always been tested. > > Hmm? As a side note, I'll raise that I am not a particular fan of the "droppable" terminology, at least with the "read 0s" approach. From a user perspective, the memory might suddenly lose its state and read as 0s just like volatile memory when it loses power. "dropping pages" sounds more like an implementation detail. Something like MAP_VOLATILE might be more intuitive (similar to the proposed MADV_VOLATILE). But naming is hard, just mentioning to share my thought :)
> As a side note, I'll raise that I am not a particular fan of the > "droppable" terminology, at least with the "read 0s" approach. > > From a user perspective, the memory might suddenly lose its state and > read as 0s just like volatile memory when it loses power. "dropping > pages" sounds more like an implementation detail. Just to raise why I consider "dropping" an implementation detail: in combination with a previous idea I had of exposing "nonvolatile" memory to VMs, the following might be interesting: A hypervisor could expose special "nonvolatile memory" as separate guest physical memory region to a VM. We could use that special memory to back these MAP_XXX regions in our guest, in addition to trying to make use of them in the guest kernel, for example for something similar to cleancache. Long story short: it's the hypervisor that could be effectively dropping/zeroing out that memory, not the guest VM. "NONVOLATILE" might be clearer than "DROPPABLE". But again, naming is hard ... :)
Hi Linus, On Sun, Jul 07, 2024 at 05:08:29PM -0700, Linus Torvalds wrote: > + vm_flags |= VM_DROPPABLE; > + vm_flags |= VM_WIPEONFORK | VM_DONTDUMP; > which looks rather simple. That is nice, though I would add that if we're implying things that are sensible to imply, it really also needs to add VM_NORESERVE too. DROPPABLE doesn't make sense semantically without it. Anyway, rather than adding PROT_xyz for v+1, I'll try adding this MAP_DROPPABLE (or a different name for David) with the implications as you've suggested. Jason
Hi David, On Mon, Jul 08, 2024 at 10:11:24AM +0200, David Hildenbrand wrote: > The semantics are much more intuitive. No need for separate mmap flags. Agreed. > Likely we'll have to adjust mlock() as well. Also, I think we should > just bail out with hugetlb as well. Ack. > Further, maybe we want to disallow madvise() clearing these flags here, > just to be consistent. Good thinking. > As a side note, I'll raise that I am not a particular fan of the > "droppable" terminology, at least with the "read 0s" approach. > > From a user perspective, the memory might suddenly lose its state and > read as 0s just like volatile memory when it loses power. "dropping > pages" sounds more like an implementation detail. > > Something like MAP_VOLATILE might be more intuitive (similar to the > proposed MADV_VOLATILE). > > But naming is hard, just mentioning to share my thought :) Naming is hard, but *renaming* is annoying. I like droppable simply because that's what I've been calling it in my head. MAP_VOLATILE is fine with me though, and seems reasonable enough. So I'll name it that, and then please don't change your mind about it later so I won't have to rename everything again. :) Jason
On Mon, Jul 08, 2024 at 10:23:10AM +0200, David Hildenbrand wrote: > > As a side note, I'll raise that I am not a particular fan of the > > "droppable" terminology, at least with the "read 0s" approach. > > > > From a user perspective, the memory might suddenly lose its state and > > read as 0s just like volatile memory when it loses power. "dropping > > pages" sounds more like an implementation detail. > > Long story short: it's the hypervisor that could be effectively > dropping/zeroing out that memory, not the guest VM. "NONVOLATILE" might > be clearer than "DROPPABLE". Surely you mean "VOLATILE", not "NONVOLATILE", right? Jason
Hi David, Linus, Below is what I understand the suggestions about the UX to be. The full commit is in https://git.zx2c4.com/linux-rng/log/ but here's the part we've been discussing. I've held off on David's suggestion changing "DROPPABLE" to "VOLATILE" to give Linus some time to wake up on the west coast and voice his preference for "DROPPABLE". But the rest is in place. Jason diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h index a246e11988d5..e89d00528f2f 100644 --- a/include/uapi/linux/mman.h +++ b/include/uapi/linux/mman.h @@ -17,6 +17,7 @@ #define MAP_SHARED 0x01 /* Share changes */ #define MAP_PRIVATE 0x02 /* Changes are private */ #define MAP_SHARED_VALIDATE 0x03 /* share + validate extension flags */ +#define MAP_DROPPABLE 0x08 /* Zero memory under memory pressure. */ /* * Huge page size encoding when MAP_HUGETLB is specified, and a huge page diff --git a/mm/madvise.c b/mm/madvise.c index a77893462b92..cba5bc652fc4 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1068,13 +1068,16 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, new_flags |= VM_WIPEONFORK; break; case MADV_KEEPONFORK: + if (vma->vm_flags & VM_DROPPABLE) + return -EINVAL; new_flags &= ~VM_WIPEONFORK; break; case MADV_DONTDUMP: new_flags |= VM_DONTDUMP; break; case MADV_DODUMP: - if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) + if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) || + (vma->vm_flags & VM_DROPPABLE)) return -EINVAL; new_flags &= ~VM_DONTDUMP; break; diff --git a/mm/mlock.c b/mm/mlock.c index 30b51cdea89d..b87b3d8cc9cc 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -485,7 +485,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, if (newflags == oldflags || (oldflags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) || - vma_is_dax(vma) || vma_is_secretmem(vma)) + vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE)) /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */ goto out; diff --git a/mm/mmap.c b/mm/mmap.c index 83b4682ec85c..b3d38179dd42 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1369,6 +1369,34 @@ unsigned long do_mmap(struct file *file, unsigned long addr, pgoff = 0; vm_flags |= VM_SHARED | VM_MAYSHARE; break; + case MAP_DROPPABLE: + /* + * A locked or stack area makes no sense to be droppable. + * + * Also, since droppable pages can just go away at any time + * it makes no sense to copy them on fork or dump them. + * + * And don't attempt to combine with hugetlb for now. + */ + if (flags & (MAP_LOCKED | MAP_HUGETLB)) + return -EINVAL; + if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP)) + return -EINVAL; + + vm_flags |= VM_DROPPABLE; + + /* + * If the pages can be dropped, then it doesn't make + * sense to reserve them. + */ + vm_flags |= VM_NORESERVE; + + /* + * Likewise, they're volatile enough that they + * shouldn't survive forks or coredumps. + */ + vm_flags |= VM_WIPEONFORK | VM_DONTDUMP; + fallthrough; case MAP_PRIVATE: /* * Set pgoff according to addr for anon_vma.
On 08.07.24 15:57, Jason A. Donenfeld wrote: > On Mon, Jul 08, 2024 at 10:23:10AM +0200, David Hildenbrand wrote: >>> As a side note, I'll raise that I am not a particular fan of the >>> "droppable" terminology, at least with the "read 0s" approach. >>> >>> From a user perspective, the memory might suddenly lose its state and >>> read as 0s just like volatile memory when it loses power. "dropping >>> pages" sounds more like an implementation detail. >> >> Long story short: it's the hypervisor that could be effectively >> dropping/zeroing out that memory, not the guest VM. "NONVOLATILE" might >> be clearer than "DROPPABLE". > > Surely you mean "VOLATILE", not "NONVOLATILE", right? Yes, typo :)
On 08.07.24 15:55, Jason A. Donenfeld wrote: > Hi David, > > On Mon, Jul 08, 2024 at 10:11:24AM +0200, David Hildenbrand wrote: >> The semantics are much more intuitive. No need for separate mmap flags. > > Agreed. > >> Likely we'll have to adjust mlock() as well. Also, I think we should >> just bail out with hugetlb as well. > > Ack. > >> Further, maybe we want to disallow madvise() clearing these flags here, >> just to be consistent. > > Good thinking. > >> As a side note, I'll raise that I am not a particular fan of the >> "droppable" terminology, at least with the "read 0s" approach. >> >> From a user perspective, the memory might suddenly lose its state and >> read as 0s just like volatile memory when it loses power. "dropping >> pages" sounds more like an implementation detail. >> >> Something like MAP_VOLATILE might be more intuitive (similar to the >> proposed MADV_VOLATILE). >> >> But naming is hard, just mentioning to share my thought :) > > Naming is hard, but *renaming* is annoying. I like droppable simply > because that's what I've been calling it in my head. MAP_VOLATILE is > fine with me though, and seems reasonable enough. So I'll name it that, > and then please don't change your mind about it later so I won't have to > rename everything again. :) :) Nah. But wait with any remaining until more than one person thinks it's a good idea.
On 08.07.24 16:40, Jason A. Donenfeld wrote: > Hi David, Linus, > > Below is what I understand the suggestions about the UX to be. The full > commit is in https://git.zx2c4.com/linux-rng/log/ but here's the part > we've been discussing. I've held off on David's suggestion changing > "DROPPABLE" to "VOLATILE" to give Linus some time to wake up on the west > coast and voice his preference for "DROPPABLE". But the rest is in > place. > > Jason > > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h > index a246e11988d5..e89d00528f2f 100644 > --- a/include/uapi/linux/mman.h > +++ b/include/uapi/linux/mman.h > @@ -17,6 +17,7 @@ > #define MAP_SHARED 0x01 /* Share changes */ > #define MAP_PRIVATE 0x02 /* Changes are private */ > #define MAP_SHARED_VALIDATE 0x03 /* share + validate extension flags */ > +#define MAP_DROPPABLE 0x08 /* Zero memory under memory pressure. */ > > /* > * Huge page size encoding when MAP_HUGETLB is specified, and a huge page > diff --git a/mm/madvise.c b/mm/madvise.c > index a77893462b92..cba5bc652fc4 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1068,13 +1068,16 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, > new_flags |= VM_WIPEONFORK; > break; > case MADV_KEEPONFORK: > + if (vma->vm_flags & VM_DROPPABLE) > + return -EINVAL; > new_flags &= ~VM_WIPEONFORK; > break; > case MADV_DONTDUMP: > new_flags |= VM_DONTDUMP; > break; > case MADV_DODUMP: > - if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) > + if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) || > + (vma->vm_flags & VM_DROPPABLE)) > return -EINVAL; > new_flags &= ~VM_DONTDUMP; > break; > diff --git a/mm/mlock.c b/mm/mlock.c > index 30b51cdea89d..b87b3d8cc9cc 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -485,7 +485,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, > > if (newflags == oldflags || (oldflags & VM_SPECIAL) || > is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) || > - vma_is_dax(vma) || vma_is_secretmem(vma)) > + vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE)) > /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */ > goto out; > > diff --git a/mm/mmap.c b/mm/mmap.c > index 83b4682ec85c..b3d38179dd42 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1369,6 +1369,34 @@ unsigned long do_mmap(struct file *file, unsigned long addr, > pgoff = 0; > vm_flags |= VM_SHARED | VM_MAYSHARE; > break; > + case MAP_DROPPABLE: > + /* > + * A locked or stack area makes no sense to be droppable. > + * > + * Also, since droppable pages can just go away at any time > + * it makes no sense to copy them on fork or dump them. > + * > + * And don't attempt to combine with hugetlb for now. > + */ > + if (flags & (MAP_LOCKED | MAP_HUGETLB)) > + return -EINVAL; > + if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP)) > + return -EINVAL; > + > + vm_flags |= VM_DROPPABLE; > + > + /* > + * If the pages can be dropped, then it doesn't make > + * sense to reserve them. > + */ > + vm_flags |= VM_NORESERVE; That is certainly interesting. Nothing that we might not be able to reclaim these pages reliably in all cases: for example when long-term pinning them. In some environments (OVERCOMMIT_NEVER) MAP_NORESERE would never be effective. I wonder if we want to stick to the same behavior here ... but in theory I agree that we can set this here unconditionally, it's just the corner case of "there are ways to prohibit reclaim" that makes me wonder. BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would behave without any swap space around. Did you experiment with that? I'm reading can_reclaim_anon_pages(), and I'm wondering how good/reliable that works when there is no swap configured. Also, the comment in get_scan_count(): "If we have no swap space, do not bother scanning anon folios." makes me wonder if some work in that area is needed.
On 08.07.24 03:46, Jason A. Donenfeld wrote: > Hi David, Hi Jason, just catching up on mails here. Most of the stuff is now clear from the other subthread. [...] >>> @@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, >>> * discarded. Remap the page to page table. >>> */ >>> set_pte_at(mm, address, pvmw.pte, pteval); >>> - folio_set_swapbacked(folio); >>> + /* Unlike MADV_FREE mappings, VM_DROPPABLE ones >>> + * never get swap backed on failure to drop. */ >>> + if (!(vma->vm_flags & VM_DROPPABLE)) >>> + folio_set_swapbacked(folio); >>> ret = false; >>> page_vma_mapped_walk_done(&pvmw); >>> break; >> >> A note that in mm/mm-stable, "madvise_free_huge_pmd" exists to optimize >> MADV_FREE on PMDs. I suspect we'd want to extend that one as well for >> dropping support, but likely it would also only be a performance >> improvmeent and not affect functonality if not handled. > > That's for doing the freeing of PTEs after the fact, right? If the > mapping was created, got filled with some data, and then sometime later > it got MADV_FREE'd, which is the pattern people follow typically with > MADV_FREE. If we do this as PROT_/MAP_, then that's not a case we need > to worry about, if I understand this code correctly. We essentially now have code to handle PMD-mapped THP: instead of first remapping them using PTEs to then unmap+discard via 512 PTEs (due to MADV_FREE being set on the folio), we can now simply unmap+discard a single PMD. So performance wise, this might be interesting for this mechanism as well (when used in combination with THP).
On 08.07.24 22:21, David Hildenbrand wrote: > On 08.07.24 16:40, Jason A. Donenfeld wrote: >> Hi David, Linus, >> >> Below is what I understand the suggestions about the UX to be. The full >> commit is in https://git.zx2c4.com/linux-rng/log/ but here's the part >> we've been discussing. I've held off on David's suggestion changing >> "DROPPABLE" to "VOLATILE" to give Linus some time to wake up on the west >> coast and voice his preference for "DROPPABLE". But the rest is in >> place. >> >> Jason >> >> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h >> index a246e11988d5..e89d00528f2f 100644 >> --- a/include/uapi/linux/mman.h >> +++ b/include/uapi/linux/mman.h >> @@ -17,6 +17,7 @@ >> #define MAP_SHARED 0x01 /* Share changes */ >> #define MAP_PRIVATE 0x02 /* Changes are private */ >> #define MAP_SHARED_VALIDATE 0x03 /* share + validate extension flags */ >> +#define MAP_DROPPABLE 0x08 /* Zero memory under memory pressure. */ >> >> /* >> * Huge page size encoding when MAP_HUGETLB is specified, and a huge page >> diff --git a/mm/madvise.c b/mm/madvise.c >> index a77893462b92..cba5bc652fc4 100644 >> --- a/mm/madvise.c >> +++ b/mm/madvise.c >> @@ -1068,13 +1068,16 @@ static int madvise_vma_behavior(struct vm_area_struct *vma, >> new_flags |= VM_WIPEONFORK; >> break; >> case MADV_KEEPONFORK: >> + if (vma->vm_flags & VM_DROPPABLE) >> + return -EINVAL; >> new_flags &= ~VM_WIPEONFORK; >> break; >> case MADV_DONTDUMP: >> new_flags |= VM_DONTDUMP; >> break; >> case MADV_DODUMP: >> - if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) >> + if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) || >> + (vma->vm_flags & VM_DROPPABLE)) >> return -EINVAL; >> new_flags &= ~VM_DONTDUMP; >> break; >> diff --git a/mm/mlock.c b/mm/mlock.c >> index 30b51cdea89d..b87b3d8cc9cc 100644 >> --- a/mm/mlock.c >> +++ b/mm/mlock.c >> @@ -485,7 +485,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, >> >> if (newflags == oldflags || (oldflags & VM_SPECIAL) || >> is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) || >> - vma_is_dax(vma) || vma_is_secretmem(vma)) >> + vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE)) >> /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */ >> goto out; >> >> diff --git a/mm/mmap.c b/mm/mmap.c >> index 83b4682ec85c..b3d38179dd42 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -1369,6 +1369,34 @@ unsigned long do_mmap(struct file *file, unsigned long addr, >> pgoff = 0; >> vm_flags |= VM_SHARED | VM_MAYSHARE; >> break; >> + case MAP_DROPPABLE: >> + /* >> + * A locked or stack area makes no sense to be droppable. >> + * >> + * Also, since droppable pages can just go away at any time >> + * it makes no sense to copy them on fork or dump them. >> + * >> + * And don't attempt to combine with hugetlb for now. >> + */ >> + if (flags & (MAP_LOCKED | MAP_HUGETLB)) >> + return -EINVAL; >> + if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP)) >> + return -EINVAL; >> + >> + vm_flags |= VM_DROPPABLE; >> + >> + /* >> + * If the pages can be dropped, then it doesn't make >> + * sense to reserve them. >> + */ >> + vm_flags |= VM_NORESERVE; > > That is certainly interesting. Nothing that we might not be able to "Nothing" -> "I'll note that" :)
Hi David, On Mon, Jul 08, 2024 at 10:21:09PM +0200, David Hildenbrand wrote: > BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would > behave without any swap space around. > > Did you experiment with that? You mean on a system without any swap configured? That's actually my primary test environment for this. It behaves as expected: when ram fills up and the scanner is trying to reclaim what it can, folio_test_swapbacked(folio) is false, and the memory gets freed. After, reads fault in a zero page. So it's working as expected. Jason
On 09.07.24 04:17, Jason A. Donenfeld wrote: > Hi David, > > On Mon, Jul 08, 2024 at 10:21:09PM +0200, David Hildenbrand wrote: >> BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would >> behave without any swap space around. >> >> Did you experiment with that? > > You mean on a system without any swap configured? That's actually my > primary test environment for this. It behaves as expected: when ram > fills up and the scanner is trying to reclaim what it can, > folio_test_swapbacked(folio) is false, and the memory gets freed. After, > reads fault in a zero page. So it's working as expected. Okay, just to be clear: no swap/zram/zswap. The reclaim code regarding not scanning anonymous memory without swap was a bit confusing. thanks!
On Wed, Jul 10, 2024 at 05:05:06AM +0200, David Hildenbrand wrote: > On 09.07.24 04:17, Jason A. Donenfeld wrote: > > Hi David, > > > > On Mon, Jul 08, 2024 at 10:21:09PM +0200, David Hildenbrand wrote: > >> BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would > >> behave without any swap space around. > >> > >> Did you experiment with that? > > > > You mean on a system without any swap configured? That's actually my > > primary test environment for this. It behaves as expected: when ram > > fills up and the scanner is trying to reclaim what it can, > > folio_test_swapbacked(folio) is false, and the memory gets freed. After, > > reads fault in a zero page. So it's working as expected. > > Okay, just to be clear: no swap/zram/zswap. The reclaim code regarding > not scanning anonymous memory without swap was a bit confusing. Right, no swap, as boring a system as can be. I've experimented with that behavior on my swap-less 64GB thinkpad, as well as on little special purpose VMs, where I hacked the VM_DROPPABLE test code into the wireguard test suite. Jason
On 10.07.24 05:34, Jason A. Donenfeld wrote: > On Wed, Jul 10, 2024 at 05:05:06AM +0200, David Hildenbrand wrote: >> On 09.07.24 04:17, Jason A. Donenfeld wrote: >>> Hi David, >>> >>> On Mon, Jul 08, 2024 at 10:21:09PM +0200, David Hildenbrand wrote: >>>> BTW, I was just trying to understand how MADV_FREE + MAP_DROPPABLE would >>>> behave without any swap space around. >>>> >>>> Did you experiment with that? >>> >>> You mean on a system without any swap configured? That's actually my >>> primary test environment for this. It behaves as expected: when ram >>> fills up and the scanner is trying to reclaim what it can, >>> folio_test_swapbacked(folio) is false, and the memory gets freed. After, >>> reads fault in a zero page. So it's working as expected. >> >> Okay, just to be clear: no swap/zram/zswap. The reclaim code regarding >> not scanning anonymous memory without swap was a bit confusing. > > Right, no swap, as boring a system as can be. I've experimented with > that behavior on my swap-less 64GB thinkpad, as well as on little > special purpose VMs, where I hacked the VM_DROPPABLE test code into the > wireguard test suite. Great, thanks!
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h index 763929e814e9..951c54a45676 100644 --- a/arch/alpha/include/uapi/asm/mman.h +++ b/arch/alpha/include/uapi/asm/mman.h @@ -31,6 +31,9 @@ #define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */ #define MAP_HUGETLB 0x100000 /* create a huge page mapping */ #define MAP_FIXED_NOREPLACE 0x200000/* MAP_FIXED which doesn't unmap underlying mapping */ +#define MAP_WIPEONFORK 0x08000000 /* Zero memory in child forks. */ +#define MAP_DONTDUMP 0x10000000 /* Do not write to coredumps. */ +#define MAP_DROPPABLE 0x20000000 /* Zero memory under memory pressure. */ #define MS_ASYNC 1 /* sync memory asynchronously */ #define MS_SYNC 2 /* synchronous memory sync */ diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h index 9c48d9a21aa0..7490a28ec960 100644 --- a/arch/mips/include/uapi/asm/mman.h +++ b/arch/mips/include/uapi/asm/mman.h @@ -49,6 +49,9 @@ #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */ #define MAP_HUGETLB 0x80000 /* create a huge page mapping */ #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */ +#define MAP_WIPEONFORK 0x08000000 /* Zero memory in child forks. */ +#define MAP_DONTDUMP 0x10000000 /* Do not write to coredumps. */ +#define MAP_DROPPABLE 0x20000000 /* Zero memory under memory pressure. */ /* * Flags for msync diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index 68c44f99bc93..ed03f1d7d06c 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@ -26,6 +26,9 @@ #define MAP_HUGETLB 0x80000 /* create a huge page mapping */ #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */ #define MAP_UNINITIALIZED 0 /* uninitialized anonymous mmap */ +#define MAP_WIPEONFORK 0x08000000 /* Zero memory in child forks. */ +#define MAP_DONTDUMP 0x10000000 /* Do not write to coredumps. */ +#define MAP_DROPPABLE 0x20000000 /* Zero memory under memory pressure. */ #define MS_SYNC 1 /* synchronous memory sync */ #define MS_ASYNC 2 /* sync memory asynchronously */ diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h index 1ff0c858544f..2e777670b7fa 100644 --- a/arch/xtensa/include/uapi/asm/mman.h +++ b/arch/xtensa/include/uapi/asm/mman.h @@ -58,6 +58,9 @@ #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */ +#define MAP_WIPEONFORK 0x08000000 /* Zero memory in child forks. */ +#define MAP_DONTDUMP 0x10000000 /* Do not write to coredumps. */ +#define MAP_DROPPABLE 0x20000000 /* Zero memory under memory pressure. */ /* * Flags for msync diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 71e5039d940d..b3bd8432f869 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -709,6 +709,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) #endif #ifdef CONFIG_64BIT [ilog2(VM_SEALED)] = "sl", +#endif +#ifdef CONFIG_NEED_VM_DROPPABLE + [ilog2(VM_DROPPABLE)] = "dp", #endif }; size_t i; diff --git a/include/linux/mm.h b/include/linux/mm.h index eb7c96d24ac0..92454a0272ce 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -321,12 +321,14 @@ extern unsigned int kobjsize(const void *objp); #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_6 38 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) #define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5) +#define VM_HIGH_ARCH_6 BIT(VM_HIGH_ARCH_BIT_6) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ #ifdef CONFIG_ARCH_HAS_PKEYS @@ -357,6 +359,12 @@ extern unsigned int kobjsize(const void *objp); # define VM_SHADOW_STACK VM_NONE #endif +#ifdef CONFIG_NEED_VM_DROPPABLE +# define VM_DROPPABLE VM_HIGH_ARCH_6 +#else +# define VM_DROPPABLE VM_NONE +#endif + #if defined(CONFIG_X86) # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ #elif defined(CONFIG_PPC) diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index e46d6e82765e..fab7848df50a 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -165,6 +165,12 @@ IF_HAVE_PG_ARCH_X(arch_3) # define IF_HAVE_UFFD_MINOR(flag, name) #endif +#ifdef CONFIG_NEED_VM_DROPPABLE +# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name}, +#else +# define IF_HAVE_VM_DROPPABLE(flag, name) +#endif + #define __def_vmaflag_names \ {VM_READ, "read" }, \ {VM_WRITE, "write" }, \ @@ -197,6 +203,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \ {VM_MIXEDMAP, "mixedmap" }, \ {VM_HUGEPAGE, "hugepage" }, \ {VM_NOHUGEPAGE, "nohugepage" }, \ +IF_HAVE_VM_DROPPABLE(VM_DROPPABLE, "droppable" ) \ {VM_MERGEABLE, "mergeable" } \ #define show_vma_flags(flags) \ diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 6ce1f1ceb432..65a3069462a8 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -33,6 +33,10 @@ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */ +#define MAP_WIPEONFORK 0x08000000 /* Zero memory in child forks. */ +#define MAP_DONTDUMP 0x10000000 /* Do not write to coredumps. */ +#define MAP_DROPPABLE 0x20000000 /* Zero memory under memory pressure. */ + /* * Flags for mlock */ diff --git a/mm/Kconfig b/mm/Kconfig index b4cb45255a54..6cd65ea4b3ad 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1056,6 +1056,9 @@ config ARCH_USES_HIGH_VMA_FLAGS bool config ARCH_HAS_PKEYS bool +config NEED_VM_DROPPABLE + select ARCH_USES_HIGH_VMA_FLAGS + bool config ARCH_USES_PG_ARCH_X bool diff --git a/mm/mmap.c b/mm/mmap.c index 83b4682ec85c..e361f6750201 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1278,6 +1278,21 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; + if (flags & MAP_WIPEONFORK) { + /* MAP_WIPEONFORK is only supported on anonymous memory. */ + if (file || !(flags & MAP_PRIVATE)) + return -EINVAL; + vm_flags |= VM_WIPEONFORK; + } + if (flags & MAP_DONTDUMP) + vm_flags |= VM_DONTDUMP; + if (flags & MAP_DROPPABLE) { + /* MAP_DROPPABLE is only supported on anonymous memory. */ + if (file || !(flags & MAP_PRIVATE)) + return -EINVAL; + vm_flags |= VM_DROPPABLE; + } + /* Obtain the address to map to. we verify (or select) it and ensure * that it represents a valid section of the address space. */ diff --git a/mm/mprotect.c b/mm/mprotect.c index 8c6cd8825273..57b8dad9adcc 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -623,7 +623,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb, may_expand_vm(mm, oldflags, nrpages)) return -ENOMEM; if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB| - VM_SHARED|VM_NORESERVE))) { + VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) { charged = nrpages; if (security_vm_enough_memory_mm(mm, charged)) return -ENOMEM; diff --git a/mm/rmap.c b/mm/rmap.c index e8fc5ecb59b2..56d7535d5cf6 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1397,7 +1397,10 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); VM_BUG_ON_VMA(address < vma->vm_start || address + (nr << PAGE_SHIFT) > vma->vm_end, vma); - __folio_set_swapbacked(folio); + /* VM_DROPPABLE mappings don't swap; instead they're just dropped when + * under memory pressure. */ + if (!(vma->vm_flags & VM_DROPPABLE)) + __folio_set_swapbacked(folio); __folio_set_anon(folio, vma, address, true); if (likely(!folio_test_large(folio))) { @@ -1841,7 +1844,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * plus the rmap(s) (dropped by discard:). */ if (ref_count == 1 + map_count && - !folio_test_dirty(folio)) { + (!folio_test_dirty(folio) || + /* Unlike MADV_FREE mappings, VM_DROPPABLE + * ones can be dropped even if they've + * been dirtied. */ + (vma->vm_flags & VM_DROPPABLE))) { dec_mm_counter(mm, MM_ANONPAGES); goto discard; } @@ -1851,7 +1858,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * discarded. Remap the page to page table. */ set_pte_at(mm, address, pvmw.pte, pteval); - folio_set_swapbacked(folio); + /* Unlike MADV_FREE mappings, VM_DROPPABLE ones + * never get swap backed on failure to drop. */ + if (!(vma->vm_flags & VM_DROPPABLE)) + folio_set_swapbacked(folio); ret = false; page_vma_mapped_walk_done(&pvmw); break; diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h index 6ce1f1ceb432..65a3069462a8 100644 --- a/tools/include/uapi/asm-generic/mman-common.h +++ b/tools/include/uapi/asm-generic/mman-common.h @@ -33,6 +33,10 @@ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */ +#define MAP_WIPEONFORK 0x08000000 /* Zero memory in child forks. */ +#define MAP_DONTDUMP 0x10000000 /* Do not write to coredumps. */ +#define MAP_DROPPABLE 0x20000000 /* Zero memory under memory pressure. */ + /* * Flags for mlock */ diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 0b9ab987601c..a8beeb43c2b5 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -49,3 +49,4 @@ hugetlb_fault_after_madv hugetlb_madv_vs_map mseal_test seal_elf +droppable diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 3b49bc3d0a3b..e3e5740e13e1 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -73,6 +73,7 @@ TEST_GEN_FILES += ksm_functional_tests TEST_GEN_FILES += mdwe_test TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map +TEST_GEN_FILES += droppable ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty diff --git a/tools/testing/selftests/mm/droppable.c b/tools/testing/selftests/mm/droppable.c new file mode 100644 index 000000000000..846fb9aea4d1 --- /dev/null +++ b/tools/testing/selftests/mm/droppable.c @@ -0,0 +1,53 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2024 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. + */ + +#include <assert.h> +#include <stdbool.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <signal.h> +#include <sys/mman.h> +#include <linux/mman.h> + +#include "../kselftest.h" + +int main(int argc, char *argv[]) +{ + size_t alloc_size = 134217728; + size_t page_size = getpagesize(); + void *alloc; + pid_t child; + + ksft_print_header(); + ksft_set_plan(1); + + alloc = mmap(0, alloc_size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_DROPPABLE, -1, 0); + assert(alloc != MAP_FAILED); + memset(alloc, 'A', alloc_size); + for (size_t i = 0; i < alloc_size; i += page_size) + assert(*(uint8_t *)(alloc + i)); + + child = fork(); + assert(child >= 0); + if (!child) { + for (;;) + memset(malloc(page_size), 'A', page_size); + } + + for (bool done = false; !done;) { + for (size_t i = 0; i < alloc_size; i += page_size) { + if (!*(uint8_t *)(alloc + i)) { + done = true; + break; + } + } + } + kill(child, SIGTERM); + + ksft_test_result_pass("VM_DROPPABLE: PASS\n"); + exit(KSFT_PASS); +}
The vDSO getrandom() implementation works with a buffer allocated with a new system call that has certain requirements: - It shouldn't be written to core dumps. * Easy: VM_DONTDUMP. - It should be zeroed on fork. * Easy: VM_WIPEONFORK. - It shouldn't be written to swap. * Uh-oh: mlock is rlimited. * Uh-oh: mlock isn't inherited by forks. It turns out that the vDSO getrandom() function has three really nice characteristics that we can exploit to solve this problem: 1) Due to being wiped during fork(), the vDSO code is already robust to having the contents of the pages it reads zeroed out midway through the function's execution. 2) In the absolute worst case of whatever contingency we're coding for, we have the option to fallback to the getrandom() syscall, and everything is fine. 3) The buffers the function uses are only ever useful for a maximum of 60 seconds -- a sort of cache, rather than a long term allocation. These characteristics mean that we can introduce VM_DROPPABLE, which has the following semantics: a) It never is written out to swap. b) Under memory pressure, mm can just drop the pages (so that they're zero when read back again). c) It is inherited by fork. d) It doesn't count against the mlock budget, since nothing is locked. This is fairly simple to implement, with the one snag that we have to use 64-bit VM_* flags, but this shouldn't be a problem, since the only consumers will probably be 64-bit anyway. This way, allocations used by vDSO getrandom() can use: VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE And there will be no problem with using memory when not in use, not wiping on fork(), coredumps, or writing out to swap. In order to let vDSO getrandom() use this, expose these via mmap(2) as well, giving MAP_WIPEONFORK, MAP_DONTDUMP, and MAP_DROPPABLE. Finally, the provided self test ensures that this is working as desired. Cc: linux-mm@kvack.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- arch/alpha/include/uapi/asm/mman.h | 3 ++ arch/mips/include/uapi/asm/mman.h | 3 ++ arch/parisc/include/uapi/asm/mman.h | 3 ++ arch/xtensa/include/uapi/asm/mman.h | 3 ++ fs/proc/task_mmu.c | 3 ++ include/linux/mm.h | 8 +++ include/trace/events/mmflags.h | 7 +++ include/uapi/asm-generic/mman-common.h | 4 ++ mm/Kconfig | 3 ++ mm/mmap.c | 15 ++++++ mm/mprotect.c | 2 +- mm/rmap.c | 16 ++++-- tools/include/uapi/asm-generic/mman-common.h | 4 ++ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/droppable.c | 53 ++++++++++++++++++++ 16 files changed, 125 insertions(+), 4 deletions(-) create mode 100644 tools/testing/selftests/mm/droppable.c