Message ID | 20240528122352.2485958-2-Jason@zx2c4.com |
---|---|
State | New |
Series | implement getrandom() in vDSO |
On Tue, May 28, 2024 at 01:41:50PM -0700, Frank van der Linden wrote:
> On Tue, May 28, 2024 at 5:24 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > The vDSO getrandom() implementation works with a buffer allocated with a
> > new system call that has certain requirements:
> [...]
>
> This seems like an obvious question, but I can't seem to find a
> message asking this in the long history of this patchset: VM_DROPPABLE
> seems very close to MADV_FREE lazyfree memory.

Very different semantics and use case. For example, with MADV_FREE, if
you redirty the page by writing to it, the flag is cleared.

Jason
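To make the MADV_FREE contrast concrete, here is a minimal userspace sketch. It only shows the re-dirtying behavior mentioned above; the function name madv_free_redirty_demo() is made up for illustration, and MADV_FREE is assumed to be available (Linux 4.5 or later with matching libc headers). Under the patch, by contrast, try_to_unmap_one() may discard a VM_DROPPABLE page even when it is dirty.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MADV_FREE and MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustrative only. Returns 0 if the byte written after MADV_FREE reads
 * back intact, -1 if mmap() or madvise() fails or the byte is lost.
 */
int madv_free_redirty_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int ok;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, (size_t)page);	/* populate and dirty the page */

	/* From here on, the kernel may lazily free the page under pressure. */
	if (madvise(p, (size_t)page, MADV_FREE) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	/*
	 * Writing re-dirties the page, so it is no longer a lazy-free
	 * candidate. The rest of the page may hold the old 0xaa bytes or
	 * zeroes, depending on whether it was reclaimed before this write.
	 */
	p[0] = 0x55;
	ok = (p[0] == 0x55);

	munmap(p, (size_t)page);
	return ok ? 0 : -1;
}
```

Calling madv_free_redirty_demo() from a test program should return 0 on a kernel that supports MADV_FREE.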
On Tue, May 28, 2024 at 2:24 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> c) If there's not enough memory to service a page fault, it's not fatal.
[...]
> @@ -5689,6 +5689,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>
>         lru_gen_exit_fault();
>
> +       /* If the mapping is droppable, then errors due to OOM aren't fatal. */
> +       if (vma->vm_flags & VM_DROPPABLE)
> +               ret &= ~VM_FAULT_OOM;

Can you remind me how this is supposed to work? If we get an OOM
error, and the error is not fatal, does that mean we'll just keep
hitting the same fault handler over and over again (until we happen to
have memory available again I guess)? Or is there something in this
series that somehow redirects userspace execution to getrandom() in
that case?

> +
>         if (flags & FAULT_FLAG_USER) {
>                 mem_cgroup_exit_user_fault();
>                 /*
On Fri, May 31, 2024 at 12:48:58PM +0200, Jann Horn wrote:
> On Tue, May 28, 2024 at 2:24 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> > c) If there's not enough memory to service a page fault, it's not fatal.
> [...]
> > +       /* If the mapping is droppable, then errors due to OOM aren't fatal. */
> > +       if (vma->vm_flags & VM_DROPPABLE)
> > +               ret &= ~VM_FAULT_OOM;
>
> Can you remind me how this is supposed to work? If we get an OOM
> error, and the error is not fatal, does that mean we'll just keep
> hitting the same fault handler over and over again (until we happen to
> have memory available again I guess)?

Right, it'll just keep retrying. I agree this isn't great, which is why
in the 2023 patchset, I had additional code to simply skip the faulting
instruction, and then the userspace code would notice the inconsistency
and fall back to the syscall. This worked pretty well. But it meant
decoding the instruction, and in general skipping instructions is weird,
and that made this patchset very, very contentious. Since the skipping
behavior isn't actually required by the /security goals/ of this, I
figured I'd just drop that. And maybe we can all revisit it together
sometime down the line. But for now I'm hoping for something a little
easier to swallow.

Jason
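The "keep retrying" behavior can be modeled in plain userspace C. This is a toy sketch, not kernel code: TOY_VM_FAULT_OOM, TOY_VM_DROPPABLE and toy_touch_page() are made-up stand-ins, and the only thing taken from the patch is the masking step `ret &= ~VM_FAULT_OOM`. Once the OOM bit is stripped, the fault returns without an error and without a mapped page, so the faulting instruction simply runs again.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_VM_FAULT_OOM 0x1u                /* stand-in for VM_FAULT_OOM */
#define TOY_VM_DROPPABLE (UINT64_C(1) << 38) /* stand-in for VM_DROPPABLE */

/*
 * Simulate one memory access to an unmapped page.
 * failing_allocs: how many page allocations fail before one succeeds.
 * Returns the number of faults taken until the page is mapped,
 * or -1 if the fault is reported as a fatal OOM.
 */
int toy_touch_page(uint64_t vm_flags, int failing_allocs)
{
	int faults = 0;

	for (;;) {
		unsigned int ret = 0;
		int mapped = 0;

		faults++;
		if (failing_allocs > 0) {
			failing_allocs--;
			ret = TOY_VM_FAULT_OOM;	/* allocation failed */
		} else {
			mapped = 1;		/* page installed */
		}

		/* The step mirrored from the mm/memory.c hunk. */
		if (vm_flags & TOY_VM_DROPPABLE)
			ret &= ~TOY_VM_FAULT_OOM;

		if (ret & TOY_VM_FAULT_OOM)
			return -1;	/* ordinary mapping: OOM handling kicks in */
		if (mapped)
			return faults;
		/* No error, no page: the access is retried and faults again. */
	}
}
```

With three failing allocations, a droppable mapping takes four faults before it makes progress, while an ordinary mapping reports the OOM on the first fault.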
On Fri, May 31, 2024 at 2:13 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Fri, May 31, 2024 at 12:48:58PM +0200, Jann Horn wrote:
> [...]
> > Can you remind me how this is supposed to work? If we get an OOM
> > error, and the error is not fatal, does that mean we'll just keep
> > hitting the same fault handler over and over again (until we happen to
> > have memory available again I guess)?
>
> Right, it'll just keep retrying. I agree this isn't great, which is why
> in the 2023 patchset, I had additional code to simply skip the faulting
> instruction, and then the userspace code would notice the inconsistency
> and fallback to the syscall. This worked pretty well. But it meant
> decoding the instruction and in general skipping instructions is weird,
> and that made this patchset very very contentious. Since the skipping
> behavior isn't actually required by the /security goals/ of this, I
> figured I'd just drop that. And maybe we can all revisit it together
> sometime down the line. But for now I'm hoping for something a little
> easier to swallow.

In that case, since we need to be able to populate this memory to make
forward progress, would it make sense to remove the parts of the patch
that treat the allocation as if it was allowed to silently fail (the
"__GFP_NOWARN | __GFP_NORETRY" and the "ret &= ~VM_FAULT_OOM")? I
think that would also simplify this a bit by making this type of
memory a little less special.
On Fri, May 31, 2024 at 03:00:26PM +0200, Jann Horn wrote:
> [...]
> In that case, since we need to be able to populate this memory to make
> forward progress, would it make sense to remove the parts of the patch
> that treat the allocation as if it was allowed to silently fail (the
> "__GFP_NOWARN | __GFP_NORETRY" and the "ret &= ~VM_FAULT_OOM")? I
> think that would also simplify this a bit by making this type of
> memory a little less special.

The whole point, though, is that it needs to not fail or warn. It's
memory that can be dropped/zeroed at any moment, and the code is
deliberately robust to that.

Jason
On Fri, Jun 7, 2024 at 4:35 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> [...]
> The whole point, though, is that it needs to not fail or warn. It's
> memory that can be dropped/zeroed at any moment, and the code is
> deliberately robust to that.

Sure - but does it have to be more robust than accessing a newly
allocated piece of memory [which hasn't been populated with anonymous
pages yet] or bringing a swapped-out page back from swap?

I'm not an expert on OOM handling, but my understanding is that the
kernel tries _really_ hard to avoid failing low-order GFP_KERNEL
allocations, with the help of the OOM killer. My understanding is that
those allocations basically can't fail with a NULL return unless the
process has already been killed or it is in a memcg_kmem cgroup that
contains only processes that have been marked as exempt from OOM
killing. (Or if you're using error injection to explicitly tell the
kernel to fail the allocation.)
My understanding is that normal outcomes of an out-of-memory situation
are things like the OOM killer killing processes (including
potentially the calling one) to free up memory, or the OOM killer
panic()ing the whole system as a last resort; but getting a NULL
return from page_alloc(GFP_KERNEL) without getting killed is not one
of those outcomes.
On Fri, Jun 7, 2024 at 5:12 PM Jann Horn <jannh@google.com> wrote:
> [...]
> My understanding is that normal outcomes of an out-of-memory situation
> are things like the OOM killer killing processes (including
> potentially the calling one) to free up memory, or the OOM killer
> panic()ing the whole system as a last resort; but getting a NULL
> return from page_alloc(GFP_KERNEL) without getting killed is not one
> of those outcomes.

Or, from a different angle: You're trying to allocate memory, and you
can't make forward progress until that memory has been allocated
(unless the process is killed). That's what GFP_KERNEL is for. Stuff
like "__GFP_NOWARN | __GFP_NORETRY" is for when you have a backup plan
that lets you make progress (perhaps in a slightly less efficient way,
or by dropping some incoming data, or something like that), and it
hints to the page allocator that it doesn't have to try hard to
reclaim memory if it can't find free memory quickly.
On Tue, May 28, 2024 at 5:24 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> The vDSO getrandom() implementation works with a buffer allocated with a
> new system call that has certain requirements:
>
> - It shouldn't be written to core dumps.
>   * Easy: VM_DONTDUMP.

I'll bite: why shouldn't it be written to core dumps?

The implementation is supposed to be forward-secret: an attacker who
gets the state can't predict prior outputs. And a core-dumped process
is dead: there won't be future outputs.
On Fri 07-06-24 17:50:34, Jann Horn wrote:
[...]
> Or, from a different angle: You're trying to allocate memory, and you
> can't make forward progress until that memory has been allocated
> (unless the process is killed). That's what GFP_KERNEL is for. Stuff
> like "__GFP_NOWARN | __GFP_NORETRY" is for when you have a backup plan
> that lets you make progress (perhaps in a slightly less efficient way,
> or by dropping some incoming data, or something like that), and it
> hints to the page allocator that it doesn't have to try hard to
> reclaim memory if it can't find free memory quickly.

Correct. A pseudo-busy wait for allocation to succeed sounds like a very
bad idea to imprint into ABI. Is there really any design requirement
that these mappings never invoke the OOM killer?

Making the content droppable under memory pressure because it is
inherently recoverable is something else (this is essentially an
implicit MADV_FREE semantic) but putting a requirement on the memory
allocation on the fault sounds just wrong to me.
On Mon, Jun 10, 2024 at 02:00:21PM +0200, Michal Hocko wrote:
> On Fri 07-06-24 17:50:34, Jann Horn wrote:
> [...]
> > Or, from a different angle: You're trying to allocate memory, and you
> > can't make forward progress until that memory has been allocated
> > (unless the process is killed). That's what GFP_KERNEL is for. Stuff
> > like "__GFP_NOWARN | __GFP_NORETRY" is for when you have a backup plan
> > that lets you make progress (perhaps in a slightly less efficient way,
> > or by dropping some incoming data, or something like that), and it
> > hints to the page allocator that it doesn't have to try hard to
> > reclaim memory if it can't find free memory quickly.
>
> Correct. A pseudo-busy wait for allocation to succeed sounds like a very
> bad idea to imprint into ABI. Is there really any design requirement to
> make these mappings to never cause the OOM killer?
>
> Making the content droppable under memory pressure because it is
> inherently recoverable is something else (this is essentially an
> implicit MADV_FREE semantic) but putting a requirement on the memory
> allocation on the fault sounds just wrong to me.

The idea is that the getrandom() syscall won't get a process killed, so
neither should vgetrandom(). But there's an argument to be made that the
NOWARN|NORETRY logic only made sense with the now-dropped "skip
instruction on fault" patch that was so controversial before, since in
that case, there wouldn't be infinite retry, but rather skipping and then
falling back to the syscall. I think this is nicer behavior, but the
implementation caused a stir, so I'm not going that route at the moment.

Given that, I think I'll follow your advice and get rid of
NOWARN|NORETRY for this too. And then maybe we'll all revisit that
later.

Jason
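The "fall back to the syscall" idea can be sketched in userspace. This is a toy model under stated assumptions, not the vDSO implementation: struct toy_rng_state, toy_refill() and toy_getrandom() are hypothetical names, and the state layout is invented. The only real API used is getrandom() from <sys/random.h>. A state that reads back as all zeroes, as a wiped or dropped page would, is treated as unusable and the request goes to the syscall instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical state; NOT the layout used by the real vDSO code. */
struct toy_rng_state {
	unsigned char batch[64];	/* pre-generated random bytes */
	size_t pos;			/* next unread byte in batch */
	int valid;			/* reads as 0 after a wipe or drop */
};

/* Refill the batch from the kernel and mark the state usable. */
int toy_refill(struct toy_rng_state *st)
{
	if (getrandom(st->batch, sizeof(st->batch), 0) !=
	    (ssize_t)sizeof(st->batch))
		return -1;
	st->pos = 0;
	st->valid = 1;
	return 0;
}

/*
 * Returns 1 if served from the batch (fast path), 0 if served by the
 * getrandom() syscall (fallback), -1 on error.
 */
int toy_getrandom(struct toy_rng_state *st, void *buf, size_t len)
{
	if (st->valid && len <= sizeof(st->batch) - st->pos) {
		memcpy(buf, st->batch + st->pos, len);
		/* Erase consumed bytes so they are never handed out twice. */
		memset(st->batch + st->pos, 0, len);
		st->pos += len;
		return 1;
	}
	/* Zeroed or exhausted state: the syscall is always a valid answer. */
	return getrandom(buf, len, 0) == (ssize_t)len ? 0 : -1;
}
```

A caller would zero-initialize the state, call toy_refill() once, and then use toy_getrandom(); zeroing the state again at any point only costs a trip through the syscall path.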
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e5a5f015ff03..b5a59e57bde1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -706,6 +706,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 #ifdef CONFIG_X86_USER_SHADOW_STACK
 		[ilog2(VM_SHADOW_STACK)] = "ss",
+#endif
+#ifdef CONFIG_NEED_VM_DROPPABLE
+		[ilog2(VM_DROPPABLE)]	= "dp",
 #endif
 	};
 	size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..5978cb4cc21c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -321,12 +321,14 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_6	38	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
+#define VM_HIGH_ARCH_6	BIT(VM_HIGH_ARCH_BIT_6)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -357,6 +359,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_SHADOW_STACK	VM_NONE
 #endif

+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define VM_DROPPABLE		VM_HIGH_ARCH_6
+#else
+# define VM_DROPPABLE		VM_NONE
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index e46d6e82765e..fab7848df50a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -165,6 +165,12 @@ IF_HAVE_PG_ARCH_X(arch_3)
 # define IF_HAVE_UFFD_MINOR(flag, name)
 #endif

+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_DROPPABLE(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
 	{VM_READ,			"read"		},		\
 	{VM_WRITE,			"write"		},		\
@@ -197,6 +203,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
+IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,	"droppable"	)		\
 	{VM_MERGEABLE,			"mergeable"	}		\

 #define show_vma_flags(flags)						\
diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..6cd65ea4b3ad 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1056,6 +1056,9 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config NEED_VM_DROPPABLE
+	select ARCH_USES_HIGH_VMA_FLAGS
+	bool

 config ARCH_USES_PG_ARCH_X
 	bool
diff --git a/mm/memory.c b/mm/memory.c
index b5453b86ec4b..57b03fc73159 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5689,6 +5689,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,

 	lru_gen_exit_fault();

+	/* If the mapping is droppable, then errors due to OOM aren't fatal. */
+	if (vma->vm_flags & VM_DROPPABLE)
+		ret &= ~VM_FAULT_OOM;
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aec756ae5637..a66289f1d931 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2300,6 +2300,9 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct
 	pgoff_t ilx;
 	struct page *page;

+	if (vma->vm_flags & VM_DROPPABLE)
+		gfp |= __GFP_NOWARN | __GFP_NORETRY;
+
 	pol = get_vma_policy(vma, addr, order, &ilx);
 	page = alloc_pages_mpol_noprof(gfp | __GFP_COMP, order,
 				       pol, ilx, numa_node_id());
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94878c39ee32..88ff3ecc08a1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -622,7 +622,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 				may_expand_vm(mm, oldflags, nrpages))
 			return -ENOMEM;
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
-						VM_SHARED|VM_NORESERVE))) {
+						VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
 			charged = nrpages;
 			if (security_vm_enough_memory_mm(mm, charged))
 				return -ENOMEM;
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..d873a3f06506 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1397,7 +1397,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 	VM_BUG_ON_VMA(address < vma->vm_start ||
 			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
-	__folio_set_swapbacked(folio);
+	if (!(vma->vm_flags & VM_DROPPABLE))
+		__folio_set_swapbacked(folio);
 	__folio_set_anon(folio, vma, address, true);

 	if (likely(!folio_test_large(folio))) {
@@ -1841,7 +1842,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * plus the rmap(s) (dropped by discard:).
 				 */
 				if (ref_count == 1 + map_count &&
-				    !folio_test_dirty(folio)) {
+				    (!folio_test_dirty(folio) || (vma->vm_flags & VM_DROPPABLE))) {
 					dec_mm_counter(mm, MM_ANONPAGES);
 					goto discard;
 				}
@@ -1851,7 +1852,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * discarded. Remap the page to page table.
 				 */
 				set_pte_at(mm, address, pvmw.pte, pteval);
-				folio_set_swapbacked(folio);
+				if (!(vma->vm_flags & VM_DROPPABLE))
+					folio_set_swapbacked(folio);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
 				break;
The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:

- It shouldn't be written to core dumps.
  * Easy: VM_DONTDUMP.
- It should be zeroed on fork.
  * Easy: VM_WIPEONFORK.

- It shouldn't be written to swap.
  * Uh-oh: mlock is rlimited.
  * Uh-oh: mlock isn't inherited by forks.

- It shouldn't reserve actual memory, but it also shouldn't crash when
  page faulting in memory if none is available.
  * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
  * Uh-oh: VM_NORESERVE means segfaults.

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
   having the contents of the pages it reads zeroed out midway through
   the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
   we have the option to fall back to the getrandom() syscall, and
   everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
   60 seconds -- a sort of cache, rather than a long term allocation.

These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:

a) It never is written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
   zero when read back again).
c) If there's not enough memory to service a page fault, it's not fatal.
d) It is inherited by fork.
e) It doesn't count against the mlock budget, since nothing is locked.

This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.

This way, allocations used by vDSO getrandom() can use:

    VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problem with OOMing, crashing on overcommitment,
using memory when not in use, not wiping on fork(), coredumps, or
writing out to swap.

Cc: linux-mm@kvack.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 fs/proc/task_mmu.c             | 3 +++
 include/linux/mm.h             | 8 ++++++++
 include/trace/events/mmflags.h | 7 +++++++
 mm/Kconfig                     | 3 +++
 mm/memory.c                    | 4 ++++
 mm/mempolicy.c                 | 3 +++
 mm/mprotect.c                  | 2 +-
 mm/rmap.c                      | 8 +++++---
 8 files changed, 34 insertions(+), 4 deletions(-)
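The "64-bit VM_* flags" snag mentioned in the commit message comes down to simple bit arithmetic, which the following userspace sketch checks. The TOY_ names are illustrative stand-ins; only the bit number 38 is taken from the patch (VM_HIGH_ARCH_BIT_6). A flag at bit 38 cannot be represented in a 32-bit flags word, which is presumably why the new Kconfig symbol selects ARCH_USES_HIGH_VMA_FLAGS.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_VM_HIGH_ARCH_BIT_6	38	/* bit number used by the patch */
#define TOY_BIT64(n)		(UINT64_C(1) << (n))

/* The flag value when the flags word is 64 bits wide. */
uint64_t toy_vm_droppable(void)
{
	return TOY_BIT64(TOY_VM_HIGH_ARCH_BIT_6);
}

/* What survives if the same value is stored in a 32-bit flags word. */
uint32_t toy_vm_droppable_truncated(void)
{
	return (uint32_t)toy_vm_droppable();	/* high bits are discarded */
}
```

On a 32-bit flags word the truncated value is zero, so a test such as `vm_flags & VM_DROPPABLE` could never be true there.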