[v14,2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

Message ID: 20230101162910.710293-3-Jason@zx2c4.com
State: New
Series: implement getrandom() in vDSO

Commit Message

Jason A. Donenfeld Jan. 1, 2023, 4:29 p.m. UTC
The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:

- It shouldn't be written to core dumps.
  * Easy: VM_DONTDUMP.
- It should be zeroed on fork.
  * Easy: VM_WIPEONFORK.

- It shouldn't be written to swap.
  * Uh-oh: mlock is rlimited.
  * Uh-oh: mlock isn't inherited by forks.

- It shouldn't reserve actual memory, but it also shouldn't crash when
  page faulting in memory if none is available.
  * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
  * Uh-oh: VM_NORESERVE means segfaults.

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
   having the contents of the pages it reads zeroed out midway through
   the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
   we have the option to fall back to the getrandom() syscall, and
   everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
   60 seconds -- a sort of cache, rather than a long term allocation.

These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:

a) It is never written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
   zero when read back again).
c) If there's not enough memory to service a page fault, it's not fatal,
   and no signal is sent. Instead, writes are simply lost.
d) It is inherited by fork.
e) It doesn't count against the mlock budget, since nothing is locked.

This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.

This way, allocations used by vDSO getrandom() can use:

    VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problem with OOMing, crashing on overcommitment,
using memory when not in use, not wiping on fork(), coredumps, or
writing out to swap.

At the moment, rather than skipping writes on OOM, the fault handler
just returns to userspace, and the instruction is retried. This isn't
terrible, but it's not quite what is intended. The actual instruction
skipping has to be implemented arch-by-arch, but so does this whole
vDSO series, so that's fine. The following commit addresses it for x86.

Cc: linux-mm@kvack.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 fs/proc/task_mmu.c             | 3 +++
 include/linux/mm.h             | 8 ++++++++
 include/trace/events/mmflags.h | 7 +++++++
 mm/Kconfig                     | 3 +++
 mm/memory.c                    | 4 ++++
 mm/mempolicy.c                 | 3 +++
 mm/mprotect.c                  | 2 +-
 mm/rmap.c                      | 5 +++--
 8 files changed, 32 insertions(+), 3 deletions(-)

Comments

Jason A. Donenfeld Jan. 3, 2023, 3:01 p.m. UTC | #1
On Tue, Jan 03, 2023 at 11:50:43AM +0100, Ingo Molnar wrote:
> 
> * Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> 
> > The vDSO getrandom() implementation works with a buffer allocated with a
> > new system call that has certain requirements:
> > 
> > - It shouldn't be written to core dumps.
> >   * Easy: VM_DONTDUMP.
> > - It should be zeroed on fork.
> >   * Easy: VM_WIPEONFORK.
> > 
> > - It shouldn't be written to swap.
> >   * Uh-oh: mlock is rlimited.
> >   * Uh-oh: mlock isn't inherited by forks.
> > 
> > - It shouldn't reserve actual memory, but it also shouldn't crash when
> >   page faulting in memory if none is available
> >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> >   * Uh-oh: VM_NORESERVE means segfaults.
> > 
> > It turns out that the vDSO getrandom() function has three really nice
> > characteristics that we can exploit to solve this problem:
> > 
> > 1) Due to being wiped during fork(), the vDSO code is already robust to
> >    having the contents of the pages it reads zeroed out midway through
> >    the function's execution.
> > 
> > 2) In the absolute worst case of whatever contingency we're coding for,
> >    we have the option to fallback to the getrandom() syscall, and
> >    everything is fine.
> > 
> > 3) The buffers the function uses are only ever useful for a maximum of
> >    60 seconds -- a sort of cache, rather than a long term allocation.
> > 
> > These characteristics mean that we can introduce VM_DROPPABLE, which
> > has the following semantics:
> > 
> > a) It never is written out to swap.
> > b) Under memory pressure, mm can just drop the pages (so that they're
> >    zero when read back again).
> > c) If there's not enough memory to service a page fault, it's not fatal,
> >    and no signal is sent. Instead, writes are simply lost.
> > d) It is inherited by fork.
> > e) It doesn't count against the mlock budget, since nothing is locked.
> > 
> > This is fairly simple to implement, with the one snag that we have to
> > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > consumers will probably be 64-bit anyway.
> > 
> > This way, allocations used by vDSO getrandom() can use:
> > 
> >     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> > 
> > And there will be no problem with OOMing, crashing on overcommitment,
> > using memory when not in use, not wiping on fork(), coredumps, or
> > writing out to swap.
> > 
> > At the moment, rather than skipping writes on OOM, the fault handler
> > just returns to userspace, and the instruction is retried. This isn't
> > terrible, but it's not quite what is intended. The actual instruction
> > skipping has to be implemented arch-by-arch, but so does this whole
> > vDSO series, so that's fine. The following commit addresses it for x86.
> 
> Yeah, so VM_DROPPABLE adds a whole lot of complexity, corner cases, per 
> arch low level work and rarely tested functionality (seriously, whose 
> desktop system touches swap these days?), just so we can add a few pages of 
> per thread vDSO data of a quirky type that in 99.999% of cases won't ever 
> be 'dropped' from under the functionality that is using it and will thus 
> bitrot fast?

It sounds like you've misunderstood the issue.

Firstly, the arch work is +19 lines (in the vdso branch of random.git).
That's very small and basic. Don't misrepresent it just to make a point.

Secondly, and more importantly, swapping this data is *not* permissible.
It doesn't matter if you think that certain systems don't use swap
anyway (you're wrong though regardless -- desktops will swap things
pretty regularly). It simply must *not* be swapped under any
circumstances. It's not that swapping is simply undesirable; it's
absolutely catastrophic. Do you understand that distinction?

If so, then if you don't like this solution, perhaps you could mention
something that accomplishes the goals in the commit message.

> The maintainability baby is being thrown out with the bath water IMO ...

Really? Are you sure this isn't just a matter of you not having written
it yourself or something? The commit is pretty short and basic. All the
actual internals for this kind of thing are already present. The
commit just pipes it through.

> And we want to add more complexity to a facility people desperately want to 
> trust *more*? [RNG]

That seems like a ridiculous rhetorical leap.

> What's wrong with making mlock() more usable? Or just saying that "yeah, 
> the vDSO can allocate a few more permanent pages outside of existing 
> rlimits & mlock budgets"?

Did you actually read the commit message?

And now you're suggesting a rlimit backdoor? And adding more cases for
fork() to fail or block? That sounds terrible.

> The rest of the series looks fine to me, but this special one of a kind 
> VM_DROPPABLE is just way over-engineered cornercase functionality that 
> pushes us well past the maintenance_overhead<->complexity trade-off sweet spot ...

I think you should reevaluate that judgement. Consider the actual
problem. I think if you look at it carefully you'll see that this is a
pretty straightforward solution. It's deliberately not over-engineered.
It uses existing facilities and just does the plumbing in the right way
to make it work. It's really not that complicated.

The commit is +38, -4. The commit message has more lines than that.

Jason
Andy Lutomirski Jan. 3, 2023, 6:36 p.m. UTC | #2
On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> > The vDSO getrandom() implementation works with a buffer allocated with a
> > new system call that has certain requirements:
> >
> > - It shouldn't be written to core dumps.
> >   * Easy: VM_DONTDUMP.
> > - It should be zeroed on fork.
> >   * Easy: VM_WIPEONFORK.

I have a rather different suggestion: make a special mapping.  Jason,
you're trying to shoehorn all kinds of bizarre behavior into the core
mm, and none of that seems to me to belong to the core mm.  Instead,
have an actual special mapping with callbacks that does the right
thing.  No fancy VM flags.

Memory pressure: have it free and unmap itself.  Gets accessed again?
 ->fault can handle it.

Want to mlock it?  No, don't do that -- that's absurd.  Just arrange
so that, if it gets evicted, it's not written out anywhere.  And when
it gets faulted back in it does the right thing -- see above.

Zero on fork?  I'm sure that's manageable with a special mapping.  If
not, you can add a new vm operation or similar to make it work.  (Kind
of like how we extended special mappings to get mremap right a couple
years ago.)  But maybe you don't want to *zero* it on fork and you want
to do something more intelligent.  Fine -- you control ->fault!

> >
> > - It shouldn't be written to swap.
> >   * Uh-oh: mlock is rlimited.
> >   * Uh-oh: mlock isn't inherited by forks.

No mlock, no problems.

> >
> > - It shouldn't reserve actual memory, but it also shouldn't crash when
> >   page faulting in memory if none is available
> >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> >   * Uh-oh: VM_NORESERVE means segfaults.

->fault can do whatever you want.

And there is no shortage of user memory that *must* be made available
on fault in order to resume the faulting process.  ->fault can handle
this.

> >
> > It turns out that the vDSO getrandom() function has three really nice
> > characteristics that we can exploit to solve this problem:
> >
> > 1) Due to being wiped during fork(), the vDSO code is already robust to
> >    having the contents of the pages it reads zeroed out midway through
> >    the function's execution.
> >
> > 2) In the absolute worst case of whatever contingency we're coding for,
> >    we have the option to fallback to the getrandom() syscall, and
> >    everything is fine.
> >
> > 3) The buffers the function uses are only ever useful for a maximum of
> >    60 seconds -- a sort of cache, rather than a long term allocation.
> >
> > These characteristics mean that we can introduce VM_DROPPABLE, which
> > has the following semantics:

No need for another vm flag.

> >
> > a) It never is written out to swap.

No need to complicate the swap logic for this.

> > b) Under memory pressure, mm can just drop the pages (so that they're
> >    zero when read back again).

Or ->fault could even repopulate it without needing to ever read zeros.

> > c) If there's not enough memory to service a page fault, it's not fatal,
> >    and no signal is sent. Instead, writes are simply lost.

This just seems massively overcomplicated to me.  If there isn't
enough memory to fault in a page of code, we don't have some magic
instruction emulator in the kernel.  We either OOM or we wait for
memory to show up.

> > d) It is inherited by fork.

If you have a special mapping and you fork, it doesn't magically turn
into normal memory.

> > e) It doesn't count against the mlock budget, since nothing is locked.

Special mapping -> no mlock.

> >
> > This is fairly simple to implement, with the one snag that we have to
> > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > consumers will probably be 64-bit anyway.
> >
> > This way, allocations used by vDSO getrandom() can use:
> >
> >     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> >
> > And there will be no problem with OOMing, crashing on overcommitment,
> > using memory when not in use, not wiping on fork(), coredumps, or
> > writing out to swap.
> >
> > At the moment, rather than skipping writes on OOM, the fault handler
> > just returns to userspace, and the instruction is retried. This isn't
> > terrible, but it's not quite what is intended. The actual instruction
> > skipping has to be implemented arch-by-arch, but so does this whole
> > vDSO series, so that's fine. The following commit addresses it for x86.

I really dislike this.  I'm with Ingo.
Jason A. Donenfeld Jan. 3, 2023, 7:05 p.m. UTC | #3
Hi Andy,

Thanks for your constructive suggestions.

On Tue, Jan 03, 2023 at 10:36:01AM -0800, Andy Lutomirski wrote:
> > > c) If there's not enough memory to service a page fault, it's not fatal,
> > >    and no signal is sent. Instead, writes are simply lost.
> 
> This just seems massively overcomplicated to me.  If there isn't
> enough memory to fault in a page of code, we don't have some magic
> instruction emulator in the kernel.  We either OOM or we wait for
> memory to show up.

Before addressing the other parts of your email, I thought I'd touch on
this. Quoting from the email I just wrote Ingo:

| *However* - if your big objection to this patch is that the instruction
| skipping is problematic, we could actually punt that part. The result
| will be that userspace just retries the memory write and the fault
| happens again, and eventually it succeeds. From a perspective of
| vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll
| immediately fallback to the syscall under memory pressure -- but you
| could argue that nobody really cares about performance at that point
| anyway, and so just retrying the fault until it succeeds is a less
| complex behavior that would be just fine.
| 
| Let me know if you think that'd be an acceptable compromise, and I'll
| roll it into v15. As a preview, it pretty much amounts to dropping 3/7
| and editing the commit message in this 2/7 patch.

IOW, I think the main ideas of the patch work just fine without "point
c" with the instruction skipping. Instead, waiting/retrying could
potentially work. So, okay, it seems like the two of you both hate the
instruction decoder stuff, so I'll plan on working that part in, in one
way or another, for v15.

> On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <mingo@kernel.org> wrote:
> > > The vDSO getrandom() implementation works with a buffer allocated with a
> > > new system call that has certain requirements:
> > >
> > > - It shouldn't be written to core dumps.
> > >   * Easy: VM_DONTDUMP.
> > > - It should be zeroed on fork.
> > >   * Easy: VM_WIPEONFORK.
> 
> I have a rather different suggestion: make a special mapping.  Jason,
> you're trying to shoehorn all kinds of bizarre behavior into the core
> mm, and none of that seems to me to belong to the core mm.  Instead,
> have an actual special mapping with callbacks that does the right
> thing.  No fancy VM flags.

Oooo! I like this. Avoiding adding VM_* flags would indeed be nice.
I had seen things that I thought looked in this direction with the shmem
API, but when I got into the details, it looked like this was meant for
something else and couldn't address most of what I wanted here.

If you say this is possible, I'll look again to see if I can figure it
out. Though, if you have some API name at the top of your head, you
might save me some code squinting time.

> Memory pressure: have it free and unmap it self.  Gets accessed again?
>  ->fault can handle it.

Right.

> Want to mlock it?  No, don't do that -- that's absurd.  Just arrange
> so that, if it gets evicted, it's not written out anywhere.  And when
> it gets faulted back in it does the right thing -- see above.

Presumably mlock calls are redirected to some function pointer so I can
just return EINTR?

> Zero on fork?  I'm sure that's manageable with a special mapping.  If
> not, you can add a new vm operation or similar to make it work.  (Kind
> of like how we extended special mappings to get mremap right a couple
> years go.)  But maybe you don't want to *zero* it on fork and you want
> to do something more intelligent.  Fine -- you control ->fault!

Doing something more intelligent would be an interesting development, I
guess... But, before I think about that, all mappings have flags;
couldn't I *still* set VM_WIPEONFORK on the special mapping? Or does the
API you have in mind not work that way? (Side note: I also want
VM_DONTDUMP to work.)

> > > - It shouldn't reserve actual memory, but it also shouldn't crash when
> > >   page faulting in memory if none is available
> > >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> > >   * Uh-oh: VM_NORESERVE means segfaults.
> 
> ->fault can do whatever you want.
> 
> And there is no shortage of user memory that *must* be made available
> on fault in order to resume the faulting process.  ->fault can handle
> this.

I'll look to see how other things handle this...

Anyway, thanks for the suggestion. That seems like a good future
direction for this.

Jason
Linus Torvalds Jan. 3, 2023, 7:19 p.m. UTC | #4
On Tue, Jan 3, 2023 at 10:36 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> I have a rather different suggestion: make a special mapping.  Jason,
> you're trying to shoehorn all kinds of bizarre behavior into the core
> mm, and none of that seems to me to belong to the core mm.  Instead,
> have an actual special mapping with callbacks that does the right
> thing.  No fancy VM flags.

I don't disagree, but honestly, my deeper reaction is "this is not worth it".

I think the real issue here is that Jason has to prove that this is
such a big deal that the VM has to be modified *at*all* for this.

Honestly, "getrandom()" performance simply is not important enough to
design VM changes for.

Do some people care? Debatable. Jason cares, sure. But people have
gotten used to doing their own random number implementations if they
*really* care, and yes, they've gotten them wrong, or they've not
performed as well as they could, but on the whole this is still a
really tiny thing, and Jason is trying to micro-optimize something
that THE KERNEL SHOULD NOT CARE ABOUT.

This should all be in libc. Not in the kernel with special magic vdso
support and special buffer allocations. The kernel should give good
enough support that libc can do a good job, but the kernel should
simply *not* take the approach of "libc will get this wrong, so let's
just do all the work for it".

Let user space do their own percpu areas if they really care. And most
(as in 99.9%) of all user space won't care at all, and is perfectly
happy just doing a read() from /dev/urandom or using our existing
"getrandom()" without any super-clever vdso support with magic
temporary buffer handling.

Now, if the magic buffers were something cool that were a generic
concept that a lot of *other* cases would also kill for, that's one
thing. But this is such a small special case that absolutely *nobody*
has asked for, and that nothing else wants.

             Linus
Linus Torvalds Jan. 3, 2023, 7:54 p.m. UTC | #5
On Tue, Jan 3, 2023 at 11:35 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> I don't think this is about micro-optimization. Rather, userspace RNGs
> aren't really possible in a safe way at the moment.

"Bah, humbug", to quote a modern-time philosopher.

It's humbug simply because it makes two assumptions that aren't even valid:

 (a) that you have to do it in user space in the first place

 (b) that you care about the particular semantics that you are looking for

The thing is, you can just do getrandom(). It's what people already
do. Problem solved.

The system call interface just isn't that expensive. Sure, the various
spectre mitigations have screwed us over in a big way on various
hardware, but even with that in mind system calls aren't some kind of
insurmountable hazard. There are absolutely tons of system calls that
are way more performance-critical than getrandom() has ever been.

So 99% of the time, the solution really is just "getrandom()",
generally with the usual buffering ("read more than you need, so that
you don't do it all the time").

Which is not at all different from all the other system calls like
"read()" and friends.

And that's entirely ignoring the fact that something like "read()"
basically *has* to be a system call, because there are no alternatives
(ok, mmap, but that's actually much slower, unless you're in it for
the reuse-and/or-sparse-use situation).

With random numbers, you have tons of other alternatives, including
just hardware giving you the random numbers almost for free and you
just using your own rng in user space entirely.

And yes, some users might want to do things like periodic re-seeding,
but we actually do have fast ways to check for periodic events in user
space, ranging from entirely non-kernel things like rdtsc to actual
vdso's for gettimeofday().

So when you say that this isn't about micro-optimizations, I really
say "humbug". It's *clearly* about micro-optimization of an area that
very few people care about, since the alternative is just our existing
"getrandom()" that is not at all horribly slow.

Let me guess: the people you talked to who were excited about this are
mainly just library people?

And they are excited about this not because they actually need the
random numbers themselves, but because they just don't want to do the
work. They want to get nice benchmark numbers, and push the onus of
complexity onto somebody else?

Because the people who actually *use* the random numbers and are *so*
performance-critical about them already have their own per-thread
buffers or what-not, and need to have them anyway because they write
real applications that possibly work on other systems than Linux, but
that *definitely* work on enterprise systems that won't even have this
kind of new support for another several years even if we implemented
it today.

In fact, they probably are people inside of cloud vendors that are
still running 4.x kernels on a large portion of their machines.

          Linus
Linus Torvalds Jan. 3, 2023, 8:15 p.m. UTC | #6
On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> That buffering cannot be done safely currently

.. again, this is "your semantics" (the (b) in my humbug list), not
necessarily reality for anybody else.

I'm NAK'ing making invasive changes to the VM for something this
specialized. I really believe that the people who have this issue are
*so* few and far between that they can deal with the VM forking and
reseeding issues quite well on their own.

            Linus
Jason A. Donenfeld Jan. 3, 2023, 8:44 p.m. UTC | #7
On Tue, Jan 03, 2023 at 12:15:57PM -0800, Linus Torvalds wrote:
> On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > That buffering cannot be done safely currently
> 
> .. again, this is "your semantics" (the (b) in my humbug list), not
> necessarily reality for anybody else.

Yea that's fair. Except, of course, I maintain that my semantics are
important ones. :)

> I'm NAK'ing making invasive changes to the VM for something this
> specialized. I really believe that the people who have this issue are
> *so* few and far between that they can deal with the VM forking and
> reseeding issues quite well on their own.

Okay, that's fine. I'll see if I can make this work without having to do
surgery on mm and introduce a new VM_* flag and such. Hopefully I'll
succeed there.

Jason
Andy Lutomirski Jan. 6, 2023, 8:53 p.m. UTC | #8
On Thu, Jan 5, 2023, at 6:08 PM, Linus Torvalds wrote:
> On Thu, Jan 5, 2023 at 5:02 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> None of what you ask for is for any kind of real security, it's all
>> just crazy "but I want to feel the warm and fuzzies and take shortcuts
>> elsewhere, and push my pain onto other people".
>
> Actually, let me maybe soften that a bit and say that it's
> "convenience features". It might make some things more _convenient_ to
> do, exactly because it might allow other parts to do short-cuts.
>
> But because it's a convenience-feature, it had also better either be
> (a) really easy and clear to do in the kernel and (b) have
> sufficiently *wide* convenience so that it doesn't end up being one of
> those "corner case things we have to maintain forever and nobody
> uses".
>
> And I think VM_DROPPABLE matches (a), and would be fine if it had some
> other non-made-up use (although honestly, we should solve the 32-bit
> problem first - ignoring it isn't fine for anything that is supposed
> to be widely useful).
>
> We *have* talked about features kind of like it before, for people
> doing basically caches in user space that they can re-create on demand
> and are ok with just going away under memory pressure.
>
> But those people almost invariably want dropped pages to cause a
> SIGSEGV or SIGBUS, not to come back as zeroes.
>
> So you were insulting when you said kernel people don't care about
> security issues.  And I'm just telling you that's not true, but it
> *is* 100% true that kernel people are often really fed up with
> security people who have their blinders on, focus on some small thing,
> and think nothing else ever matters.
>
> So yes, the way to get something like VM_DROPPABLE accepted is to
> remove the blinders, and have it be something more widely useful, and
> not be a "for made up bad code".

I'm going to suggest a very very different approach: fix secret storage in memory for real.  That is, don't lock "super secret sensitive stuff" into memory, and don't wipe it either.  *Encrypt* it.

This boils down to implementing proper encrypted swap support, which is not conceptually that difficult.  The kernel already has identifiers (mapping, offset, etc) for every page in swap and already stores some metadata.  Using that as part of a cryptographic operation seems conceptually fairly straightforward.

And a nice implementation of this could be on by default, and the kernel could even tell userspace that it's on, and then userspace could just stop worrying about this particular issue.
Linus Torvalds Jan. 6, 2023, 9:10 p.m. UTC | #9
On Fri, Jan 6, 2023 at 12:54 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> I'm going to suggest a very very different approach: fix secret
> storage in memory for real. That is, don't lock "super secret
> sensitive stuff" into memory, and don't wipe it either. *Encrypt* it.

I don't think you're wrong, but people will complain about key
management, and worry about that part instead.

Honestly, this is what SGX and CPU enclaves is _supposed_ to all do
for you, but then nobody uses it for various reasons.

               Linus
Jason A. Donenfeld Jan. 6, 2023, 9:36 p.m. UTC | #10
On Fri, Jan 06, 2023 at 12:53:41PM -0800, Andy Lutomirski wrote:
> 
> 
> On Thu, Jan 5, 2023, at 6:08 PM, Linus Torvalds wrote:
> > On Thu, Jan 5, 2023 at 5:02 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> None of what you ask for is for any kind of real security, it's all
> >> just crazy "but I want to feel the warm and fuzzies and take shortcuts
> >> elsewhere, and push my pain onto other people".
> >
> > Actually, let me maybe soften that a bit and say that it's
> > "convenience features". It might make some things more _convenient_ to
> > do, exactly because it might allow other parts to do short-cuts.
> >
> > But because it's a convenience-feature, it had also better either be
> > (a) really easy and clear to do in the kernel and (b) have
> > sufficiently *wide* convenience so that it doesn't end up being one of
> > those "corner case things we have to maintain forever and nobody
> > uses".
> >
> > And I think VM_DROPPABLE matches (a), and would be fine if it had some
> > other non-made-up use (although honestly, we should solve the 32-bit
> > problem first - ignoring it isn't fine for anything that is supposed
> > to be widely useful).
> >
> > We *have* talked about features kind of like it before, for people
> > doing basically caches in user space that they can re-create on demand
> > and are ok with just going away under memory pressure.
> >
> > But those people almost invariably want dropped pages to cause a
> > SIGSEGV or SIGBUS, not to come back as zeroes.
> >
> > So you were insulting when you said kernel people don't care about
> > security issues.  And I'm just telling you that's not true, but it
> > *is* 100% true that kernel people are often really fed up with
> > security people who have their blinders on, focus on some small thing,
> > and think nothing else ever matters.
> >
> > So yes, the way to get something like VM_DROPPABLE accepted is to
> > remove the blinders, and have it be something more widely useful, and
> > not be a "for made up bad code".
> 
> I'm going to suggest a very very different approach: fix secret storage in memory for real.  That is, don't lock "super secret sensitive stuff" into memory, and don't wipe it either.  *Encrypt* it.
> 
> This boils down to implementing proper encrypted swap support, which is not conceptually that difficult.  The kernel already has identifiers (mapping, offset, etc) for every page in swap and already stores some metadata.  Using that as part of a cryptographic operation seems conceptually fairly straightforward.

Not sure this solves the right problem, which is primarily related to
forward secrecy, which means the important property is timely secret
erasure. Writing things out to disk complicates that, and encrypted swap
means moving the problem into key lifetime, which sounds like a can of
worms to synchronize. So this doesn't sound so appealing to me as a
solution.

It does sound like a potentially nice thing for other uses, though.

Jason
Matthew Wilcox Jan. 6, 2023, 9:42 p.m. UTC | #11
On Thu, Jan 05, 2023 at 06:08:28PM -0800, Linus Torvalds wrote:
> Side note: making the 32-bit issue go away is likely trivial. We can
> make 'vm_flags' be 64-bit, and a patch for that has been floating
> around for over a decade:
> 
>    https://lore.kernel.org/all/20110412151116.B50D.A69D9226@jp.fujitsu.com/
> 
> but there was enough push-back on that patch that I didn't want to
> take it, and some of the arguments for it were not that convincing (at
> the time).
> 
> But see commit ca16d140af91 ("mm: don't access vm_flags as 'int'"),
> which happened as a result, and which I (obviously very naively)
> believed would be a way to get the conversion to happen in a more
> controlled manner. Sadly, it never actually took off, and we have very
> few "vm_flags_t" users in the kernel, and a lot of "unsigned long
> flags". We even started out with a "__nocast" annotation to try to
> make sparse trigger on people who didn't use vm_flags_t properly. That
> was removed due to it just never happening.
> 
> But converting things to vm_flags_t with a coccinelle script
> (hand-wave: look for variables of "unsigned long" that use the
> VM_xyz constants), and then just making vm_flags_t be a "u64" instead
> sounds like a way forward.

I'd be more inclined to do:

typedef unsigned int vm_flags_t[2];

and deal with all the fallout.  That'll find all the problems (although
leave us vulnerable to people forgetting which half of the flags they
want to be looking at).

Hm.  We never actually converted vma->vm_flags to be vm_flags_t.  Only
vm_region, which, as I understand it, is only used on nommu.
Linus Torvalds Jan. 6, 2023, 10:06 p.m. UTC | #12
On Fri, Jan 6, 2023 at 1:42 PM Matthew Wilcox <willy@infradead.org> wrote:
>
>
> I'd be more inclined to do:
>
> typedef unsigned int vm_flags_t[2];

No, that's entirely invalid.

Never *ever* use arrays in C for type safety. Arrays are not type
safe. They can't be assigned sanely, and they silently become pointers
(which also aren't type-safe, since they end up converting silently to
'void *').

If you want to use the type system to enforce things, and you don't
want to rely on sparse, you absolutely have to use a struct (or union)
type.

So something like

   typedef struct { u64 val; } vm_flags_t;

would be an option.

              Linus
Florian Weimer Jan. 9, 2023, 10:34 a.m. UTC | #13
* Linus Torvalds:

> On Tue, Jan 3, 2023 at 11:35 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>
>> I don't think this is about micro-optimization. Rather, userspace RNGs
>> aren't really possible in a safe way at the moment.
>
> "Bah, humbug", to quote a modern-time philosopher.
>
> It's humbug simply because it makes two assumptions that aren't even valid:
>
>  (a) that you have to do it in user space in the first place
>
>  (b) that you care about the particular semantics that you are looking for
>
> The thing is, you can just do getrandom(). It's what people already
> do. Problem solved.

We are currently doing this in glibc for our arc4random implementation,
after Jason opposed userspace buffering.  If chrony is recompiled
against the glibc version of arc4random (instead of its OpenBSD compat
version, which uses userspace buffering), the result is a 25% drop in
NTP packet response rate:

| The new arc4random using getrandom() seems to have a significant
| impact on performance of chronyd operating as an NTP server. On an
| Intel E3-1220 CPU, I see that the maximum number of requests per
| second dropped by about 25%. That would be an issue for some public
| NTP servers.

arc4random is too slow
<https://sourceware.org/bugzilla/show_bug.cgi?id=29437>

This is *not* “arc4random is 25% slower”; it is the measured overall
impact on server performance.

Historically, the solution spaces for getrandom and arc4random are
slightly different.  The backronym is “A Replacement Call For random”,
i.e., you should be able to use arc4random without worrying about
performance.  I don't expect cryptographic libraries to turn to
arc4random to implement their random number generators, and I expect
that programmers who use low-level OpenSSL primitives (for example) will
keep calling RAND_bytes instead of arc4random because it is available to
them.

We did these changes on the glibc side because Jason sounded very
confident that he's able to deliver vDSO acceleration for getrandom.  If
that fails to materialize, we'll just have to add back userspace
buffering in glibc.  At least we can get process fork protection via
MADV_WIPEONFORK, solving a real problem with the usual arc4random compat
implementation.  (The OpenBSD mechanism for this is slightly different.)
We won't get VM fork protection or forward secrecy against ptrace.  But
the latter is rather speculative anyway because if you can do ptrace
once, you can likely do ptrace twice, the first time patching the
process to remove forward secrecy.  There is a real gap for VM forks,
but I'm not sure how much that matters in practice.  Live migration has
to be implemented in such a way that this isn't observable (otherwise
TCP connections etc. would break), and long-term keys probably shouldn't
be generated under virtualization anyway.

Thanks,
Florian
Linus Torvalds Jan. 9, 2023, 2:28 p.m. UTC | #14
On Mon, Jan 9, 2023 at 4:34 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> We did these changes on the glibc side because Jason sounded very
> confident that he's able to deliver vDSO acceleration for getrandom.  If
> that fails to materialize, we'll just have to add back userspace
> buffering in glibc.

My whole argument has been that user-space buffering is the sane thing
to do. Most definitely for something like glibc.

The number of people who go "oh, no, my buffer or randomness could be
exposed by insert-odd-situation-here" is approximately zero, and then
the onus should be on *them* to do something special.

Because *they* are special. Precious little snowflake special.

             Linus
Dr. Greg Jan. 10, 2023, 11:01 a.m. UTC | #15
On Fri, Jan 06, 2023 at 01:10:44PM -0800, Linus Torvalds wrote:

Good morning, I hope the week is going well for everyone.

> On Fri, Jan 6, 2023 at 12:54 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I'm going to suggest a very very different approach: fix secret
> > storage in memory for real. That is, don't lock "super secret
> > sensitive stuff" into memory, and don't wipe it either. *Encrypt* it.
> 
> I don't think you're wrong, but people will complain about key
> management, and worry about that part instead.
> 
> Honestly, this is what SGX and CPU enclaves are _supposed_ to do for
> you, but then nobody uses them for various reasons.

The principal problem is that enclave technology was not made either
ubiquitous or accessible, long story there, suitable for multiple
snifters of single malt.

Unfortunately, the same goes for just about every other hardware
security technology.  Every conversation comes down to: "what is the
business case for the technology?", which, translated, means: how much
money are we going to make off it.

Encrypting memory-based secrets, as an alternative to wiping them, is
attractive, but hardware support is needed to do key management
securely and correctly.  Even then, by definition, there will be a
window when the material needs to be in memory as plaintext.

A discussion can be had in this arena about perfection being the enemy
of good.  If you are truly interested in perfection in this endeavor,
you need to have a trusted platform definition and implementation.

Which, if history is any indication, needs to be an open architecture
with respect to both software and hardware.

>                Linus

Best wishes to everyone for a productive remainder of the week.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
Eric Biggers Jan. 11, 2023, 7:27 a.m. UTC | #16
On Mon, Jan 09, 2023 at 08:28:58AM -0600, Linus Torvalds wrote:
> On Mon, Jan 9, 2023 at 4:34 AM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > We did these changes on the glibc side because Jason sounded very
> > confident that he's able to deliver vDSO acceleration for getrandom.  If
> > that fails to materialize, we'll just have to add back userspace
> > buffering in glibc.
> 
> My whole argument has been that user-space buffering is the sane thing
> to do. Most definitely for something like glibc.
> 
> The number of people who go "oh, no, my buffer or randomness could be
> exposed by insert-odd-situation-here" is approximately zero, and then
> the onus should be on *them* to do something special.
> 
> Because *they* are special. Precious little snowflake special.
> 
>              Linus

How would userspace decide when to reseed its CRNGs, then?

IMO, the main benefit of the VDSO getrandom over a traditional userspace CRNG is
that it makes reseeds of the kernel's CRNG take effect immediately.  See the
cover letter, where Jason explains this.

It's definitely important to make the memory used by userspace CRNGs have
appropriate semantics, but my understanding is that's not the main point.

- Eric
Linus Torvalds Jan. 11, 2023, 12:07 p.m. UTC | #17
On Wed, Jan 11, 2023 at 1:27 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> How would userspace decide when to reseed its CRNGs, then?

.. and that is relevant to all the VM discussions exactly why?

                Linus
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..47c7c046f2be 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@  static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_NEED_VM_DROPPABLE
+		[ilog2(VM_DROPPABLE)]	= "dp",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..fba3f1e8616b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,11 +315,13 @@  extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -335,6 +337,12 @@  extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define VM_DROPPABLE VM_HIGH_ARCH_5
+#else
+# define VM_DROPPABLE 0
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..82b2fb811d06 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -163,6 +163,12 @@  IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 # define IF_HAVE_UFFD_MINOR(flag, name)
 #endif
 
+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_DROPPABLE(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
 	{VM_READ,			"read"		},		\
 	{VM_WRITE,			"write"		},		\
@@ -195,6 +201,7 @@  IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
+IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,	"droppable"	)		\
 	{VM_MERGEABLE,			"mergeable"	}		\
 
 #define show_vma_flags(flags)						\
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..91fd0be96ca4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1030,6 +1030,9 @@  config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config NEED_VM_DROPPABLE
+	select ARCH_USES_HIGH_VMA_FLAGS
+	bool
 
 config ARCH_USES_PG_ARCH_X
 	bool
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..1ade407ccbf9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5220,6 +5220,10 @@  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 
 	lru_gen_exit_fault();
 
+	/* If the mapping is droppable, then errors due to OOM aren't fatal. */
+	if (vma->vm_flags & VM_DROPPABLE)
+		ret &= ~VM_FAULT_OOM;
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..ebf2e3694a0a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2173,6 +2173,9 @@  struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 	int preferred_nid;
 	nodemask_t *nmask;
 
+	if (vma->vm_flags & VM_DROPPABLE)
+		gfp |= __GFP_NOWARN | __GFP_NORETRY;
+
 	pol = get_vma_policy(vma, addr);
 
 	if (pol->mode == MPOL_INTERLEAVE) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 908df12caa26..a679cc5d1c75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -593,7 +593,7 @@  mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				may_expand_vm(mm, oldflags, nrpages))
 			return -ENOMEM;
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
-						VM_SHARED|VM_NORESERVE))) {
+				  VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
 			charged = nrpages;
 			if (security_vm_enough_memory_mm(mm, charged))
 				return -ENOMEM;
diff --git a/mm/rmap.c b/mm/rmap.c
index b616870a09be..5ed46e59dfcd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1294,7 +1294,8 @@  void page_add_new_anon_rmap(struct page *page,
 	int nr;
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-	__SetPageSwapBacked(page);
+	if (!(vma->vm_flags & VM_DROPPABLE))
+		__SetPageSwapBacked(page);
 
 	if (likely(!PageCompound(page))) {
 		/* increment count (starts at -1) */
@@ -1683,7 +1684,7 @@  static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * plus the rmap(s) (dropped by discard:).
 				 */
 				if (ref_count == 1 + map_count &&
-				    !folio_test_dirty(folio)) {
+				    (!folio_test_dirty(folio) || (vma->vm_flags & VM_DROPPABLE))) {
 					/* Invalidate as we cleared the pte */
 					mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);