[1/2] sysctl: read() must consume poll events, not poll()

Message ID	20220502140602.130373-1-Jason@zx2c4.com
State	New
Headers	show Return-Path: <linux-crypto-owner@kernel.org> From: "Jason A. Donenfeld" <Jason@zx2c4.com> To: linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>, Jann Horn <jannh@google.com>, Kees Cook <keescook@chromium.org>, Luis Chamberlain <mcgrof@kernel.org>, linux-fsdevel@vger.kernel.org Subject: [PATCH 1/2] sysctl: read() must consume poll events, not poll() Date: Mon, 2 May 2022 16:06:01 +0200 Message-Id: <20220502140602.130373-1-Jason@zx2c4.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[1/2] sysctl: read() must consume poll events, not poll() \| expand [1/2] sysctl: read() must consume poll events, not poll()

Jason A. Donenfeld May 2, 2022, 2:06 p.m. UTC

Events that poll() responds to are supposed to be consumed when the file
is read(), not by the poll() itself. By putting it on the poll() itself,
it makes it impossible to poll() on a epoll file descriptor, since the
event gets consumed too early. Jann wrote a PoC, available in the link
below.

Reported-by: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 fs/proc/proc_sysctl.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

Lennart Poettering May 3, 2022, 7:42 a.m. UTC | #1

On Mo, 02.05.22 20:04, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> > I can just tell you, that in systemd we'd have a usecase for consuming
> > such a generation counter: we try to provide stable MAC addresses for
> > synthetic network interfaces managed by networkd, so we hash them from
> > /etc/machine-id, but otoh people also want them to change when they
> > clone their VMs. We could very nicely solve this if we had a
> > generation counter easily accessible from userspace, that starts at 0
> > initially. Because then we can hash as we always did when the counter
> > is zero, but otherwise use something else, possibly hashed from the
> > generation counter.
>
> This doesn't work, because you could have memory-A split into memory-A.1
> and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
> the same new value "2".

Yes, that's why I as vague about what to switch to if the counter is
non-zero, i.e. "something else, *possibly* hashed…".

For this MAC address usecase it's entirely sufficient to be able to
distinguish if the system was closed at all, i.e. if the counter is
zero or is non-zero. Because that would already be great for a policy
of "hash it in a stable way from /etc/machine-id, if counter == 0" +
"use random MAC once counter > 0".

Such a MAC address policy I think should probably even be the new
default in networkd, if we could implement it. For that we'd need a
single bit of info from the kernel, indicating whether the sysem was
cloned at all. i.e. if the vmgenid uuid is different from the one the
system booted up first.

Lennart

--
Lennart Poettering, Berlin

Jason A. Donenfeld May 3, 2022, 9:08 a.m. UTC | #2

Hey Lennart,

On Tue, May 03, 2022 at 09:42:40AM +0200, Lennart Poettering wrote:
> For this MAC address usecase it's entirely sufficient to be able to
> distinguish if the system was closed at all, i.e. if the counter is
> zero or is non-zero. Because that would already be great for a policy
> of "hash it in a stable way from /etc/machine-id, if counter == 0" +
> "use random MAC once counter > 0".

Hm, are you sure that's actually what you want? It turns out this
vmgenid notification from the hypervisor might not be sufficiently
granular for this use case:

- vmgenid changes when you fork a new snapshot, so now you have two VMs
- vmgenid also changes when you rewind to 2 minutes ago

The first is what I assume you care about for this networkd business.
The second is probably not what any networkd user expects.

[Aside: I hope there are few networkd users; having seen what Yu did
with wireguard and how fast and recklessly that went, I can't recommend
that part of systemd to anyone.]

Jason A. Donenfeld May 3, 2022, 10:07 a.m. UTC | #3

On Tue, May 03, 2022 at 11:32:26AM +0200, Lennart Poettering wrote:
> So first of all, it appears to me that rewinding a VM is something people
> would do for debugging things, i.e. not how things are done on
> deployment.

I wouldn't be so sure here... Some people have processes around, "always
start out from the same place", like for build machines, and employ a
single VM snapshot that's always rewound after use. It's never forked
into multiple snapshots, but just always goes back to that same starting
point.

> > >From the perspective of randomness, both of these events imply the same
> > thing. The situation is BAD; reseed immediately. From the perspective of
> > MAC addresses, though, these events would imply different behavior,
> > right? So it seems like vmgenid might need an additional field for this
> > use case. Relatedly, VMware has that prompt where you select about your
> > VM whether, "I moved it" or "I copied it." Presumably something like
> > that would play a part in what is decided as part of this hypothetical
> > second field.
> 
> networkd doesn't change MAC addresses in the middle of everything, but
> only when a network interface is downed and upped again. This for
> example happens when a link beat goes away and comes back. In the
> rewind-2min case i'd assume the link beat would probably be restored
> to what it was 2min ago (and thus stay online), but in the clone case
> it would probably drop momentarily and be restored than, to tell
> software to reacquire dhcp and so on.

That sounds like it's going to be sort of confusing. Let's say we've got
some VM scenario in which rewinds are common due to whatever weird
process (such as a build machine that wants to start out at the same
place each time). During its course of execution, it reboots, or maybe
there's some network connectivity issue and the link goes down. In that
case, when the link comes up, it's going to have a different MAC
address? That doesn't make much sense to me, but maybe I'm missing some
bigger picture detail.

Jason

Jason A. Donenfeld May 3, 2022, 11:55 a.m. UTC | #4

On Tue, May 03, 2022 at 10:29:49AM +0200, Lennart Poettering wrote:
> As mentioned earlier, I am not convinced sysctl is the right place for
> this. sysctls are understood by most people as being the place for
> tweaking kernel settings. This is not a kernel setting, but a
> notification concept, and the way Jason defined it there's nothing to
> read nor write, which strongly suggests to move it elsewhere, but not
> /proc/sys/.

I think I'm coming around to this view that having a sysctl return
-ENODATA is weird. It makes `sysctl -a` always complain to stderr, for
example, which seems bad.

> > I can see attractiveness in providing the /run/fork-id directly from the
> > kernel though, to remove the dependency on systemd for poll-less
> > notification of libraries.
> 
> I agree.

I'm still not convinced there's value in having a counter or a UUID, but
if you had to choose, would you prefer a counter or a UUID? It sounds
like the former, because you see a use for distinguishing between zero
and non-zero? Or did you finally agree with me that vmgenid isn't
granular enough for that?

Jason

Lennart Poettering May 3, 2022, 12:33 p.m. UTC | #5

On Di, 03.05.22 13:55, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> I'm still not convinced there's value in having a counter or a UUID, but
> if you had to choose, would you prefer a counter or a UUID? It sounds
> like the former, because you see a use for distinguishing between zero
> and non-zero? Or did you finally agree with me that vmgenid isn't
> granular enough for that?

I would prefer a monotonic counter, since it allows answering
questions like the following:

1. Did this image get cloned at all? (i.e. counter != 0; usecase as per the MAC address
   discussion)

2. Did the image get cloned since the last time I looked? (i.e. counter
   != my_previously_saved_counter; usecase: detect clones in an "offline"
   fashion, i.e. from a component that doesn't continously run, but
   only from time to time)

3. How many clones did I miss? (i.e. missed_clones =
   my_previously_saved_counter - counter; usecase: catch up with
   generating proxy D-Bus signal messages for clones).

There might be more.

Using a UUID would not give us #1 or #3. It would deliver #2 however.

Lennart

--
Lennart Poettering, Berlin

Simo Sorce May 11, 2022, 12:40 a.m. UTC | #6

On Mon, 2022-05-02 at 16:06 +0200, Jason A. Donenfeld wrote:
> In order to inform userspace of virtual machine forks, this commit adds
> a "fork_event" sysctl, which does not return any data, but allows
> userspace processes to poll() on it for notification of VM forks.
> 
> It avoids exposing the actual vmgenid from the hypervisor to userspace,
> in case there is any randomness value in keeping it secret. Rather,
> userspace is expected to simply use getrandom() if it wants a fresh
> value.
> 
> For example, the following snippet can be used to print a message every
> time a VM forks, after the RNG has been reseeded:
> 
>   struct pollfd fd = { .fd = open("/proc/sys/kernel/random/fork_event", O_RDONLY)  };
>   assert(fd.fd >= 0);
>   for (;;) {
>     read(fd.fd, NULL, 0);
>     assert(poll(&fd, 1, -1) > 0);
>     puts("vm fork detected");
>   }
> 
> Various programs and libraries that utilize cryptographic operations
> depending on fresh randomness can invalidate old keys or take other
> appropriate actions when receiving that event. While this is racier than
> allowing userspace to mmap/vDSO the vmgenid itself, it's an incremental
> step forward that's not as heavyweight.

At your request teleporting here the answer I gave on a different
thread, reinforced by some thinking.

As a user space crypto library person I think the only reasonable
interface is something like a vDSO.

Poll() interfaces are nice and all for system programs that have full
control of their event loop and do not have to react immediately to
this event, however crypto libraries do not have the luxury of
controlling the main loop of the application.

Additionally crypto libraries really need to ensure the value they
return from their PRNG is fine, which means they do not return a value
if the vmgenid has changed before they can reseed, or there could be
catastrophic duplication of "random" values used in IVs or ECDSA
Signatures or ids/cookies or whatever.

For crypto libraries it is much simpler to poll for this information 
than using notifications of any kind given libraries are
generally not in full control of what the process does.

This needs to be polled fast as well, because the whole point of
initializing a PRNG in the library is that asking /dev/urandom all the
time is too slow (due to context switches and syscall overhead), so
anything that would require a context switch in order to pull data from
the PRNG would not really fly.

A vDSO or similar would allow to pull the vmgenid or whatever epoch
value in before generating the random numbers and then barrier-style
check that the value is still unchanged before returning the random
data to the caller. This will reduce the race condition (which simply
cannot be completely avoided) to a very unlikely event.

HTH,
Simo.

Jason A. Donenfeld May 11, 2022, 1:18 a.m. UTC | #7

Hi Simo,

On Tue, May 10, 2022 at 08:40:48PM -0400, Simo Sorce wrote:
> At your request teleporting here the answer I gave on a different
> thread, reinforced by some thinking.
> 
> As a user space crypto library person I think the only reasonable
> interface is something like a vDSO.
> 
> Poll() interfaces are nice and all for system programs that have full
> control of their event loop and do not have to react immediately to
> this event, however crypto libraries do not have the luxury of
> controlling the main loop of the application.
> 
> Additionally crypto libraries really need to ensure the value they
> return from their PRNG is fine, which means they do not return a value
> if the vmgenid has changed before they can reseed, or there could be
> catastrophic duplication of "random" values used in IVs or ECDSA
> Signatures or ids/cookies or whatever.
> 
> For crypto libraries it is much simpler to poll for this information 
> than using notifications of any kind given libraries are
> generally not in full control of what the process does.
> 
> This needs to be polled fast as well, because the whole point of
> initializing a PRNG in the library is that asking /dev/urandom all the
> time is too slow (due to context switches and syscall overhead), so
> anything that would require a context switch in order to pull data from
> the PRNG would not really fly.
> 
> A vDSO or similar would allow to pull the vmgenid or whatever epoch
> value in before generating the random numbers and then barrier-style
> check that the value is still unchanged before returning the random
> data to the caller. This will reduce the race condition (which simply
> cannot be completely avoided) to a very unlikely event.

It sounds like your library issue is somewhat similar to what Alex was
talking about with regards to having a hard time using poll in s2n. I'm
still waiting to hear if Amazon figured out some way that this is
possible (with, e.g., a thread). But anyway, it seems like this is
something library authors might hit.

My proposal here is made with nonce reuse in mind, for things like
session keys that use sequential nonces.

A different issue is random nonces. For these, it seems like a call to
getrandom() for each nonce is probably the best bet. But it sounds like
you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I
hope you saw these threads:

- https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
- https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
- https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/

Each one of those touches on vDSO things quite a bit. Basically, the
motivation for doing that is for making userspace RNGs safe and
promoting their use with a variety of kernel enhancements to make that
easy. And IF we are to ship a vDSO RNG, then certainly this vmgenid
business should be exposed that way, over and above other mechanisms.
It'd make the most sense...IF we're going to ship a vDSO RNG.

So the question really is: should we ship a vDSO RNG? I could work on
designing that right. But I'm a little bit skeptical generally of the
whole userspace RNG concept. By and large they always turn out to be
less safe and more complex than the kernel one. So if we're to go that
way, I'd like to understand what the strongest arguments for it are.

Jason

Simo Sorce May 11, 2022, 12:59 p.m. UTC | #8

Hi Jason,

On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote:
> My proposal here is made with nonce reuse in mind, for things like
> session keys that use sequential nonces.

Although this makes sense the problem is that changing applications to
do the right thing based on which situation they are in will never be
done right or soon enough. So I would focus on a solution that makes
the CSPRNGs in crypto libraries safe.

> A different issue is random nonces. For these, it seems like a call to
> getrandom() for each nonce is probably the best bet. But it sounds like
> you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I
> hope you saw these threads:
> 
> - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
> - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
> - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/

4c does sound like a decent solution, it is semantically identical to
an epoch vmgenid, all the library needs to do is to create such a mmap
region, stick a value on  it, verify it is not zero after computing the
next random value but before returning it to the caller.
This reduces the race to a very small window when the machine is frozen
right after the random value is returned to the caller but before it is
used, but hopefully this just means that the two machines will just
make parallel computations that yield the exact same value, so no
catastrophic consequence will arise (there is the odd case where two
random values are sought and the split happens between the two are
retrieved and this has bad consequences, I think we can ignore that).

> Each one of those touches on vDSO things quite a bit. Basically, the
> motivation for doing that is for making userspace RNGs safe and
> promoting their use with a variety of kernel enhancements to make that
> easy. And IF we are to ship a vDSO RNG, then certainly this vmgenid
> business should be exposed that way, over and above other mechanisms.
> It'd make the most sense...IF we're going to ship a vDSO RNG.
> 
> So the question really is: should we ship a vDSO RNG? I could work on
> designing that right. But I'm a little bit skeptical generally of the
> whole userspace RNG concept. By and large they always turn out to be
> less safe and more complex than the kernel one. So if we're to go that
> way, I'd like to understand what the strongest arguments for it are.

I am not entirely sure how a vDSO RNG would work, I think exposing the
epoch(or whatever indicator) is enough, crypto libraries have pretty
good PRNGs, what they require is simply a good source of entropy for
the initial seeding and this safety mechanism to avoid state
duplication on machine cloning.
All the decent libraries already support detecting process forks.

Simo.

Alexander Graf May 11, 2022, 1:19 p.m. UTC | #9

Hi Simo,

On 11.05.22 14:59, Simo Sorce wrote:
> Hi Jason,
>
> On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote:
>> My proposal here is made with nonce reuse in mind, for things like
>> session keys that use sequential nonces.
> Although this makes sense the problem is that changing applications to
> do the right thing based on which situation they are in will never be
> done right or soon enough. So I would focus on a solution that makes
> the CSPRNGs in crypto libraries safe.

I think we intrinsically have 2 sets of applications: Ones that want an 
event based notification and don't care about the racyness of it and 
ones that want an atomic way to determine the epoch. Userspace RNGs are 
naturally in the second category.

>
>> A different issue is random nonces. For these, it seems like a call to
>> getrandom() for each nonce is probably the best bet. But it sounds like
>> you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I
>> hope you saw these threads:
>>
>> - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
>> - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
>> - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/
> 4c does sound like a decent solution, it is semantically identical to
> an epoch vmgenid, all the library needs to do is to create such a mmap
> region, stick a value on  it, verify it is not zero after computing the
> next random value but before returning it to the caller.
> This reduces the race to a very small window when the machine is frozen
> right after the random value is returned to the caller but before it is
> used, but hopefully this just means that the two machines will just
> make parallel computations that yield the exact same value, so no
> catastrophic consequence will arise (there is the odd case where two
> random values are sought and the split happens between the two are
> retrieved and this has bad consequences, I think we can ignore that).

The problem with wiping memory on clone is that it means you must keep 
these special wipe on clone pages always present and pinned in memory 
and can no longer swap them out, compact them or move them for memory 
hotplug.

We started the journey with a WIPEONSUSPEND flag and eventually 
abandoned it because it seemed clunky. Happy to reopen it if we believe 
it's a viable path:

   https://lwn.net/Articles/825230/

Alex

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879

Jason A. Donenfeld May 11, 2022, 1:19 p.m. UTC | #10

Hi Simo,

On Wed, May 11, 2022 at 08:59:07AM -0400, Simo Sorce wrote:
> Hi Jason,
> 
> On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote:
> > My proposal here is made with nonce reuse in mind, for things like
> > session keys that use sequential nonces.
> 
> Although this makes sense the problem is that changing applications to
> do the right thing based on which situation they are in will never be
> done right or soon enough. So I would focus on a solution that makes
> the CSPRNGs in crypto libraries safe.

Please don't dismiss this. I realize you have your one single use case
in mind, but there are others, and the distinction you gave for why we
should dismiss the others to focus on yours doesn't really make any
sense. Here's why:

In my email I pointed out two places where VM forks impact crypto in bad
ways:

- Session keys, wrt nonce reuse.

- Random nonces, wrt nonce reuse.

There are other problems that arise from VM forks too. But these stand
out because they are both quite catastrophic, whether it's duplicated
ECDSA random nonces, or whether it's the same session key used with the
same sequential counter to encrypt different plaintexts with something
like AES-GCM or ChaCha20Poly1305. These are both very, very bad things.

And both things happen in:

- Libraries: crypto lib random number generators (e.g. OpenSSL), crypto
  lib session keys (e.g. any TLS library).

- Applications: application level random number generators (e.g.
  Bitcoin Core *facepalm*), application level session keys (e.g.
  OpenSSH).

So I don't think the "library vs application" distinction is really
meaningful here. Rather, things kind of fall apart all over the place
for a variety of reasons on VM fork.

> > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
> > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
> > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/
> 
> 4c does sound like a decent solution, it is semantically identical to

It does, yeah, but realistically it's never going to happen. I don't
think there's a near- or medium-term chance of changing hypervisor
semantics again. That means for 4-like solutions, there's 4a and 4b.

By the way, that email of mine has inaccuracy in it. I complain about
being in irq context, but it turns out not to be the case; we're inside
of a kthread during the notification, which means we have a lot more
options on what we can do.

If 4 is the solution that appeals to you most, do you want to try your
hand at a RFC patch for it? I don't yet know if that's the best
direction to take, but the devil is kind of in the details, so it might
be interesting to see how it pans out.

Jason

Alexander Graf May 11, 2022, 1:20 p.m. UTC | #11

On 11.05.22 03:18, Jason A. Donenfeld wrote:
> Hi Simo,
>
> On Tue, May 10, 2022 at 08:40:48PM -0400, Simo Sorce wrote:
>> At your request teleporting here the answer I gave on a different
>> thread, reinforced by some thinking.
>>
>> As a user space crypto library person I think the only reasonable
>> interface is something like a vDSO.
>>
>> Poll() interfaces are nice and all for system programs that have full
>> control of their event loop and do not have to react immediately to
>> this event, however crypto libraries do not have the luxury of
>> controlling the main loop of the application.
>>
>> Additionally crypto libraries really need to ensure the value they
>> return from their PRNG is fine, which means they do not return a value
>> if the vmgenid has changed before they can reseed, or there could be
>> catastrophic duplication of "random" values used in IVs or ECDSA
>> Signatures or ids/cookies or whatever.
>>
>> For crypto libraries it is much simpler to poll for this information
>> than using notifications of any kind given libraries are
>> generally not in full control of what the process does.
>>
>> This needs to be polled fast as well, because the whole point of
>> initializing a PRNG in the library is that asking /dev/urandom all the
>> time is too slow (due to context switches and syscall overhead), so
>> anything that would require a context switch in order to pull data from
>> the PRNG would not really fly.
>>
>> A vDSO or similar would allow to pull the vmgenid or whatever epoch
>> value in before generating the random numbers and then barrier-style
>> check that the value is still unchanged before returning the random
>> data to the caller. This will reduce the race condition (which simply
>> cannot be completely avoided) to a very unlikely event.
> It sounds like your library issue is somewhat similar to what Alex was
> talking about with regards to having a hard time using poll in s2n. I'm
> still waiting to hear if Amazon figured out some way that this is
> possible (with, e.g., a thread). But anyway, it seems like this is

Sounds like I didn't reply with my findings - sorry about that. Our s2n 
people *could* build something based on a thread, but are afraid that 
it's racy and would introduce creating a thread which the library does 
not do today.

So in a nutshell, possible yes, desirable no.

I think we're maybe a bit too scared of building something from scratch 
here. What would the best case situation be? Let's roll backwards from 
that then.

 From what I read, we want a "VMGenID v2" device that gives us the 
ability to

   * Get an IRQ on VM clone
   * Store and update an RNG seed value (128bit? Configurable len?) at a 
physical address or stand alone page on clone
   * Store and update a unique-to-this-VM rolling 32bit identifier at a 
physical address or stand alone page on clone

We can either make the device expose these values as individual pages 
(like VMGenID does today) or as guest physical addresses that it needs 
to store into (like virtio-rng). The latter makes protection from DMA 
attacks of the hypervisor and kexec slightly more complicated, but it 
would be doable.

VMGenID has 2 out of 3 features above. So why don't we just go the easy 
route, add a second property to VMGenID that gives us another page with 
that 32bit value and then provide a /dev/vmgenid device node you can 
open and mmap() that 32bit value page from?

User space libraries can then try to open on init and determine their epoch.
For the async event, we add the poll() logic again to /dev/vmgenid and 
make networkd for example use that.

Alex

Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879

Simo Sorce May 11, 2022, 2:32 p.m. UTC | #12

Hi Jason,

On Wed, 2022-05-11 at 15:19 +0200, Jason A. Donenfeld wrote:
> Please don't dismiss this. I realize you have your one single use case
> in mind, but there are others, and the distinction you gave for why we
> should dismiss the others to focus on yours doesn't really make any
> sense. Here's why:

I do not think I am dismissing any other use cases, clearly anything
that depend on unique random numbers for security is impacted, but I
tend to focus where we can get the biggest impact. 

> In my email I pointed out two places where VM forks impact crypto in bad
> ways:
> 
> - Session keys, wrt nonce reuse.
> 
> - Random nonces, wrt nonce reuse.
> 
> There are other problems that arise from VM forks too. But these stand
> out because they are both quite catastrophic, whether it's duplicated
> ECDSA random nonces, or whether it's the same session key used with the
> same sequential counter to encrypt different plaintexts with something
> like AES-GCM or ChaCha20Poly1305. These are both very, very bad things.
> 
> And both things happen in:
> 
> - Libraries: crypto lib random number generators (e.g. OpenSSL), crypto
>   lib session keys (e.g. any TLS library).
> 
> - Applications: application level random number generators (e.g.
>   Bitcoin Core *facepalm*), application level session keys (e.g.
>   OpenSSH).

Yes, some applications that are involved with security do have their
own application level PRNGs, clearly they will have to either stop
using customized PRNGs and use the library provided ones (or even just
/dev/urandom if their needs are no performance critical) or adjust
their own PRNGs to be safe using whatever mechanism will be provided.

> So I don't think the "library vs application" distinction is really
> meaningful here. Rather, things kind of fall apart all over the place
> for a variety of reasons on VM fork.

I am not really making a library vs application distinction, what I am
saying is that the library uses case has a set of tighter constraints
than the application one. Basically anything a library can use an
application can as well, while the contrary is not true. Therefore it
if we resolve the library problem, applications will have a solution as
well.

> > > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
> > > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
> > > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/
> > 
> > 4c does sound like a decent solution, it is semantically identical to
> 
> It does, yeah, but realistically it's never going to happen. I don't
> think there's a near- or medium-term chance of changing hypervisor
> semantics again. That means for 4-like solutions, there's 4a and 4b.

I think 4a and 4b are fine mechanisms too, 4c is just more efficient,
and potentially optimizable in HW.
That said I think 3 (vDSO) is also a fine solution, and would not be
disappointed if 3 was chosen over 4.

I am not really after evaluating how it is done below the kernel
boundary. As long as the effects are the same, semantically, from the
user space pov.

> By the way, that email of mine has inaccuracy in it. I complain about
> being in irq context, but it turns out not to be the case; we're inside
> of a kthread during the notification, which means we have a lot more
> options on what we can do.
> 
> If 4 is the solution that appeals to you most, do you want to try your
> hand at a RFC patch for it? I don't yet know if that's the best
> direction to take, but the devil is kind of in the details, so it might
> be interesting to see how it pans out.

I think it would be prudent to agree on the correct mechanisms before
venturing into potentially invasive patches.

Simo.

Luis Chamberlain May 12, 2022, 5:40 p.m. UTC | #13

On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote:
> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
> > 
> > > Just wanted to double check with you that this change wouldn't break how
> > > you're using it in systemd for /proc/sys/kernel/hostname:
> > >
> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
> > >
> > > I couldn't find anybody else actually polling on it. Interestingly, it
> > > looks like sd_event_add_io uses epoll() inside, but you're not hitting
> > > the bug that Jann pointed out (because I suppose you're not poll()ing on
> > > an epoll fd).
> > 
> > Well, if you made sure this still works, I am fine either way ;-)
> 
> Actually... ugh. It doesn't work. systemd uses uname() to read the host
> name, and doesn't actually read() the file descriptor after receiving
> the poll event on it. So I guess I'll forget this, and maybe we'll have
> to live with sysctl's poll() being broken. :(

A kconfig option may let you do what you want, and allow older kernels
to not break, however I am more curious how sysctl's approach to poll
went unnnoticed for so long. But also, I'm curious if it was based on
another poll implementation which may have been busted.

But more importantly, how do we avoid this in the future?

  Luis

Lucas De Marchi May 12, 2022, 6:22 p.m. UTC | #14

On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote:
>Events that poll() responds to are supposed to be consumed when the file
>is read(), not by the poll() itself. By putting it on the poll() itself,
>it makes it impossible to poll() on a epoll file descriptor, since the
>event gets consumed too early. Jann wrote a PoC, available in the link
>below.
>
>Reported-by: Jann Horn <jannh@google.com>
>Cc: Kees Cook <keescook@chromium.org>
>Cc: Luis Chamberlain <mcgrof@kernel.org>
>Cc: linux-fsdevel@vger.kernel.org
>Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
>Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

It seems to be my bug. This is indeed better. Also, I don't think it's unsafe
to fix it like this neither. If my memory serves (it's what, 10+ years?), this
was only tested and used with poll(), which will continue to work.

There were plans to use it in one of systemd's tools, in which case we'd
probably notice the misbehavior with epoll().... humn, checking now systemd's
codebase:

static int on_hostname_change(sd_event_source *es, int fd, uint32_t revents, void *userdata) {
	...
	log_info("System hostname changed to '%s'.", full_hostname);
	...
}

static int manager_watch_hostname(Manager *m) {
         int r;

         assert(m);

         m->hostname_fd = open("/proc/sys/kernel/hostname",
                               O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY);
         if (m->hostname_fd < 0) {
                 log_warning_errno(errno, "Failed to watch hostname: %m");
                 return 0;
         }

         r = sd_event_add_io(m->event, &m->hostname_event_source, m->hostname_fd, 0, on_hostname_change, m);
         if (r < 0) {
                 if (r == -EPERM)
                         /* kernels prior to 3.2 don't support polling this file. Ignore the failure. */
                         m->hostname_fd = safe_close(m->hostname_fd);
                 else
                         return log_error_errno(r, "Failed to add hostname event source: %m");
         }
	....
}

and sd_event library uses epoll. So, it's apparently not working and it doesn't
seem to be their intention to rely on the misbehavior. This makes me think it
even deserves a Cc to stable.

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>


Lucas De Marchi

>---
> fs/proc/proc_sysctl.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
>diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>index 7d9cfc730bd4..1aa145794207 100644
>--- a/fs/proc/proc_sysctl.c
>+++ b/fs/proc/proc_sysctl.c
>@@ -622,6 +622,14 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
>
> static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter)
> {
>+	struct inode *inode = file_inode(iocb->ki_filp);
>+	struct ctl_table_header *head = grab_header(inode);
>+	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
>+
>+	if (!IS_ERR(head) && table->poll)
>+		iocb->ki_filp->private_data = proc_sys_poll_event(table->poll);
>+	sysctl_head_finish(head);
>+
> 	return proc_sys_call_handler(iocb, iter, 0);
> }
>
>@@ -668,10 +676,8 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
> 	event = (unsigned long)filp->private_data;
> 	poll_wait(filp, &table->poll->wait, wait);
>
>-	if (event != atomic_read(&table->poll->event)) {
>-		filp->private_data = proc_sys_poll_event(table->poll);
>+	if (event != atomic_read(&table->poll->event))
> 		ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI;
>-	}
>
> out:
> 	sysctl_head_finish(head);
>-- 
>2.35.1
>

Jason A. Donenfeld May 12, 2022, 6:27 p.m. UTC | #15

Hi Lucas,

On 5/12/22, Lucas De Marchi <lucas.demarchi@intel.com> wrote:
> On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote:
>>Events that poll() responds to are supposed to be consumed when the file
>>is read(), not by the poll() itself. By putting it on the poll() itself,
>>it makes it impossible to poll() on a epoll file descriptor, since the
>>event gets consumed too early. Jann wrote a PoC, available in the link
>>below.
>>
>>Reported-by: Jann Horn <jannh@google.com>
>>Cc: Kees Cook <keescook@chromium.org>
>>Cc: Luis Chamberlain <mcgrof@kernel.org>
>>Cc: linux-fsdevel@vger.kernel.org
>>Link:
>> https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/
>>Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>
> It seems to be my bug. This is indeed better. Also, I don't think it's
> unsafe
> to fix it like this neither. If my memory serves (it's what, 10+ years?),
> this
> was only tested and used with poll(), which will continue to work.

You are not correct. Please read the entire thread. This breaks systemd.

Jason

Eric W. Biederman May 12, 2022, 6:29 p.m. UTC | #16

Luis Chamberlain <mcgrof@kernel.org> writes:

> On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote:
>> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
>> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
>> > 
>> > > Just wanted to double check with you that this change wouldn't break how
>> > > you're using it in systemd for /proc/sys/kernel/hostname:
>> > >
>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
>> > >
>> > > I couldn't find anybody else actually polling on it. Interestingly, it
>> > > looks like sd_event_add_io uses epoll() inside, but you're not hitting
>> > > the bug that Jann pointed out (because I suppose you're not poll()ing on
>> > > an epoll fd).
>> > 
>> > Well, if you made sure this still works, I am fine either way ;-)
>> 
>> Actually... ugh. It doesn't work. systemd uses uname() to read the host
>> name, and doesn't actually read() the file descriptor after receiving
>> the poll event on it. So I guess I'll forget this, and maybe we'll have
>> to live with sysctl's poll() being broken. :(

We should be able to modify calling uname() to act the same as reading
the file descriptor. 

> A kconfig option may let you do what you want, and allow older kernels
> to not break, however I am more curious how sysctl's approach to poll
> went unnnoticed for so long. But also, I'm curious if it was based on
> another poll implementation which may have been busted.
>
> But more importantly, how do we avoid this in the future?

Poll on files is weird and generally doesn't work (because files are
always read to read or write).  What did we do to make it work on these
sysctl files?

Eric

Jason A. Donenfeld May 12, 2022, 6:32 p.m. UTC | #17

Hi Eric,

On 5/12/22, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Luis Chamberlain <mcgrof@kernel.org> writes:
>
>> On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote:
>>> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote:
>>> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
>>> >
>>> > > Just wanted to double check with you that this change wouldn't break
>>> > > how
>>> > > you're using it in systemd for /proc/sys/kernel/hostname:
>>> > >
>>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832
>>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465
>>> > >
>>> > > I couldn't find anybody else actually polling on it. Interestingly,
>>> > > it
>>> > > looks like sd_event_add_io uses epoll() inside, but you're not
>>> > > hitting
>>> > > the bug that Jann pointed out (because I suppose you're not poll()ing
>>> > > on
>>> > > an epoll fd).
>>> >
>>> > Well, if you made sure this still works, I am fine either way ;-)
>>>
>>> Actually... ugh. It doesn't work. systemd uses uname() to read the host
>>> name, and doesn't actually read() the file descriptor after receiving
>>> the poll event on it. So I guess I'll forget this, and maybe we'll have
>>> to live with sysctl's poll() being broken. :(
>
> We should be able to modify calling uname() to act the same as reading
> the file descriptor.

How? That sounds like madness. read() takes a fd. uname() doesn't. Are
you proposing we walk through the fds of the process calling uname()
til we find a matching one and then twiddle it's private context
state? I mean I guess that'd work, but...

Jason

[1/2] sysctl: read() must consume poll events, not poll()

Commit Message

Comments

Patch