Message ID | 20220502140602.130373-1-Jason@zx2c4.com |
---|---|
State | New |
Headers | show |
Series | [1/2] sysctl: read() must consume poll events, not poll() | expand |
On Mo, 02.05.22 20:04, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > > I can just tell you, that in systemd we'd have a usecase for consuming > > such a generation counter: we try to provide stable MAC addresses for > > synthetic network interfaces managed by networkd, so we hash them from > > /etc/machine-id, but otoh people also want them to change when they > > clone their VMs. We could very nicely solve this if we had a > > generation counter easily accessible from userspace, that starts at 0 > > initially. Because then we can hash as we always did when the counter > > is zero, but otherwise use something else, possibly hashed from the > > generation counter. > > This doesn't work, because you could have memory-A split into memory-A.1 > and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with > the same new value "2". Yes, that's why I as vague about what to switch to if the counter is non-zero, i.e. "something else, *possibly* hashed…". For this MAC address usecase it's entirely sufficient to be able to distinguish if the system was closed at all, i.e. if the counter is zero or is non-zero. Because that would already be great for a policy of "hash it in a stable way from /etc/machine-id, if counter == 0" + "use random MAC once counter > 0". Such a MAC address policy I think should probably even be the new default in networkd, if we could implement it. For that we'd need a single bit of info from the kernel, indicating whether the sysem was cloned at all. i.e. if the vmgenid uuid is different from the one the system booted up first. Lennart -- Lennart Poettering, Berlin
Hey Lennart, On Tue, May 03, 2022 at 09:42:40AM +0200, Lennart Poettering wrote: > For this MAC address usecase it's entirely sufficient to be able to > distinguish if the system was closed at all, i.e. if the counter is > zero or is non-zero. Because that would already be great for a policy > of "hash it in a stable way from /etc/machine-id, if counter == 0" + > "use random MAC once counter > 0". Hm, are you sure that's actually what you want? It turns out this vmgenid notification from the hypervisor might not be sufficiently granular for this use case: - vmgenid changes when you fork a new snapshot, so now you have two VMs - vmgenid also changes when you rewind to 2 minutes ago The first is what I assume you care about for this networkd business. The second is probably not what any networkd user expects. [Aside: I hope there are few networkd users; having seen what Yu did with wireguard and how fast and recklessly that went, I can't recommend that part of systemd to anyone.]
On Tue, May 03, 2022 at 11:32:26AM +0200, Lennart Poettering wrote: > So first of all, it appears to me that rewinding a VM is something people > would do for debugging things, i.e. not how things are done on > deployment. I wouldn't be so sure here... Some people have processes around, "always start out from the same place", like for build machines, and employ a single VM snapshot that's always rewound after use. It's never forked into multiple snapshots, but just always goes back to that same starting point. > > >From the perspective of randomness, both of these events imply the same > > thing. The situation is BAD; reseed immediately. From the perspective of > > MAC addresses, though, these events would imply different behavior, > > right? So it seems like vmgenid might need an additional field for this > > use case. Relatedly, VMware has that prompt where you select about your > > VM whether, "I moved it" or "I copied it." Presumably something like > > that would play a part in what is decided as part of this hypothetical > > second field. > > networkd doesn't change MAC addresses in the middle of everything, but > only when a network interface is downed and upped again. This for > example happens when a link beat goes away and comes back. In the > rewind-2min case i'd assume the link beat would probably be restored > to what it was 2min ago (and thus stay online), but in the clone case > it would probably drop momentarily and be restored than, to tell > software to reacquire dhcp and so on. That sounds like it's going to be sort of confusing. Let's say we've got some VM scenario in which rewinds are common due to whatever weird process (such as a build machine that wants to start out at the same place each time). During its course of execution, it reboots, or maybe there's some network connectivity issue and the link goes down. In that case, when the link comes up, it's going to have a different MAC address? That doesn't make much sense to me, but maybe I'm missing some bigger picture detail. Jason
On Tue, May 03, 2022 at 10:29:49AM +0200, Lennart Poettering wrote: > As mentioned earlier, I am not convinced sysctl is the right place for > this. sysctls are understood by most people as being the place for > tweaking kernel settings. This is not a kernel setting, but a > notification concept, and the way Jason defined it there's nothing to > read nor write, which strongly suggests to move it elsewhere, but not > /proc/sys/. I think I'm coming around to this view that having a sysctl return -ENODATA is weird. It makes `sysctl -a` always complain to stderr, for example, which seems bad. > > I can see attractiveness in providing the /run/fork-id directly from the > > kernel though, to remove the dependency on systemd for poll-less > > notification of libraries. > > I agree. I'm still not convinced there's value in having a counter or a UUID, but if you had to choose, would you prefer a counter or a UUID? It sounds like the former, because you see a use for distinguishing between zero and non-zero? Or did you finally agree with me that vmgenid isn't granular enough for that? Jason
On Di, 03.05.22 13:55, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > I'm still not convinced there's value in having a counter or a UUID, but > if you had to choose, would you prefer a counter or a UUID? It sounds > like the former, because you see a use for distinguishing between zero > and non-zero? Or did you finally agree with me that vmgenid isn't > granular enough for that? I would prefer a monotonic counter, since it allows answering questions like the following: 1. Did this image get cloned at all? (i.e. counter != 0; usecase as per the MAC address discussion) 2. Did the image get cloned since the last time I looked? (i.e. counter != my_previously_saved_counter; usecase: detect clones in an "offline" fashion, i.e. from a component that doesn't continously run, but only from time to time) 3. How many clones did I miss? (i.e. missed_clones = my_previously_saved_counter - counter; usecase: catch up with generating proxy D-Bus signal messages for clones). There might be more. Using a UUID would not give us #1 or #3. It would deliver #2 however. Lennart -- Lennart Poettering, Berlin
On Mon, 2022-05-02 at 16:06 +0200, Jason A. Donenfeld wrote: > In order to inform userspace of virtual machine forks, this commit adds > a "fork_event" sysctl, which does not return any data, but allows > userspace processes to poll() on it for notification of VM forks. > > It avoids exposing the actual vmgenid from the hypervisor to userspace, > in case there is any randomness value in keeping it secret. Rather, > userspace is expected to simply use getrandom() if it wants a fresh > value. > > For example, the following snippet can be used to print a message every > time a VM forks, after the RNG has been reseeded: > > struct pollfd fd = { .fd = open("/proc/sys/kernel/random/fork_event", O_RDONLY) }; > assert(fd.fd >= 0); > for (;;) { > read(fd.fd, NULL, 0); > assert(poll(&fd, 1, -1) > 0); > puts("vm fork detected"); > } > > Various programs and libraries that utilize cryptographic operations > depending on fresh randomness can invalidate old keys or take other > appropriate actions when receiving that event. While this is racier than > allowing userspace to mmap/vDSO the vmgenid itself, it's an incremental > step forward that's not as heavyweight. At your request teleporting here the answer I gave on a different thread, reinforced by some thinking. As a user space crypto library person I think the only reasonable interface is something like a vDSO. Poll() interfaces are nice and all for system programs that have full control of their event loop and do not have to react immediately to this event, however crypto libraries do not have the luxury of controlling the main loop of the application. Additionally crypto libraries really need to ensure the value they return from their PRNG is fine, which means they do not return a value if the vmgenid has changed before they can reseed, or there could be catastrophic duplication of "random" values used in IVs or ECDSA Signatures or ids/cookies or whatever. For crypto libraries it is much simpler to poll for this information than using notifications of any kind given libraries are generally not in full control of what the process does. This needs to be polled fast as well, because the whole point of initializing a PRNG in the library is that asking /dev/urandom all the time is too slow (due to context switches and syscall overhead), so anything that would require a context switch in order to pull data from the PRNG would not really fly. A vDSO or similar would allow to pull the vmgenid or whatever epoch value in before generating the random numbers and then barrier-style check that the value is still unchanged before returning the random data to the caller. This will reduce the race condition (which simply cannot be completely avoided) to a very unlikely event. HTH, Simo.
Hi Simo, On Tue, May 10, 2022 at 08:40:48PM -0400, Simo Sorce wrote: > At your request teleporting here the answer I gave on a different > thread, reinforced by some thinking. > > As a user space crypto library person I think the only reasonable > interface is something like a vDSO. > > Poll() interfaces are nice and all for system programs that have full > control of their event loop and do not have to react immediately to > this event, however crypto libraries do not have the luxury of > controlling the main loop of the application. > > Additionally crypto libraries really need to ensure the value they > return from their PRNG is fine, which means they do not return a value > if the vmgenid has changed before they can reseed, or there could be > catastrophic duplication of "random" values used in IVs or ECDSA > Signatures or ids/cookies or whatever. > > For crypto libraries it is much simpler to poll for this information > than using notifications of any kind given libraries are > generally not in full control of what the process does. > > This needs to be polled fast as well, because the whole point of > initializing a PRNG in the library is that asking /dev/urandom all the > time is too slow (due to context switches and syscall overhead), so > anything that would require a context switch in order to pull data from > the PRNG would not really fly. > > A vDSO or similar would allow to pull the vmgenid or whatever epoch > value in before generating the random numbers and then barrier-style > check that the value is still unchanged before returning the random > data to the caller. This will reduce the race condition (which simply > cannot be completely avoided) to a very unlikely event. It sounds like your library issue is somewhat similar to what Alex was talking about with regards to having a hard time using poll in s2n. I'm still waiting to hear if Amazon figured out some way that this is possible (with, e.g., a thread). But anyway, it seems like this is something library authors might hit. My proposal here is made with nonce reuse in mind, for things like session keys that use sequential nonces. A different issue is random nonces. For these, it seems like a call to getrandom() for each nonce is probably the best bet. But it sounds like you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I hope you saw these threads: - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ Each one of those touches on vDSO things quite a bit. Basically, the motivation for doing that is for making userspace RNGs safe and promoting their use with a variety of kernel enhancements to make that easy. And IF we are to ship a vDSO RNG, then certainly this vmgenid business should be exposed that way, over and above other mechanisms. It'd make the most sense...IF we're going to ship a vDSO RNG. So the question really is: should we ship a vDSO RNG? I could work on designing that right. But I'm a little bit skeptical generally of the whole userspace RNG concept. By and large they always turn out to be less safe and more complex than the kernel one. So if we're to go that way, I'd like to understand what the strongest arguments for it are. Jason
Hi Jason, On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote: > My proposal here is made with nonce reuse in mind, for things like > session keys that use sequential nonces. Although this makes sense the problem is that changing applications to do the right thing based on which situation they are in will never be done right or soon enough. So I would focus on a solution that makes the CSPRNGs in crypto libraries safe. > A different issue is random nonces. For these, it seems like a call to > getrandom() for each nonce is probably the best bet. But it sounds like > you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I > hope you saw these threads: > > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ 4c does sound like a decent solution, it is semantically identical to an epoch vmgenid, all the library needs to do is to create such a mmap region, stick a value on it, verify it is not zero after computing the next random value but before returning it to the caller. This reduces the race to a very small window when the machine is frozen right after the random value is returned to the caller but before it is used, but hopefully this just means that the two machines will just make parallel computations that yield the exact same value, so no catastrophic consequence will arise (there is the odd case where two random values are sought and the split happens between the two are retrieved and this has bad consequences, I think we can ignore that). > Each one of those touches on vDSO things quite a bit. Basically, the > motivation for doing that is for making userspace RNGs safe and > promoting their use with a variety of kernel enhancements to make that > easy. And IF we are to ship a vDSO RNG, then certainly this vmgenid > business should be exposed that way, over and above other mechanisms. > It'd make the most sense...IF we're going to ship a vDSO RNG. > > So the question really is: should we ship a vDSO RNG? I could work on > designing that right. But I'm a little bit skeptical generally of the > whole userspace RNG concept. By and large they always turn out to be > less safe and more complex than the kernel one. So if we're to go that > way, I'd like to understand what the strongest arguments for it are. I am not entirely sure how a vDSO RNG would work, I think exposing the epoch(or whatever indicator) is enough, crypto libraries have pretty good PRNGs, what they require is simply a good source of entropy for the initial seeding and this safety mechanism to avoid state duplication on machine cloning. All the decent libraries already support detecting process forks. Simo.
Hi Simo, On 11.05.22 14:59, Simo Sorce wrote: > Hi Jason, > > On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote: >> My proposal here is made with nonce reuse in mind, for things like >> session keys that use sequential nonces. > Although this makes sense the problem is that changing applications to > do the right thing based on which situation they are in will never be > done right or soon enough. So I would focus on a solution that makes > the CSPRNGs in crypto libraries safe. I think we intrinsically have 2 sets of applications: Ones that want an event based notification and don't care about the racyness of it and ones that want an atomic way to determine the epoch. Userspace RNGs are naturally in the second category. > >> A different issue is random nonces. For these, it seems like a call to >> getrandom() for each nonce is probably the best bet. But it sounds like >> you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I >> hope you saw these threads: >> >> - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ >> - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ >> - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ > 4c does sound like a decent solution, it is semantically identical to > an epoch vmgenid, all the library needs to do is to create such a mmap > region, stick a value on it, verify it is not zero after computing the > next random value but before returning it to the caller. > This reduces the race to a very small window when the machine is frozen > right after the random value is returned to the caller but before it is > used, but hopefully this just means that the two machines will just > make parallel computations that yield the exact same value, so no > catastrophic consequence will arise (there is the odd case where two > random values are sought and the split happens between the two are > retrieved and this has bad consequences, I think we can ignore that). The problem with wiping memory on clone is that it means you must keep these special wipe on clone pages always present and pinned in memory and can no longer swap them out, compact them or move them for memory hotplug. We started the journey with a WIPEONSUSPEND flag and eventually abandoned it because it seemed clunky. Happy to reopen it if we believe it's a viable path: https://lwn.net/Articles/825230/ Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Hi Simo, On Wed, May 11, 2022 at 08:59:07AM -0400, Simo Sorce wrote: > Hi Jason, > > On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote: > > My proposal here is made with nonce reuse in mind, for things like > > session keys that use sequential nonces. > > Although this makes sense the problem is that changing applications to > do the right thing based on which situation they are in will never be > done right or soon enough. So I would focus on a solution that makes > the CSPRNGs in crypto libraries safe. Please don't dismiss this. I realize you have your one single use case in mind, but there are others, and the distinction you gave for why we should dismiss the others to focus on yours doesn't really make any sense. Here's why: In my email I pointed out two places where VM forks impact crypto in bad ways: - Session keys, wrt nonce reuse. - Random nonces, wrt nonce reuse. There are other problems that arise from VM forks too. But these stand out because they are both quite catastrophic, whether it's duplicated ECDSA random nonces, or whether it's the same session key used with the same sequential counter to encrypt different plaintexts with something like AES-GCM or ChaCha20Poly1305. These are both very, very bad things. And both things happen in: - Libraries: crypto lib random number generators (e.g. OpenSSL), crypto lib session keys (e.g. any TLS library). - Applications: application level random number generators (e.g. Bitcoin Core *facepalm*), application level session keys (e.g. OpenSSH). So I don't think the "library vs application" distinction is really meaningful here. Rather, things kind of fall apart all over the place for a variety of reasons on VM fork. > > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ > > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ > > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ > > 4c does sound like a decent solution, it is semantically identical to It does, yeah, but realistically it's never going to happen. I don't think there's a near- or medium-term chance of changing hypervisor semantics again. That means for 4-like solutions, there's 4a and 4b. By the way, that email of mine has inaccuracy in it. I complain about being in irq context, but it turns out not to be the case; we're inside of a kthread during the notification, which means we have a lot more options on what we can do. If 4 is the solution that appeals to you most, do you want to try your hand at a RFC patch for it? I don't yet know if that's the best direction to take, but the devil is kind of in the details, so it might be interesting to see how it pans out. Jason
On 11.05.22 03:18, Jason A. Donenfeld wrote: > Hi Simo, > > On Tue, May 10, 2022 at 08:40:48PM -0400, Simo Sorce wrote: >> At your request teleporting here the answer I gave on a different >> thread, reinforced by some thinking. >> >> As a user space crypto library person I think the only reasonable >> interface is something like a vDSO. >> >> Poll() interfaces are nice and all for system programs that have full >> control of their event loop and do not have to react immediately to >> this event, however crypto libraries do not have the luxury of >> controlling the main loop of the application. >> >> Additionally crypto libraries really need to ensure the value they >> return from their PRNG is fine, which means they do not return a value >> if the vmgenid has changed before they can reseed, or there could be >> catastrophic duplication of "random" values used in IVs or ECDSA >> Signatures or ids/cookies or whatever. >> >> For crypto libraries it is much simpler to poll for this information >> than using notifications of any kind given libraries are >> generally not in full control of what the process does. >> >> This needs to be polled fast as well, because the whole point of >> initializing a PRNG in the library is that asking /dev/urandom all the >> time is too slow (due to context switches and syscall overhead), so >> anything that would require a context switch in order to pull data from >> the PRNG would not really fly. >> >> A vDSO or similar would allow to pull the vmgenid or whatever epoch >> value in before generating the random numbers and then barrier-style >> check that the value is still unchanged before returning the random >> data to the caller. This will reduce the race condition (which simply >> cannot be completely avoided) to a very unlikely event. > It sounds like your library issue is somewhat similar to what Alex was > talking about with regards to having a hard time using poll in s2n. I'm > still waiting to hear if Amazon figured out some way that this is > possible (with, e.g., a thread). But anyway, it seems like this is Sounds like I didn't reply with my findings - sorry about that. Our s2n people *could* build something based on a thread, but are afraid that it's racy and would introduce creating a thread which the library does not do today. So in a nutshell, possible yes, desirable no. I think we're maybe a bit too scared of building something from scratch here. What would the best case situation be? Let's roll backwards from that then. From what I read, we want a "VMGenID v2" device that gives us the ability to * Get an IRQ on VM clone * Store and update an RNG seed value (128bit? Configurable len?) at a physical address or stand alone page on clone * Store and update a unique-to-this-VM rolling 32bit identifier at a physical address or stand alone page on clone We can either make the device expose these values as individual pages (like VMGenID does today) or as guest physical addresses that it needs to store into (like virtio-rng). The latter makes protection from DMA attacks of the hypervisor and kexec slightly more complicated, but it would be doable. VMGenID has 2 out of 3 features above. So why don't we just go the easy route, add a second property to VMGenID that gives us another page with that 32bit value and then provide a /dev/vmgenid device node you can open and mmap() that 32bit value page from? User space libraries can then try to open on init and determine their epoch. For the async event, we add the poll() logic again to /dev/vmgenid and make networkd for example use that. Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Hi Jason, On Wed, 2022-05-11 at 15:19 +0200, Jason A. Donenfeld wrote: > Please don't dismiss this. I realize you have your one single use case > in mind, but there are others, and the distinction you gave for why we > should dismiss the others to focus on yours doesn't really make any > sense. Here's why: I do not think I am dismissing any other use cases, clearly anything that depend on unique random numbers for security is impacted, but I tend to focus where we can get the biggest impact. > In my email I pointed out two places where VM forks impact crypto in bad > ways: > > - Session keys, wrt nonce reuse. > > - Random nonces, wrt nonce reuse. > > There are other problems that arise from VM forks too. But these stand > out because they are both quite catastrophic, whether it's duplicated > ECDSA random nonces, or whether it's the same session key used with the > same sequential counter to encrypt different plaintexts with something > like AES-GCM or ChaCha20Poly1305. These are both very, very bad things. > > And both things happen in: > > - Libraries: crypto lib random number generators (e.g. OpenSSL), crypto > lib session keys (e.g. any TLS library). > > - Applications: application level random number generators (e.g. > Bitcoin Core *facepalm*), application level session keys (e.g. > OpenSSH). Yes, some applications that are involved with security do have their own application level PRNGs, clearly they will have to either stop using customized PRNGs and use the library provided ones (or even just /dev/urandom if their needs are no performance critical) or adjust their own PRNGs to be safe using whatever mechanism will be provided. > So I don't think the "library vs application" distinction is really > meaningful here. Rather, things kind of fall apart all over the place > for a variety of reasons on VM fork. I am not really making a library vs application distinction, what I am saying is that the library uses case has a set of tighter constraints than the application one. Basically anything a library can use an application can as well, while the contrary is not true. Therefore it if we resolve the library problem, applications will have a solution as well. > > > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ > > > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ > > > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ > > > > 4c does sound like a decent solution, it is semantically identical to > > It does, yeah, but realistically it's never going to happen. I don't > think there's a near- or medium-term chance of changing hypervisor > semantics again. That means for 4-like solutions, there's 4a and 4b. I think 4a and 4b are fine mechanisms too, 4c is just more efficient, and potentially optimizable in HW. That said I think 3 (vDSO) is also a fine solution, and would not be disappointed if 3 was chosen over 4. I am not really after evaluating how it is done below the kernel boundary. As long as the effects are the same, semantically, from the user space pov. > By the way, that email of mine has inaccuracy in it. I complain about > being in irq context, but it turns out not to be the case; we're inside > of a kthread during the notification, which means we have a lot more > options on what we can do. > > If 4 is the solution that appeals to you most, do you want to try your > hand at a RFC patch for it? I don't yet know if that's the best > direction to take, but the devil is kind of in the details, so it might > be interesting to see how it pans out. I think it would be prudent to agree on the correct mechanisms before venturing into potentially invasive patches. Simo.
On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote: > On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote: > > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > > > > > Just wanted to double check with you that this change wouldn't break how > > > you're using it in systemd for /proc/sys/kernel/hostname: > > > > > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832 > > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465 > > > > > > I couldn't find anybody else actually polling on it. Interestingly, it > > > looks like sd_event_add_io uses epoll() inside, but you're not hitting > > > the bug that Jann pointed out (because I suppose you're not poll()ing on > > > an epoll fd). > > > > Well, if you made sure this still works, I am fine either way ;-) > > Actually... ugh. It doesn't work. systemd uses uname() to read the host > name, and doesn't actually read() the file descriptor after receiving > the poll event on it. So I guess I'll forget this, and maybe we'll have > to live with sysctl's poll() being broken. :( A kconfig option may let you do what you want, and allow older kernels to not break, however I am more curious how sysctl's approach to poll went unnnoticed for so long. But also, I'm curious if it was based on another poll implementation which may have been busted. But more importantly, how do we avoid this in the future? Luis
On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote: >Events that poll() responds to are supposed to be consumed when the file >is read(), not by the poll() itself. By putting it on the poll() itself, >it makes it impossible to poll() on a epoll file descriptor, since the >event gets consumed too early. Jann wrote a PoC, available in the link >below. > >Reported-by: Jann Horn <jannh@google.com> >Cc: Kees Cook <keescook@chromium.org> >Cc: Luis Chamberlain <mcgrof@kernel.org> >Cc: linux-fsdevel@vger.kernel.org >Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/ >Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> It seems to be my bug. This is indeed better. Also, I don't think it's unsafe to fix it like this neither. If my memory serves (it's what, 10+ years?), this was only tested and used with poll(), which will continue to work. There were plans to use it in one of systemd's tools, in which case we'd probably notice the misbehavior with epoll().... humn, checking now systemd's codebase: static int on_hostname_change(sd_event_source *es, int fd, uint32_t revents, void *userdata) { ... log_info("System hostname changed to '%s'.", full_hostname); ... } static int manager_watch_hostname(Manager *m) { int r; assert(m); m->hostname_fd = open("/proc/sys/kernel/hostname", O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY); if (m->hostname_fd < 0) { log_warning_errno(errno, "Failed to watch hostname: %m"); return 0; } r = sd_event_add_io(m->event, &m->hostname_event_source, m->hostname_fd, 0, on_hostname_change, m); if (r < 0) { if (r == -EPERM) /* kernels prior to 3.2 don't support polling this file. Ignore the failure. */ m->hostname_fd = safe_close(m->hostname_fd); else return log_error_errno(r, "Failed to add hostname event source: %m"); } .... } and sd_event library uses epoll. So, it's apparently not working and it doesn't seem to be their intention to rely on the misbehavior. This makes me think it even deserves a Cc to stable. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Lucas De Marchi >--- > fs/proc/proc_sysctl.c | 12 +++++++++--- > 1 file changed, 9 insertions(+), 3 deletions(-) > >diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c >index 7d9cfc730bd4..1aa145794207 100644 >--- a/fs/proc/proc_sysctl.c >+++ b/fs/proc/proc_sysctl.c >@@ -622,6 +622,14 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter, > > static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter) > { >+ struct inode *inode = file_inode(iocb->ki_filp); >+ struct ctl_table_header *head = grab_header(inode); >+ struct ctl_table *table = PROC_I(inode)->sysctl_entry; >+ >+ if (!IS_ERR(head) && table->poll) >+ iocb->ki_filp->private_data = proc_sys_poll_event(table->poll); >+ sysctl_head_finish(head); >+ > return proc_sys_call_handler(iocb, iter, 0); > } > >@@ -668,10 +676,8 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait) > event = (unsigned long)filp->private_data; > poll_wait(filp, &table->poll->wait, wait); > >- if (event != atomic_read(&table->poll->event)) { >- filp->private_data = proc_sys_poll_event(table->poll); >+ if (event != atomic_read(&table->poll->event)) > ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI; >- } > > out: > sysctl_head_finish(head); >-- >2.35.1 >
Hi Lucas, On 5/12/22, Lucas De Marchi <lucas.demarchi@intel.com> wrote: > On Mon, May 02, 2022 at 04:06:01PM +0200, Jason A. Donenfeld wrote: >>Events that poll() responds to are supposed to be consumed when the file >>is read(), not by the poll() itself. By putting it on the poll() itself, >>it makes it impossible to poll() on a epoll file descriptor, since the >>event gets consumed too early. Jann wrote a PoC, available in the link >>below. >> >>Reported-by: Jann Horn <jannh@google.com> >>Cc: Kees Cook <keescook@chromium.org> >>Cc: Luis Chamberlain <mcgrof@kernel.org> >>Cc: linux-fsdevel@vger.kernel.org >>Link: >> https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/ >>Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > It seems to be my bug. This is indeed better. Also, I don't think it's > unsafe > to fix it like this neither. If my memory serves (it's what, 10+ years?), > this > was only tested and used with poll(), which will continue to work. You are not correct. Please read the entire thread. This breaks systemd. Jason
Luis Chamberlain <mcgrof@kernel.org> writes: > On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote: >> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote: >> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote: >> > >> > > Just wanted to double check with you that this change wouldn't break how >> > > you're using it in systemd for /proc/sys/kernel/hostname: >> > > >> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832 >> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465 >> > > >> > > I couldn't find anybody else actually polling on it. Interestingly, it >> > > looks like sd_event_add_io uses epoll() inside, but you're not hitting >> > > the bug that Jann pointed out (because I suppose you're not poll()ing on >> > > an epoll fd). >> > >> > Well, if you made sure this still works, I am fine either way ;-) >> >> Actually... ugh. It doesn't work. systemd uses uname() to read the host >> name, and doesn't actually read() the file descriptor after receiving >> the poll event on it. So I guess I'll forget this, and maybe we'll have >> to live with sysctl's poll() being broken. :( We should be able to modify calling uname() to act the same as reading the file descriptor. > A kconfig option may let you do what you want, and allow older kernels > to not break, however I am more curious how sysctl's approach to poll > went unnnoticed for so long. But also, I'm curious if it was based on > another poll implementation which may have been busted. > > But more importantly, how do we avoid this in the future? Poll on files is weird and generally doesn't work (because files are always read to read or write). What did we do to make it work on these sysctl files? Eric
Hi Eric, On 5/12/22, Eric W. Biederman <ebiederm@xmission.com> wrote: > Luis Chamberlain <mcgrof@kernel.org> writes: > >> On Tue, May 03, 2022 at 01:27:44PM +0200, Jason A. Donenfeld wrote: >>> On Mon, May 02, 2022 at 05:43:21PM +0200, Lennart Poettering wrote: >>> > On Mo, 02.05.22 17:30, Jason A. Donenfeld (Jason@zx2c4.com) wrote: >>> > >>> > > Just wanted to double check with you that this change wouldn't break >>> > > how >>> > > you're using it in systemd for /proc/sys/kernel/hostname: >>> > > >>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/journal/journald-server.c#L1832 >>> > > https://github.com/systemd/systemd/blob/39cd62c30c2e6bb5ec13ebc1ecf0d37ed015b1b8/src/resolve/resolved-manager.c#L465 >>> > > >>> > > I couldn't find anybody else actually polling on it. Interestingly, >>> > > it >>> > > looks like sd_event_add_io uses epoll() inside, but you're not >>> > > hitting >>> > > the bug that Jann pointed out (because I suppose you're not poll()ing >>> > > on >>> > > an epoll fd). >>> > >>> > Well, if you made sure this still works, I am fine either way ;-) >>> >>> Actually... ugh. It doesn't work. systemd uses uname() to read the host >>> name, and doesn't actually read() the file descriptor after receiving >>> the poll event on it. So I guess I'll forget this, and maybe we'll have >>> to live with sysctl's poll() being broken. :( > > We should be able to modify calling uname() to act the same as reading > the file descriptor. How? That sounds like madness. read() takes a fd. uname() doesn't. Are you proposing we walk through the fds of the process calling uname() til we find a matching one and then twiddle it's private context state? I mean I guess that'd work, but... Jason
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 7d9cfc730bd4..1aa145794207 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -622,6 +622,14 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter, static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter) { + struct inode *inode = file_inode(iocb->ki_filp); + struct ctl_table_header *head = grab_header(inode); + struct ctl_table *table = PROC_I(inode)->sysctl_entry; + + if (!IS_ERR(head) && table->poll) + iocb->ki_filp->private_data = proc_sys_poll_event(table->poll); + sysctl_head_finish(head); + return proc_sys_call_handler(iocb, iter, 0); } @@ -668,10 +676,8 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait) event = (unsigned long)filp->private_data; poll_wait(filp, &table->poll->wait, wait); - if (event != atomic_read(&table->poll->event)) { - filp->private_data = proc_sys_poll_event(table->poll); + if (event != atomic_read(&table->poll->event)) ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI; - } out: sysctl_head_finish(head);
Events that poll() responds to are supposed to be consumed when the file is read(), not by the poll() itself. By putting it on the poll() itself, it makes it impossible to poll() on a epoll file descriptor, since the event gets consumed too early. Jann wrote a PoC, available in the link below. Reported-by: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/lkml/CAG48ez1F0P7Wnp=PGhiUej=u=8CSF6gpD9J=Oxxg0buFRqV1tA@mail.gmail.com/ Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- fs/proc/proc_sysctl.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)