Message ID | 20231212231706.2680890-1-jeffxu@chromium.org |
---|---|
Headers | show |
Series | Introduce mseal() | expand |
On Tue, 12 Dec 2023 at 15:17, <jeffxu@chromium.org> wrote: > + > +**types**: bit mask to specify the sealing types, they are: I really want a real-life use-case for more than one bit of "don't modify". IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"? Or when would you *ever* say "seal this area, but mprotect()" is ok. IOW, I want to know why we don't just do the BSD immutable thing, and why we need this multi-level sealing thing. Linus
On Tue, Dec 12, 2023 at 11:16:55PM +0000, jeffxu@chromium.org wrote: > +config MSEAL > + default n Minor nit, "n" is always the default, no need to call it out here. > + bool "Enable mseal() system call" > + depends on MMU > + help > + Enable the virtual memory sealing. > + This feature allows sealing each virtual memory area separately with > + multiple sealing types. You might want to include more documentation as to what this is for, otherwise distros / users will not know if they need to enable this or not. thanks, greg k-h
On Tue, Dec 12, 2023 at 4:39 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, 12 Dec 2023 at 15:17, <jeffxu@chromium.org> wrote: > > + > > +**types**: bit mask to specify the sealing types, they are: > > I really want a real-life use-case for more than one bit of "don't modify". > For the real-life use case question, Stephen Röttger and I put description in the cover letter as well as the open discussion section (mseal() vs immutable()) of patch 0/11. Perhaps you are looking for more details in chrome usage of the API, e.g. code-wise ? > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"? > The MADV_DONTNEED is OK for file-backed mapping. As state in man page of madvise: [1] "subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file" > Or when would you *ever* say "seal this area, but mprotect()" is ok. > The fact that openBSD allows RW=>RO transaction, as in its man page [2] " At present, mprotect(2) may reduce permissions on immutable pages marked PROT_READ | PROT_WRITE to the less permissive PROT_READ." suggests application might desire multiple ways to seal the "PROT" bits. E.g. Applications that wants a full lockdown of PROT and PKEY might use SEAL_PROT_PKEY (Chrome case and implemented in this patch) Application that desires RW=>RO transaction, might implement SEAL_PROT_DOWNGRADEABLE, or specifically allow RW=>RO. (not implemented but can be added in future as extension if needed.) > IOW, I want to know why we don't just do the BSD immutable thing, and > why we need this multi-level sealing thing. > The details are discussed in mseal() vs immutable()) of the cover letter (patch 0/11) In short, BSD's immutable is designed specific for libc case, and Chrome case is just different (e.g. the lifetime of those mappings and requirement of free/discard unused memory). Single bit vs multi-bits are still up for discussion. If there are strong opinions on the multiple-bits approach, (and no objection on applying MM_SEAL_DISCARD_RO_ANON to the .text part during libc dynamic loading, which has no effect anyway because it is file backed.), we could combine all three bits into one. A side note is that we could not add something such as SEAL_PROT_DOWNGRADEABLE later, since pkey_mprotect is sealed. I'm open to one bit approach. If we took that approach, We might consider the following: mseal() or mseal(flags), flags are reserved for future use. I appreciate a direction on this. [1] https://man7.org/linux/man-pages/man2/madvise.2.html [2] https://man.openbsd.org/mimmutable.2 -Jeff > Linus
Jeff Xu <jeffxu@google.com> wrote: > > Or when would you *ever* say "seal this area, but mprotect()" is ok. > > > The fact that openBSD allows RW=>RO transaction, as in its man page [2] > > " At present, mprotect(2) may reduce permissions on immutable pages > marked PROT_READ | PROT_WRITE to the less permissive PROT_READ." Let me explain this. We encountered two places that needed this less-permission-transition. Both of these problems were found in either .data or bss, which the kernel makes immutable by default. The OpenBSD kernel makes those regions immutable BY DEFAULT, and there is no way to turn that off. One was in our libc malloc, which after initialization, wants to protect a control data structure from being written in the future. The other was in chrome v8, for the v8_flags variable, this is similarily mprotected to less permission after initialization to avoid tampering (because it's an amazing relative-address located control gadget). We introduced a different mechanism to solve these problem. So we added a new ELF section which annotates objects you need to be MUTABLE. If these are .data or .bss, they are placed in the MUTABLE region annotated with the following Program Header: Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align OPENBSD_MUTABLE 0x0e9000 0x00000000000ec000 0x00000000000ec000 0x001000 0x001000 RW 0x1000 associated with this Section Header [20] .openbsd.mutable PROGBITS 00000000000ec000 0e9000 001000 00 WA 0 0 4096 (It is vaguely similar to RELRO). You can place objects there using the a compiler __attribute__((section declaration, like this example from our libc/malloc.c code static union { struct malloc_readonly mopts; u_char _pad[MALLOC_PAGESIZE]; } malloc_readonly __attribute__((aligned(MALLOC_PAGESIZE))) __attribute__((section(".openbsd.mutable"))); During startup the code can set the protection and then the immutability of the object correctly. Since we have no purpose left for this permission reduction semantic upon immutable mappings, we may be deleting that behaviour in the future. I wrote that code, because I needed it to make progress with some difficult pieces of code. But we found a better way.
On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote: > > > > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"? > > > The MADV_DONTNEED is OK for file-backed mapping. Right. It makes no semantic difference. So there's no point to it. My point was that you added this magic flag for "not ok for RO anon mapping". It's such a *completely* random flag, that I go "that's just crazy random - make sealing _always_ disallow that case". So what I object to in this series is basically random small details that should just eb part of the basic act of sealing. I think sealing should just mean "you can't do any operations that have semantic meaning for the mapping, because it is SEALED". So I think sealing should automatically mean "can't do MADV_DONTNEED on anon memory", because that's basically equivalent to a munmap/remap operation. I also think that sealing should just automatically mean "can't do mprotect any more". And yes, the OpenBSD semantics of "immutable" apparently allowed reducing permissions, but even the openbsd man-page seems to think that was a bug, so we should just not allow it. And the openbsd case seems to be because of how they made certain things immutable by default, which is different from what this mseal() thing is. End result: I'd really like to make the thing conceptually simpler, rather than add all those random (*very* random in case of MADV_DONTNEED) special cases. Is there any actual practical example of why you'd want a half-sealed thing? And no, I didn't read the pdf that was attached. If it can't just be explained in plain language, it's not an explanation. I'd love for "sealed" to be just a single bit in the vm_flags things that we already have. Not a config option. Not some complicated thing that is hard to explain. A simple "I have set up this mapping, you can't change it any more". And if it cannot be that kind of thing, I want to have clear and obvious examples of why it can't be that simple thing. Not a pdf file that describes some google-chrome design. Something down-to-earth and practical (and not a "we might want this in the future" thing either). IOW, what is wrong with "THIS VMA SETUP CANNOT BE CHANGED ANY MORE"? Nothing less, but also nothing more. No random odd bits that need explaining. Linus
Jeff Xu <jeffxu@google.com> wrote: > In short, BSD's immutable is designed specific for libc case, and Chrome > case is just different (e.g. the lifetime of those mappings and requirement of > free/discard unused memory). That is not true. During the mimmutable design I took the entire software ecosystem into consideration. Not just libc. That is either uncharitable or uninformed. In OpenBSD, pretty much the only thing which calls mimmutable() is the shared library linker, which does so on all possible regions of all DSO objects, not just libc. For example, chrome loads 96 libraries, and all their text/data/bss/etc are immutable. All the static address space is immutable. It's the same for all other programs running in OpenBSD -- only transient heap and mmap spaces remain permission mutable. It is not just libc. What you are trying to do here with chrome is bring some sort of soft-immutable management to regions of memory, so that trusted parts of chrome can still change the permissions, but untrusted / gadgetry parts of chrome cannot change the permissions. That's a very different thing than what I set out to do with mimmutable(). I'm not aware of any other piece of software that needs this. I still can't wrap my head around the assurance model of the design. Maybe it is time to stop comparing mseal() to mimmutable(). Also, maybe this proposal should be using the name chromesyscall() instead -- then it could be extended indefinitely in the future...
On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote: > > > > > > > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"? > > > > > The MADV_DONTNEED is OK for file-backed mapping. > > Right. It makes no semantic difference. So there's no point to it. > > My point was that you added this magic flag for "not ok for RO anon mapping". > > It's such a *completely* random flag, that I go "that's just crazy > random - make sealing _always_ disallow that case". > > So what I object to in this series is basically random small details > that should just eb part of the basic act of sealing. > > I think sealing should just mean "you can't do any operations that > have semantic meaning for the mapping, because it is SEALED". > > So I think sealing should automatically mean "can't do MADV_DONTNEED > on anon memory", because that's basically equivalent to a munmap/remap > operation. In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory. We have a pkey-tagged heap and code region for JIT code. The regions are writable by page permissions, but we use the pkey to control write access. These regions are mmapped at process startup and we want to seal them to ensure that the pkey and page permissions can't change. Since these regions are used for dynamic allocations, we still need a way to release unneeded resources, i.e. madvise(DONTNEED) unused pages on free(). AIUI, the madvise(DONTNEED) should effectively only change the content of anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we added this special case: if you want to madvise(DONTNEED) an anonymous page, you should have write permissions to the page. In our allocator, on free we can then release resources via: * allow pkey writes * madvise(DONTNEED) * disallow pkey writes
On Thu, Dec 14, 2023 at 6:07 PM Stephen Röttger <sroettger@google.com> wrote: > > On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote: > > > > > > > > > > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"? > > > > > > > The MADV_DONTNEED is OK for file-backed mapping. > > > > Right. It makes no semantic difference. So there's no point to it. > > > > My point was that you added this magic flag for "not ok for RO anon mapping". > > > > It's such a *completely* random flag, that I go "that's just crazy > > random - make sealing _always_ disallow that case". > > > > So what I object to in this series is basically random small details > > that should just eb part of the basic act of sealing. > > > > I think sealing should just mean "you can't do any operations that > > have semantic meaning for the mapping, because it is SEALED". > > > > So I think sealing should automatically mean "can't do MADV_DONTNEED > > on anon memory", because that's basically equivalent to a munmap/remap > > operation. > > In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory. I don't want to be that guy (*believe me*), but what if there was a way to attach BPF programs to mm's? Such that you could handle 'seal failures' in BPF, and thus allow for this sort of weird semantics? e.g: madvise(MADV_DONTNEED) on a sealed region fails, kernel invokes the BPF program (that chrome loaded), BPF program sees it was a MADV_DONTNEED and allows it to proceed. It requires BPF but sounds like a good compromise in order to not get an ugly API?
On Thu, 14 Dec 2023 at 10:07, Stephen Röttger <sroettger@google.com> wrote: > > AIUI, the madvise(DONTNEED) should effectively only change the content of > anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we > added this special case: if you want to madvise(DONTNEED) an anonymous page, > you should have write permissions to the page. Hmm. I actually would be happier if we just made that change in general. Maybe even without sealing, but I agree that it *definitely* makes sense in general as a sealing thing. IOW, just saying "madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed" makes 100% sense to me. Having a separate _flag_ to give sensible semantics is just odd. IOW, what I really want is exactly that "sensible semantics, not random flags". Particularly for new system calls with fairly specialized use, I think it's very important that the semantics are sensible on a conceptual level, and that we do not add system calls that are based on "random implementation issue of the day". Yes, yes, then as we have to maintain things long-term, and we hit some compatibility issue, at *THAT* point we'll end up facing nasty "we had an implementation that had these semantics in practice, so now we're stuck with it", but when introducing a new system call, let's try really hard to start off from those kinds of random things. Wouldn't it be lovely if we can just come up with a sane set of "this is what it means to seal a vma", and enumerate those, and make those sane conceptual rules be the initial definition. By all means have a "flags" argument for future cases when we figure out there was something wrong or the notion needed to be extended, but if we already *start* with random extensions, I feel there's something wrong with the whole concept. So I would really wish for the first version of mseal(start, len, flags); to have "flags=0" be the one and only case we actually handle initially, and only add a single PROT_SEAL flag to mmap() that says "create this mapping already pre-sealed". Strive very hard to make sealing be a single VM_SEALED flag in the vma->vm_flags that we already have, just admit that none of this matters on 32-bit architectures, so that VM_SEALED can just use one of the high flags that we have several free of (and that pkeys already depends on), and make this a standard feature with no #ifdef's. Can chrome live with that? And what would the required semantics be? I'll start the list: - you can't unmap or remap in any way (including over-mapping) - you can't change protections (but with architecture support like pkey, you can obviously change the protections indirectly with PKRU etc) - you can't do VM operations that change data without the area being writable (so the DONTNEED case - maybe there are others) - anything else? Wouldn't it be lovely to have just a single notion of sealing that is well-documented and makes sense, and doesn't require people to worry about odd special cases? And yes, we'd have the 'flags' argument for future special cases, and hope really hard that it's never needed. Linus
On Thu, Dec 14, 2023 at 12:14 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 14 Dec 2023 at 10:07, Stephen Röttger <sroettger@google.com> wrote: > > > > AIUI, the madvise(DONTNEED) should effectively only change the content of > > anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we > > added this special case: if you want to madvise(DONTNEED) an anonymous page, > > you should have write permissions to the page. > > Hmm. I actually would be happier if we just made that change in > general. Maybe even without sealing, but I agree that it *definitely* > makes sense in general as a sealing thing. > > IOW, just saying > > "madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed" > > makes 100% sense to me. Having a separate _flag_ to give sensible > semantics is just odd. > > IOW, what I really want is exactly that "sensible semantics, not random flags". > > Particularly for new system calls with fairly specialized use, I think > it's very important that the semantics are sensible on a conceptual > level, and that we do not add system calls that are based on "random > implementation issue of the day". > > Yes, yes, then as we have to maintain things long-term, and we hit > some compatibility issue, at *THAT* point we'll end up facing nasty > "we had an implementation that had these semantics in practice, so now > we're stuck with it", but when introducing a new system call, let's > try really hard to start off from those kinds of random things. > > Wouldn't it be lovely if we can just come up with a sane set of "this > is what it means to seal a vma", and enumerate those, and make those > sane conceptual rules be the initial definition. By all means have a > "flags" argument for future cases when we figure out there was > something wrong or the notion needed to be extended, but if we already > *start* with random extensions, I feel there's something wrong with > the whole concept. > > So I would really wish for the first version of > > mseal(start, len, flags); > > to have "flags=0" be the one and only case we actually handle > initially, and only add a single PROT_SEAL flag to mmap() that says > "create this mapping already pre-sealed". > > Strive very hard to make sealing be a single VM_SEALED flag in the > vma->vm_flags that we already have, just admit that none of this > matters on 32-bit architectures, so that VM_SEALED can just use one of > the high flags that we have several free of (and that pkeys already > depends on), and make this a standard feature with no #ifdef's. > > Can chrome live with that? And what would the required semantics be? > I'll start the list: > > - you can't unmap or remap in any way (including over-mapping) > > - you can't change protections (but with architecture support like > pkey, you can obviously change the protections indirectly with PKRU > etc) > > - you can't do VM operations that change data without the area being > writable (so the DONTNEED case - maybe there are others) > > - anything else? > > Wouldn't it be lovely to have just a single notion of sealing that is > well-documented and makes sense, and doesn't require people to worry > about odd special cases? > > And yes, we'd have the 'flags' argument for future special cases, and > hope really hard that it's never needed. > Yes, those inputs make a lot of sense ! I will start the next version. In the meantime, if there are more comments, please continue to post, I appreciate those to make the API better and simple to use. -Jeff > Linus >
Some notes about compatibility between mimmutable(2) and mseal(). This morning, the "RW -> R demotion" code in mimmutable(2) was removed. As described previously, that was a development crutch to solved a problem but we found a better way with a new ELF section which is available at compile time with __attribute__((section(".openbsd.mutable"))). Which works great. I am syncronizing the madvise / msync behaviour further, we will be compatible. I have worried about madvise / msync for a long time, and audited vast amounts of the software ecosystem to come to a conclusion we can be more strict, but I never acted upon it. BTW, on OpenBSD and probably other related BSD operating systems, MADV_DONTNEED is non-destructive. However we have a destructive operation called MADV_FREE. msync() MS_INVALIDATE is also destructive. But all of these operations will now be prohibited, to syncronize the error return value situation. There is an one large difference remainig between mimmutable() and mseal(), which is how other system calls behave. We return EPERM for failures in all the system calls that fail upon immutable memory (since Oct 2022). You are returning EACESS. Before it is too late, do you want to reconsider that return value, or do you have a justification for the choice? I think this remains the blocker which would prevent software from doing #define mimmutable(addr, len) mseal(addr, len, 0)
On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote: > > There is an one large difference remainig between mimmutable() and mseal(), > which is how other system calls behave. > > We return EPERM for failures in all the system calls that fail upon > immutable memory (since Oct 2022). > > You are returning EACESS. > > Before it is too late, do you want to reconsider that return value, or > do you have a justification for the choice? I don't think there's any real reason for the difference. Jeff - mind changing the EACESS to EPERM, and we'll have something that is more-or-less compatible between Linux and OpenBSD? Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote: > > > > There is an one large difference remainig between mimmutable() and mseal(), > > which is how other system calls behave. > > > > We return EPERM for failures in all the system calls that fail upon > > immutable memory (since Oct 2022). > > > > You are returning EACESS. > > > > Before it is too late, do you want to reconsider that return value, or > > do you have a justification for the choice? > > I don't think there's any real reason for the difference. > > Jeff - mind changing the EACESS to EPERM, and we'll have something > that is more-or-less compatible between Linux and OpenBSD? (I tried to remember why I chose EPERM, replaying the view from the German castle during kernel compiles...) In mmap, EACCESS already means something. [EACCES] The flag PROT_READ was specified as part of the prot parameter and fd was not open for reading. The flags MAP_SHARED and PROT_WRITE were specified as part of the flags and prot parameters and fd was not open for writing. In mprotect, the situation is similar [EACCES] The process does not have sufficient access to the underlying memory object to provide the requested protection. immutable isn't an aspect of the underlying object, but an aspect of the mapping. Anyways, it is common for one errno value to have multiple causes. But this error-aliasing can make it harder to figure things out when studying a "system call trace" a program, and I strongly believe in keeping systems are simple as possible. For all the memory mapping control operations, EPERM was available and unambiguous.
On Sat, Jan 20, 2024 at 8:40 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote: > > > > There is an one large difference remainig between mimmutable() and mseal(), > > which is how other system calls behave. > > > > We return EPERM for failures in all the system calls that fail upon > > immutable memory (since Oct 2022). > > > > You are returning EACESS. > > > > Before it is too late, do you want to reconsider that return value, or > > do you have a justification for the choice? > > I don't think there's any real reason for the difference. > > Jeff - mind changing the EACESS to EPERM, and we'll have something > that is more-or-less compatible between Linux and OpenBSD? > Sounds Good. I will make the necessary changes in the next version. -Jeff > Linus
From: Jeff Xu <jeffxu@chromium.org> This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. The new mseal() is an architecture independent syscall, and with following signature: mseal(void addr, size_t len, unsigned long types, unsigned long flags) addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to documentation patch (mseal.rst) of this patch set. Those are also fully covered by the selftest. types: bit mask to specify the sealing types. MM_SEAL_BASE MM_SEAL_PROT_PKEY MM_SEAL_DISCARD_RO_ANON MM_SEAL_SEAL The MM_SEAL_BASE: The base package includes the features common to all VMA sealing types. It prevents sealed VMAs from: 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Move or expand a different vma into the current location, via mremap(). 3> Modifying sealed VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. We consider the MM_SEAL_BASE feature, on which other sealing features will depend. For instance, it probably does not make sense to seal PROT_PKEY without sealing the BASE, and the kernel will implicitly add SEAL_BASE for SEAL_PROT_PKEY. The MM_SEAL_PROT_PKEY: Seal PROT and PKEY of the address range, i.e. mprotect() and pkey_mprotect() will be denied if the memory is sealed with MM_SEAL_PROT_PKEY. The MM_SEAL_DISCARD_RO_ANON Certain types of madvise() operations are destructive [6], such as MADV_DONTNEED, which can effectively alter region contents by discarding pages, especially when memory is anonymous. This blocks such operations for anonymous memory which is not writable to the user. The MM_SEAL_SEAL MM_SEAL_SEAL denies adding a new seal for an VMA. This is similar to F_SEAL_SEAL in fcntl. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. -------------------------------------------------------------------- Change history: =============== V3: - Abandon per-syscall approach, (Suggested by Linus Torvalds). - Organize sealing types around their functionality, such as MM_SEAL_BASE, MM_SEAL_PROT_PKEY. - Extend the scope of sealing from calls originated in userspace to both kernel and userspace. (Suggested by Linus Torvalds) - Add seal type support in mmap(). (Suggested by Pedro Falcato) - Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent destructive operations of madvise. (Suggested by Jann Horn and Stephen Röttger) - Make sealed VMAs mergeable. (Suggested by Jann Horn) - Add MAP_SEALABLE to mmap() (Detail see new discussions) - Add documentation - mseal.rst Work in progress: ================= - update man page for mseal() and mmap() Open discussions: ================= Several open discussions from V1/V2 were not incorporated into V3. I believe these are worth more discussion for future versions of this patch series. 1> mseal() vs mimmutable() mseal(bitmasks for multiple seal types) BASE + PROT_PKEY+ MM_SEAL_DISCARD_RO_ANON Apply PROT_PKEY implies BASE, same for DISCARD_RO_ANON mimmutable() (openBSD) This is equal to SEAL_BASE + SEAL_PROT_PKEY in mseal() Plus allowing downgrade from W=>NW (OpenBSD) Doesn’t have MM_SEAL_DISCARD_RO_ANON mimmutable() is designed for memory sealing in libc, and mseal() is designed for both Chrome browser and libc. For the two memory ranges that Chrome browser wants to seal, as discussed previously, the allocator still needs to free or discard memory for the sealed memory. For performance reasons, we have explored two solutions in the past: first, using PKEY-based munmap() [7], and second, separating SEAL_MPROTECT (v1 of this patch set). Recently, we have experimented with an alternative approach that allows us to remove the separation of SEAL_MPROTECT. For those two memory ranges, Chrome browser will use BASE + PROT_PKEY + DISCARD_RO_ANON for all its sealing needs. In the case of libc, the .text segment can be sealed with the BASE and PROT_PKEY, and the RO data segments can be sealed with the BASE + PROT_PKEY + DISCARD_RO_ANON.