[v8,0/4] Introduce mseal

Message ID: 20240131175027.3287009-1-jeffxu@chromium.org

Jeff Xu Jan. 31, 2024, 5:50 p.m. UTC
From: Jeff Xu <jeffxu@chromium.org>

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.

Two system calls are involved in sealing the map:  mmap() and mseal().

The new mseal() syscall is supported on 64-bit CPUs only, with the
following signature:

int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks the following operations for the given memory range; a
short usage sketch follows the list.

1> Unmapping, moving to another location, and shrinking the size,
   via munmap() and mremap(): these can leave an empty space that can
   then be filled by a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location,
   via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(). Expansion does not appear to pose any
   specific risks to sealed VMAs, but is blocked anyway because the use
   case is unclear. In any case, users can rely on merging to expand a
   sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
   memory, when users don't have write permission to the memory. Those
   behaviors can alter region contents by discarding pages, effectively a
   memset(0) for anonymous memory.
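
For illustration, here is a minimal user-space sketch of the intended
flow. The __NR_mseal and MAP_SEALABLE fallback values below are
placeholders (the real values come from the patched uapi headers), and
glibc has no mseal() wrapper, so the sketch goes through syscall(2):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462           /* placeholder; use the patched headers */
#endif
#ifndef MAP_SEALABLE
#define MAP_SEALABLE 0x8000000   /* placeholder; use the patched headers */
#endif

static int mseal(void *addr, size_t len, unsigned long flags)
{
	return syscall(__NR_mseal, addr, len, flags);
}

int main(void)
{
	size_t len = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	if (mseal(p, len, 0))        /* flags is reserved and must be 0 */
		perror("mseal");

	/* The blocked operations listed above now fail with EPERM: */
	if (mprotect(p, len, PROT_READ))
		perror("mprotect");  /* EPERM: VMA is sealed */
	if (munmap(p, len))
		perror("munmap");    /* EPERM: VMA is sealed */
	return 0;
}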

In addition, mmap() has two related changes.

The PROT_SEAL bit in the prot field of mmap(). When present, it marks
the mapping as sealed from creation.

The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.

Applications that don't care about sealing will see their behavior
unchanged. Those that need sealing support opt in by adding
MAP_SEALABLE to mmap().
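
Continuing the sketch above (PROT_SEAL's value is likewise a
placeholder to be taken from the patched headers), the two opt-in
paths look like:

#ifndef PROT_SEAL
#define PROT_SEAL 0x04000000     /* placeholder; use the patched headers */
#endif

/* Sealed from creation: PROT_SEAL in the prot field. */
void *ro = mmap(NULL, len, PROT_READ | PROT_SEAL,
		MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

/* Sealable now, sealed later: MAP_SEALABLE in the flags field. */
void *rw = mmap(NULL, len, PROT_READ | PROT_WRITE,
		MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
/* ... set the mapping up ... */
mseal(rw, len, 0);

/* Without MAP_SEALABLE, mseal() is refused: */
void *plain = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
mseal(plain, len, 0);            /* fails: mapping is not sealable */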

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.

Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(MADV_DONTNEED).

However, always allowing madvise(MADV_DONTNEED) on this range poses a
security risk. For example, if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
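
A sketch of that flow with pkeys (reusing the hypothetical wrapper and
placeholder flags above; pkey_alloc(), pkey_mprotect() and pkey_set()
are the existing glibc interfaces):

static void sealed_pool_discard(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	size_t len = 64 * pg;

	/* Map the pool sealable; deny writes through a pkey. */
	int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);
	void *pool = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);

	pkey_mprotect(pool, len, PROT_READ | PROT_WRITE, pkey);
	mseal(pool, len, 0);    /* the VMA attributes are now frozen ... */

	/* ... but the allocator can still discard unused pages, provided
	 * the calling thread holds pkey write permission at that moment. */
	pkey_set(pkey, 0);                    /* grant write access */
	madvise(pool, pg, MADV_DONTNEED);     /* permitted */
	pkey_set(pkey, PKEY_DISABLE_WRITE);   /* revoke it again */
}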

Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.

In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:

Jann Horn: raising awareness and providing valuable insights on the
  destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing support in mmap().
Theo de Raadt: sharing the experiences and insight gained from
  implementing mimmutable() in OpenBSD.

Change history:
===============
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address) 
- Update mseal.rst to add note for MAP_SEALABLE.

V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.org/T/

V6:
- Drop RFC from subject, given Linus's general approval.
- Adjust syscall number for mseal (mainline, Jan. 11, 2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/

V5:
- fix build issue in mseal-Wire-up-mseal-syscall
  (Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r

V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/

V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
  MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
  both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
  destructive operations of madvise. (Suggested by Jann Horn and
  Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.org/

v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/

v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/

----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/

Jeff Xu (4):
  mseal: Wire up mseal syscall
  mseal: add mseal syscall
  selftest mm/mseal memory sealing
  mseal: add documentation

 Documentation/userspace-api/index.rst       |    1 +
 Documentation/userspace-api/mseal.rst       |  215 ++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 include/linux/syscalls.h                    |    1 +
 include/uapi/asm-generic/mman-common.h      |    8 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 kernel/sys_ni.c                             |    1 +
 mm/Makefile                                 |    4 +
 mm/internal.h                               |   48 +
 mm/madvise.c                                |   12 +
 mm/mmap.c                                   |   35 +-
 mm/mprotect.c                               |   10 +
 mm/mremap.c                                 |   31 +
 mm/mseal.c                                  |  343 ++++
 tools/testing/selftests/mm/.gitignore       |    1 +
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 2024 +++++++++++++++++++
 33 files changed, 2756 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/userspace-api/mseal.rst
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

Comments

Theo de Raadt Feb. 1, 2024, 1:46 a.m. UTC | #1
Jeff Xu <jeffxu@chromium.org> wrote:

> I considered Theo's inputs from OpenBSD's perspective regarding the
> difference, and I wasn't convinced that Linux should remove these. In
> my view, those are two different kernels' code bases, and the difference
> in Linux was not added without reason (for MAP_SEALABLE, there is a note
> in the documentation section with details).

That note is describing a fiction.

> I would love to hear more from Linux developers on this.

I'm not sure you are capable of listening.

But I'll repeat for others to stop this train wreck:


1. When execve() maps a program's .data section, does the kernel set
   MAP_SEALABLE on that region?  Or does it not set MAP_SEALABLE?

   Does the kernel seal the .data section?  It cannot, because of RELRO
   and IFUNCS.  Do you know what those are?  (like in OpenBSD) the kernel
   cannot and will *not* seal the .data section, it lets later code do that.

2. When execve() maps a program's .bss section, does the kernel set
   MAP_SEALABLE on that region?  Or does it not set MAP_SEALABLE?

   Does the kernel seal the .bss section?  It cannot, because of RELRO
   and IFUNCS.  Do you know what those are?  (like in OpenBSD) the kernel
   cannot and will *not* seal the .bss section, it lets later code do that.

In the proposed diff, the kernel does not set MAP_SEALABLE on those
regions.

How does a userland program seal the .data and .bss regions?

It cannot.  It is too late to set MAP_SEALABLE, because the kernel
already decided not to do it.

So those regions cannot be sealed.

3. When execve() maps a program's stack, does the kernel set
   MAP_SEALABLE on that region?  Or does it not set MAP_SEALABLE?

In the proposed diff, the kernel does not set MAP_SEALABLE.

You think you can seal the stack in the kernel??  Sorry to be the bearer
of bad news, but glibc has code which on occasion will mprotect the
stack executable.

But if userland decides that mprotect case won't occur -- how does a
userland program seal its stack?  It is now too late to set MAP_SEALABLE.

So the stack must remain unsealed.

4. What about the text segment?

5. Do you know what a text-relocation is?  They are now rare, but there
   are still compile/linker stages which will produce them, and there is
   software which requires that to work.  It means userland fixes it's
   own .text, then calls mprotect.  The kernel does not know if this will
   happen.

6. When execve() maps the .text segment, will it set MAP_SEALABLE?

If it doesn't set it, userland cannot seal its text after it makes the
decision to do so.


You can continue to extrapolate those same points for all other segments
of a static binary, all segments of a dynamic binary, all segments of the
shared library linker.

And then you can go further, and recognize the logic that will be needed
in the shared library linker to *make the same decisions*.

In each case, the *decision* to make a mapping happens in one piece of
code, and the decision to use and NOW SEAL THAT MAPPING, happens in a
different piece of code.


The only answer to these problems will be to always set MAP_SEALABLE.
You will have to go through the entire Linux ecosystem and change every
call to mmap() to use this new MAP_SEALABLE flag, and it will look
something like this:

+#ifndef MAP_SEALABLE
+#define MAP_SEALABLE 0
+#endif
-	ptr = mmap(...., MAP...
+	ptr = mmap(...., MAP_SEALABLE | MAP...

Every single one of them, and you'll need to do it in the kernel.




If you had spent a second trying to make this work in a second piece of
software, you would have realized that the ONLY way this could work
is by adding a flag with the opposite meaning:

   MAP_NOTSEALABLE

But nothing will use that.  I promise you.


> I would love to hear more from Linux developers on this.

I'm not sure you are capable of listening.
Bird, Tim Feb. 1, 2024, 4:56 p.m. UTC | #2
> -----Original Message-----
> From: Theo de Raadt <deraadt@openbsd.org>
> > I would love to hear more from Linux developers on this.
> 
> I'm not sure you are capable of listening.
> 

Theo,

It is possible to make your technical points, and even to express frustration that it has
been difficult to get them across, without resorting to personal attacks.

 -- Tim
Theo de Raadt Feb. 1, 2024, 10:54 p.m. UTC | #3
> To me, this means Linus is OK with the general signatures of the APIs.


Linus, you are in for a shock when the proposal doesn't work for glibc
and all the applications!
Linus Torvalds Feb. 1, 2024, 11:15 p.m. UTC | #4
On Thu, 1 Feb 2024 at 14:54, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Linus, you are in for a shock when the proposal doesn't work for glibc
> and all the applications!

Heh. I've enjoyed seeing your argumentative style that made you so
famous back in the days. Maybe it's always been there, but I haven't
seen the BSD people in so long that I'd forgotten all about it.

That said, famously argumentative or not, I think Theo is right, and I
do think the MAP_SEALABLE bit is nonsensical.

If somebody wants to mseal() a memory region, why would they need to
express that ahead of time?

So the part I think is sane is the mseal() system call itself, in that
it allows *potential* future expansion of the semantics.

But hopefully said future expansion isn't even needed, and all users
want the base experience, which is why I think PROT_SEAL (both to mmap
and to mprotect) makes sense as an alternative form.

So yes, to my mind

    mprotect(addr, len, PROT_READ);
    mseal(addr, len, 0);

should basically give identical results to

    mprotect(addr, len, PROT_READ | PROT_SEAL);

and using PROT_SEAL at mmap() time is similarly the same obvious
notion of "map this, and then seal that mapping".

The reason for having "mseal()" as a separate call at all from the
PROT_SEAL bit is that it does allow possible future expansion (while
PROT_SEAL is just a single bit, and it won't change semantics) but
also so that you can do whatever prep-work in stages if you want to,
and then just go "now we seal it all".

          Linus
Theo de Raadt Feb. 1, 2024, 11:43 p.m. UTC | #5
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So yes, to my mind
> 
>     mprotect(addr, len, PROT_READ);
>     mseal(addr, len, 0);
> 
> should basically give identical results to
> 
>     mprotect(addr, len, PROT_READ | PROT_SEAL);
> 
> and using PROT_SEAL at mmap() time is similarly the same obvious
> notion of "map this, and then seal that mapping".

I think that isn't easy to do.  Let's expand it to show error checking.

    if (mprotect(addr, len, PROT_READ) == -1)
       react to the errno value
    if (mseal(addr, len, 0) == -1)
       react to the errno value

and

    if (mprotect(addr, len, PROT_READ | PROT_SEAL) == -1)
       react to the errno value

For current mprotect(), the errno values are mostly related to range
issues with the parameters.

After sealing a region, mprotect() also has the new errno EPERM.

But what is the return value supposed to be from "PROT_READ | PROT_SEAL"
over various sub-region types?

Say I have a region 3 pages long.  One page is unmapped, one page is
regular, and one page is sealed.  Re-arrange those 3 pages in all 6
permutations.  Try them all.

Does the returned errno change, based upon the order?
Does it do part of the operation, or all of the operation?

If the sealed page is first, the regular page is second, and the unmapped
page is 3rd, does it return an error or return 0?  Does it change the
permission on the 3rd page?  If it returns an error, has it changed any
permissions?

I don't think the diff follows the principle of

if an error is returned --> we know nothing was changed.
if success is returned --> we know all the requests were satisfied
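
Here is a fragment probing one of those permutations (sealed, regular,
unmapped), using the hypothetical wrapper and placeholder flags from
the cover letter sketches:

long pg = sysconf(_SC_PAGESIZE);
char *base = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
		  MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);

mseal(base, pg, 0);              /* page 0: sealed   */
				 /* page 1: regular  */
munmap(base + 2 * pg, pg);       /* page 2: unmapped */

errno = 0;
int rc = mprotect(base, 3 * pg, PROT_READ);
printf("rc=%d errno=%d (%s)\n", rc, errno, strerror(errno));
/* Now read /proc/self/maps: did page 1 become read-only even though
 * the call returned an error?  Repeat for the other five orderings. */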

> The reason for having "mseal()" as a separate call at all from the
> PROT_SEAL bit is that it does allow possible future expansion (while
> PROT_SEAL is just a single bit, and it won't change semantics) but
> also so that you can do whatever prep-work in stages if you want to,
> and then just go "now we seal it all".




How about you add a basic mseal() that is maximally compatible with
mimmutable(), and then we can all talk about whether PROT_SEAL makes
sense once there are applications that demand it, and can prove they
need it?
Liam R. Howlett Feb. 2, 2024, 3:13 p.m. UTC | #6
* Jeff Xu <jeffxu@google.com> [240201 22:15]:
> On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> > > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> > > <Liam.Howlett@oracle.com> wrote:
> > > >
> >
> > Having to opt in to allowing mseal will probably not work well.
> I'm leaving the opt-in discussion in Linus's thread.
> 
> > Initial library mappings happen in one huge chunk then it's cut up into
> > smaller VMAs, at least that's what I see with my maple tree tracing.  If
> > you opt-in, then the entire library will have to opt-in and so the
> > 'discourage inadvertent sealing' argument is not very strong.
> >
> Regarding "The initial library mappings happen in one huge chunk then
> it is cut up into smaller VMAs", this is not a problem.
> 
> As an example, in ELF loading (fs/binfmt_elf.c) there are just a few
> places that choose what type of memory is allocated, e.g.
> MAP_PRIVATE, MAP_FIXED_NOREPLACE; we can add MAP_SEALABLE at those
> places.
> If glibc does additional splitting on the memory range, by using
> mprotect(), then MAP_SEALABLE is automatically applied after
> splitting.
> If glibc uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).

You are adding a flag that requires a new glibc.  When I try to point
out how this is unnecessary and excessive, you tell me it's fine and
probably not a whole lot of work.

This isn't working with developers, you are dismissing the developers
who are trying to help you.

Can you please:

Provide code that uses this feature.

Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and
32 VMAs.

Provide code that tests and checks the failure paths.  Failures at the
start, middle, and end of the modifications.

Document what happens in those failure paths.

And, most importantly: keep an open mind and allow your opinion to
change when presented with new information.

All of these things are to help you.  We need to know what needs fixing
so you can be successful.


Thanks,
Liam
Pedro Falcato Feb. 2, 2024, 6:51 p.m. UTC | #7
On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > > Even without free.
> > > I personally do not like the heap getting sealed like that.
> > >
> > > Component A.
> > > p=malloc(4096);
> > > writing something to p.
> > >
> > > Component B:
> > > mprotect(p,4096, RO)
> > > mseal(p,4096)
> > >
> > > This will split the heap VMA and prevent the heap from shrinking. If
> > > this is in a frequent code path, then it might hurt the process's
> > > memory usage.
> > >
> > > The existing code is more likely to use malloc than mmap(), so it is
> > > easier for dev to seal a piece of data belonging to another component.
> > > I hope this pattern is not wide-spreading.
> > >
> > > The ideal way would be just changing library A to use mmap.
> >
> > I think you are lacking some test programs to see how it actually
> > behaves; the effect is worse than you think, and the impact is immediately
> > visible to the programmer, and the lesson is clear:
> >
> >         you can only seal objects which you guarantee never get recycled.
> >
> >         Pushing a sealed object back into reuse is a disastrous bug.
> >
> >         No one should call this interface, unless they understand that.
> >
> > I'll say again, you don't have a test program for various allocators to
> > understand how it behaves.  The failure modes described in your documents
> > are not correct.
> >
> I understand what you mean: I will add that part to the document:
> Trying to recycle sealed memory is disastrous, e.g.
> p=malloc(4096);
> mprotect(p,4096,RO)
> mseal(p,4096)
> free(p);
>
> My point is:
> I think sealing an object from the heap is a bad pattern in general,
> even if the dev doesn't free it. That was one of the reasons for the
> sealable flag; I hope saying this isn't perceived as looking for excuses.

The point you're missing is that adding MAP_SEALABLE reduces
composability. With MAP_SEALABLE, everything that mmaps some part of
the address space that may ever be sealed will need to be modified to
know about MAP_SEALABLE.

Say you did the same thing for mprotect. MAP_PROTECT would control the
mprotectability of the map. You'd stop:

p = malloc(4096);
mprotect(p, 4096, PROT_READ);
free(p);

! But you'd need to change every spot that mmap()'s something to know
about and use MAP_PROTECT: all "producers" of mmap memory would need
to know about the consumers doing mprotect(). So now either all mmap()
callers mindlessly add MAP_PROTECT out of fear the consumers do
mprotect (and you gain nothing from MAP_PROTECT), or the mmap()
callers need to know the consumers call mprotect(), and thus you
introduce a huge layering violation (and you actually lose from having
MAP_PROTECT).

Hopefully you can map the above to MAP_SEALABLE. Or to any other m*()
operation. For example, if chrome runs on an older glibc that does not
know about MAP_SEALABLE, it will not be able to mseal() its own shared
libraries' .text (even if, yes, that should ideally be left to ld.so).

IMO, UNIX API design has historically mostly been "play stupid games,
win stupid prizes", which is e.g: why things like close(STDOUT_FILENO)
work. If you close stdout (and don't dup/reopen something to stdout)
and printf(), things will break, and you get to keep both pieces.
There's no O_CLOSEABLE, just as there's no O_DUPABLE.
Theo de Raadt Feb. 2, 2024, 7:32 p.m. UTC | #8
> What I'm more concerned about is what happens if you call mseal() on a
> range and it can only mseal a portion.  Like, what happens to the first vma
> in your test_seal_unmapped_middle case?  I see it returns an error, but
> is the first VMA mseal()'ed? (no it's not, but test that)

That is correct, Liam.

Unix system calls must be atomic.

They either return an error, and that is a promise they made no changes.

Or they do the work required, and then return success.

In OpenBSD, all mimmutable() aspects were carefully studied to guarantee
this behaviour.

I am not an expert in the Linux kernel to make the assessment; someone
who is qualified must make that assessment.  Fuzzing with tests is a
good, simple way to judge it.
Jeff Xu Feb. 2, 2024, 8:57 p.m. UTC | #9
On Fri, Feb 2, 2024 at 12:37 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Unix system calls must be atomic.
> >
> > They either return an error, and that is a promise they made no changes.
>
> That's actually not true, and never has been.
>
> It's a good thing to aim for, but several errors means "some or all
> may have been done".
>
> EFAULT (for various system calls), ENOMEM and other errors are all
> things that can happen after some of the system call has already been
> done, and the rest failed.
>
> There are lots of examples, but to pick one obvious VM example,
> something like mlock() may well return an error after the area has
> been successfully locked, but then the population of said pages failed
> for some reason.
>
> Of course, implementations can differ, and POSIX sometimes has insane
> language that is actively incorrect.
>
> Furthermore, the definition of "atomic" is unclear. For example, POSIX
> claims that a "write()" system call is one atomic thing for regular
> files, and some people think that means that you see all or nothing.
> That's simply not true, and you'll see the write progress in various
> indirect ways (look at intermediate file size with 'stat', look at
> intermediate contents with 'mmap' etc etc).
>
> So I agree that atomicity is something that people should always
> *strive* for, but it's not some kind of final truth or absolute
> requirement.
>
> In the specific case of mseal(), I suspect there are very few reasons
> ever *not* to be atomic, so in this particular context atomicity is
> likely always something that should be guaranteed. But I just wanted
> to point out that it's most definitely not a black-and-white issue in
> the general case.
>
Thanks.
At least I got this part done right for mseal() :-)

-Jeff


>              Linus
>
Jeff Xu Feb. 2, 2024, 9:20 p.m. UTC | #10
On Fri, Feb 2, 2024 at 10:52 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote:
> > >
> > > Jeff Xu <jeffxu@chromium.org> wrote:
> > >
> > > > Even without free.
> > > > I personally do not like the heap getting sealed like that.
> > > >
> > > > Component A.
> > > > p=malloc(4096);
> > > > writing something to p.
> > > >
> > > > Component B:
> > > > mprotect(p,4096, RO)
> > > > mseal(p,4096)
> > > >
> > > > This will split the heap VMA and prevent the heap from shrinking. If
> > > > this is in a frequent code path, then it might hurt the process's
> > > > memory usage.
> > > >
> > > > The existing code is more likely to use malloc than mmap(), so it is
> > > > easier for dev to seal a piece of data belonging to another component.
> > > > I hope this pattern is not wide-spreading.
> > > >
> > > > The ideal way would be just changing library A to use mmap.
> > >
> > > I think you are lacking some test programs to see how it actually
> > > behaves; the effect is worse than you think, and the impact is immediately
> > > visible to the programmer, and the lesson is clear:
> > >
> > >         you can only seal objects which you guarantee never get recycled.
> > >
> > >         Pushing a sealed object back into reuse is a disastrous bug.
> > >
> > >         No one should call this interface, unless they understand that.
> > >
> > > I'll say again, you don't have a test program for various allocators to
> > > understand how it behaves.  The failure modes described in your documents
> > > are not correct.
> > >
> > I understand what you mean: I will add that part to the document:
> > Trying to recycle sealed memory is disastrous, e.g.
> > p=malloc(4096);
> > mprotect(p,4096,RO)
> > mseal(p,4096)
> > free(p);
> >
> > My point is:
> > I think sealing an object from the heap is a bad pattern in general,
> > even if the dev doesn't free it. That was one of the reasons for the
> > sealable flag; I hope saying this isn't perceived as looking for excuses.
>
> The point you're missing is that adding MAP_SEALABLE reduces
> composability. With MAP_SEALABLE, everything that mmaps some part of
> the address space that may ever be sealed will need to be modified to
> know about MAP_SEALABLE.
>
> Say you did the same thing for mprotect. MAP_PROTECT would control the
> mprotectability of the map. You'd stop:
>
> p = malloc(4096);
> mprotect(p, 4096, PROT_READ);
> free(p);
>
> ! But you'd need to change every spot that mmap()'s something to know
> about and use MAP_PROTECT: all "producers" of mmap memory would need
> to know about the consumers doing mprotect(). So now either all mmap()
> callers mindlessly add MAP_PROTECT out of fear the consumers do
> mprotect (and you gain nothing from MAP_PROTECT), or the mmap()
> callers need to know the consumers call mprotect(), and thus you
> introduce a huge layering violation (and you actually lose from having
> MAP_PROTECT).
>
> Hopefully you can map the above to MAP_SEALABLE. Or to any other m*()
> operation. For example, if chrome runs on an older glibc that does not
> know about MAP_SEALABLE, it will not be able to mseal() its own shared
> libraries' .text (even if, yes, that should ideally be left to ld.so).
>
I think I have heard enough complaints about MAP_SEALABLE from Linux
developers and Linus in the last two days to convince myself that it
is a bad idea :)

For the last time: I was trying to limit the scope of mseal() to two
known cases. And MAP_SEALABLE is a reversible decision; a sysctl can
turn it off, or we can obsolete it in the future (this was mentioned
in the documentation of V8).

I will rest my case. Obviously, from the feedback, it is loud and
clear that we want to be able to seal all the memory.

> IMO, UNIX API design has historically mostly been "play stupid games,
> win stupid prizes", which is e.g: why things like close(STDOUT_FILENO)
> work. If you close stdout (and don't dup/reopen something to stdout)
> and printf(), things will break, and you get to keep both pieces.
> There's no O_CLOSEABLE, just as there's no O_DUPABLE.
>
> --
> Pedro
David Laight Feb. 4, 2024, 7:39 p.m. UTC | #11
...
> IMO, UNIX API design has historically mostly been "play stupid games,
> win stupid prizes", which is e.g: why things like close(STDOUT_FILENO)
> work. If you close stdout (and don't dup/reopen something to stdout)
> and printf(), things will break, and you get to keep both pieces.

That is pretty much why libraries must never use printf().
(Try telling that to people at work!)

In the days when processes could only have 20 files open
it was a much bigger problem.
You couldn't afford to not use 0, 1 and 2.
A certain daemon ended up using fd 1 as a pipe to another daemon.
Someone accidentally used printf() instead of fprintf() for a trace.
When the 10k stdio buffer filled, the text got written to the pipe.
The expected fixed-size message had a 32-bit 'trailer' size.
Although no defined messages supported trailers, the second daemon
synchronously discarded the trailer - with the expected side effect.

Wasn't my bug, and someone else found it, but I'd read the broken
code a few times without seeing the fubar.

Trouble is it all worked for quite a long time...

	David
 

Suren Baghdasaryan Feb. 5, 2024, 10:13 p.m. UTC | #12
On Fri, Feb 2, 2024 at 8:46 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> [240202 18:36]:
> > On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > There will be a larger performance cost to checking up front without
> > > allowing the partial completion.
> >
> > I suspect that for mseal(), the only half-way common case will be
> > sealing an area that is entirely contained within one vma.
>
> Agreed.
>
> >
> > So the cost will be the vma splitting (if it's not the whole vma), and
> > very unlikely to be any kind of "walk the vma's to check that they can
> > all be sealed" loop up-front.
>
> That's the cost of calling mseal(), and I think that will be totally
> reasonable.
>
> I'm more concerned with the other calls that do affect more than one vma
> that will now have to ensure there is not an mseal'ed vma among the
> affected area.
>
> As you pointed out, we don't do atomic updates and so we have to add a
> loop at the beginning to check this new special case, which is what this
> patch set does today.  That means we're going to be looping through
> twice for any call that could fail if one is mseal'ed. This includes
> munmap() and mprotect().
>
> The impact will vary based on how many vma's are handled. I'd like some
> numbers on this so we can see if it is a concern, which Jeff has agreed
> to provide in the future - Thank you, Jeff.

Yes please. The additional walk Liam points to seems to be happening
even if we don't use mseal at all. Android apps often create thousands
of VMAs, so a small regression to a syscall like mprotect might cause
a very visible regression to app launch times (one of the key metrics
for Android). Having performance impact numbers here would be very
helpful.

>
> It also means we're modifying the behaviour of those calls so they could
> fail before anything changes (regardless of where the failure would
> occur), and we could still fail later when another aspect of a vma would
> cause a failure as we do today.  We are paying the price for a more
> atomic update, but we aren't trying very hard to be atomic with our
> updates - we don't have many (virtually no) vma checks before
> modifications start.
>
> For instance, we could move the mprotect check for map_deny_write_exec()
> to the pre-update loop to make it more atomic in nature.  This one seems
> somewhat related to mseal, so it would be better if they were both
> checked atomic(ish) together.  Although, I wonder if the user visible
> changes would be acceptable and worth the risk.
>
> We will have two classes of updates to vma's: the more atomic view and
> the legacy view.  The question of what happens when the two mix, or
> where a specific check should go will get (more) confusing.
>
> Thanks,
> Liam
>