mbox series

[v2,00/15] x86/boot: Rework PE header generation

Message ID 20230912090051.4014114-17-ardb@google.com
Headers show
Series x86/boot: Rework PE header generation | expand

Message

Ard Biesheuvel Sept. 12, 2023, 9 a.m. UTC
From: Ard Biesheuvel <ardb@kernel.org>

Now that the EFI stub boot flow no longer relies on memory that is
executable and writable at the same time, we can reorganize the PE/COFF
view of the kernel image and expose the decompressor binary's code and
r/o data as a .text section and data/bss as a .data section, using 4k
alignment and limited permissions.

Doing so is necessary for compatibility with hardening measures that are
being rolled out on x86 PCs built to run Windows (i.e., the majority of
them). The EFI boot environment that the Linux EFI stub executes in is
especially sensitive to safety issues, given that a vulnerability in the
loader of one OS can be abused to attack another.

In true x86 fashion, this is a lot more complicated than on other
architectures, which have implemented this code/data split with 4k
alignment from the beginning. The complicating factor here is that the
boot image consists of two different parts, which are stitched together
and fixed up using a special build tool.

After this series is applied, the only remaining task performed by the
build tool is generating the CRC-32. Even though this checksum is
usually wrong (given that distro kernels are signed for secure boot in a
way that corrupts the CRC), this feature is retained as we cannot be
sure that nobody is relying on this.

This supersedes the work proposed by Evgeniy last year, which did a
major rewrite of the build tool in order to clean it up, before updating
it to generate the new 4k aligned image layout. As this series proves,
the build tool is mostly unnecessary, and we have too many of those
already.

Changes since v1:
- drop patch that removed the CRC and the build tool
- do not use fixed setup_size but derive it in the setup.ld linker
  script
- reorganize the PE header so the .compat section only covers its
  payload and the padding that follows it
- add hpa's ack to patch #4

Cc: Evgeniy Baskov <baskov@ispras.ru>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Jones <pjones@redhat.com>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>

Ard Biesheuvel (15):
  x86/efi: Drop EFI stub .bss from .data section
  x86/efi: Disregard setup header of loaded image
  x86/efi: Drop alignment flags from PE section headers
  x86/boot: Remove the 'bugger off' message
  x86/boot: Omit compression buffer from PE/COFF image memory footprint
  x86/boot: Drop redundant code setting the root device
  x86/boot: Grab kernel_info offset from zoffset header directly
  x86/boot: Drop references to startup_64
  x86/boot: Set EFI handover offset directly in header asm
  x86/boot: Define setup size in linker script
  x86/boot: Derive file size from _edata symbol
  x86/boot: Construct PE/COFF .text section from assembler
  x86/boot: Drop PE/COFF .reloc section
  x86/boot: Split off PE/COFF .data section
  x86/boot: Increase section and file alignment to 4k/512

 arch/x86/boot/Makefile                  |   2 +-
 arch/x86/boot/compressed/vmlinux.lds.S  |   6 +-
 arch/x86/boot/header.S                  | 213 ++++++---------
 arch/x86/boot/setup.ld                  |  14 +-
 arch/x86/boot/tools/build.c             | 273 +-------------------
 drivers/firmware/efi/libstub/Makefile   |   7 -
 drivers/firmware/efi/libstub/x86-stub.c |  46 +---
 7 files changed, 114 insertions(+), 447 deletions(-)

Comments

Ard Biesheuvel Sept. 16, 2023, 7:14 p.m. UTC | #1
On Sat, 16 Sept 2023 at 11:11, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > > > So, the first 8 patches broke the x86-64-defconfig-ish Qemu bzImage bootup,
> > > > due to the 8th patch:
> > > >
> > > >   988b52b207a9fe74c3699bda8c2256714926b94b is the first bad commit
> > > >   commit 988b52b207a9fe74c3699bda8c2256714926b94b
> > > >   Author: Ard Biesheuvel <ardb@kernel.org>
> > > >   Date:   Tue Sep 12 09:01:01 2023 +0000
> > > >
> > > >       x86/boot: Define setup size in linker script
> > > >
> > > > I've removed it for now - but this side effect was not expected.
> > > >
> > >
> > > No, definitely not expected. I tested various combinations of i386 /
> > > x86_64 built with GCC / Clang doing EFI or BIOS boot.
> > >
> > > I'll rebase the remaining stuff onto -tip and see if I can reproduce this.
> >
> > This is actually quite bizarre. x86_64_defconfig has
> > CONFIG_EFI_MIXED=y and i tested that this change produces the exact
> > same bzImage binary in that case.
> >
> > Could you send me the .config and the QEMU command line perhaps?
>
> So the patch below is the delta between v2 and v3 - that is expected
> to fix the bzImage boot crash, right?
>

Yes.

ld.bfd does something unexpected [to me] here, and the resulting value
turns out not to be a multiple of 512 at all.

With this tweak, my claim that this patch does not affect the binary
bzImage at all actually holds for ld.bfd as well (provided that
CONFIG_EFI_MIXED=y and CONFIG_LOCAL_VERSION_AUTO is disabled)

So if this still does not work in your test, could you please disable
CONFIG_LOCAL_VERSION_AUTO and compare the md5sums of the two builds?

Thanks,


> --- tip.orig/arch/x86/boot/setup.ld
> +++ tip/arch/x86/boot/setup.ld
> @@ -41,7 +41,7 @@ SECTIONS
>                 LONG(0x5a5aaa55)
>
>                 /* Reserve some extra space for the reloc and compat sections */
> -               setup_size = ABSOLUTE(ALIGN(. + 64, 512));
> +               setup_size = ALIGN(ABSOLUTE(.) + 64, 512);
>                 setup_sects = ABSOLUTE(setup_size / 512);
>         }
>
>
Ingo Molnar Sept. 17, 2023, 5:50 p.m. UTC | #2
* Ard Biesheuvel <ardb@kernel.org> wrote:

> On Sat, 16 Sept 2023 at 11:11, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > > > > So, the first 8 patches broke the x86-64-defconfig-ish Qemu bzImage bootup,
> > > > > due to the 8th patch:
> > > > >
> > > > >   988b52b207a9fe74c3699bda8c2256714926b94b is the first bad commit
> > > > >   commit 988b52b207a9fe74c3699bda8c2256714926b94b
> > > > >   Author: Ard Biesheuvel <ardb@kernel.org>
> > > > >   Date:   Tue Sep 12 09:01:01 2023 +0000
> > > > >
> > > > >       x86/boot: Define setup size in linker script
> > > > >
> > > > > I've removed it for now - but this side effect was not expected.
> > > > >
> > > >
> > > > No, definitely not expected. I tested various combinations of i386 /
> > > > x86_64 built with GCC / Clang doing EFI or BIOS boot.
> > > >
> > > > I'll rebase the remaining stuff onto -tip and see if I can reproduce this.
> > >
> > > This is actually quite bizarre. x86_64_defconfig has
> > > CONFIG_EFI_MIXED=y and i tested that this change produces the exact
> > > same bzImage binary in that case.
> > >
> > > Could you send me the .config and the QEMU command line perhaps?
> >
> > So the patch below is the delta between v2 and v3 - that is expected
> > to fix the bzImage boot crash, right?
> >
> 
> Yes.
> 
> ld.bfd does something unexpected [to me] here, and the resulting value
> turns out not to be a multiple of 512 at all.
> 
> With this tweak, my claim that this patch does not affect the binary
> bzImage at all actually holds for ld.bfd as well (provided that
> CONFIG_EFI_MIXED=y and CONFIG_LOCAL_VERSION_AUTO is disabled)

Ok - it boots & works fine for me too now, with the full series applied.

Thanks,

	Ingo
Jan Hendrik Farr Oct. 3, 2023, 2:02 a.m. UTC | #3
On 12 09:00:51, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> Now that the EFI stub boot flow no longer relies on memory that is
> executable and writable at the same time, we can reorganize the PE/COFF
> view of the kernel image and expose the decompressor binary's code and
> r/o data as a .text section and data/bss as a .data section, using 4k
> alignment and limited permissions.
> 
> Doing so is necessary for compatibility with hardening measures that are
> being rolled out on x86 PCs built to run Windows (i.e., the majority of
> them). The EFI boot environment that the Linux EFI stub executes in is
> especially sensitive to safety issues, given that a vulnerability in the
> loader of one OS can be abused to attack another.

This split is also useful for the work of kexecing the next kernel as an
EFI application. With the current EFI stub I have to set the memory both
writable and executable which results in W^X warnings with a default
config.

What made this more confusing was that the flags of the .text section in
current EFI stub bzImages are set to
IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section
according to those flags the EFI stub will quickly run into issues.

I assume current firmware on x86 machines does not set any restricted
permissions on the memory. Can someone enlighten me on their behavior?


> In true x86 fashion, this is a lot more complicated than on other
> architectures, which have implemented this code/data split with 4k
> alignment from the beginning. The complicating factor here is that the
> boot image consists of two different parts, which are stitched together
> and fixed up using a special build tool.
> 
> After this series is applied, the only remaining task performed by the
> build tool is generating the CRC-32. Even though this checksum is
> usually wrong (given that distro kernels are signed for secure boot in a
> way that corrupts the CRC), this feature is retained as we cannot be
> sure that nobody is relying on this.
> 
> This supersedes the work proposed by Evgeniy last year, which did a
> major rewrite of the build tool in order to clean it up, before updating
> it to generate the new 4k aligned image layout. As this series proves,
> the build tool is mostly unnecessary, and we have too many of those
> already.
> 
> Changes since v1:
> - drop patch that removed the CRC and the build tool
> - do not use fixed setup_size but derive it in the setup.ld linker
>   script
> - reorganize the PE header so the .compat section only covers its
>   payload and the padding that follows it
> - add hpa's ack to patch #4
> 
> Cc: Evgeniy Baskov <baskov@ispras.ru>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Jones <pjones@redhat.com>
> Cc: Matthew Garrett <mjg59@srcf.ucam.org>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> 
> Ard Biesheuvel (15):
>   x86/efi: Drop EFI stub .bss from .data section
>   x86/efi: Disregard setup header of loaded image
>   x86/efi: Drop alignment flags from PE section headers
>   x86/boot: Remove the 'bugger off' message
>   x86/boot: Omit compression buffer from PE/COFF image memory footprint
>   x86/boot: Drop redundant code setting the root device
>   x86/boot: Grab kernel_info offset from zoffset header directly
>   x86/boot: Drop references to startup_64
>   x86/boot: Set EFI handover offset directly in header asm
>   x86/boot: Define setup size in linker script
>   x86/boot: Derive file size from _edata symbol
>   x86/boot: Construct PE/COFF .text section from assembler
>   x86/boot: Drop PE/COFF .reloc section
>   x86/boot: Split off PE/COFF .data section
>   x86/boot: Increase section and file alignment to 4k/512
> 
>  arch/x86/boot/Makefile                  |   2 +-
>  arch/x86/boot/compressed/vmlinux.lds.S  |   6 +-
>  arch/x86/boot/header.S                  | 213 ++++++---------
>  arch/x86/boot/setup.ld                  |  14 +-
>  arch/x86/boot/tools/build.c             | 273 +-------------------
>  drivers/firmware/efi/libstub/Makefile   |   7 -
>  drivers/firmware/efi/libstub/x86-stub.c |  46 +---
>  7 files changed, 114 insertions(+), 447 deletions(-)
> 
> -- 
> 2.42.0.283.g2d96d420d3-goog
>
Ard Biesheuvel Oct. 23, 2023, 11:22 a.m. UTC | #4
On Tue, 3 Oct 2023 at 04:03, Jan Hendrik Farr <kernel@jfarr.cc> wrote:
>
> On 12 09:00:51, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Now that the EFI stub boot flow no longer relies on memory that is
> > executable and writable at the same time, we can reorganize the PE/COFF
> > view of the kernel image and expose the decompressor binary's code and
> > r/o data as a .text section and data/bss as a .data section, using 4k
> > alignment and limited permissions.
> >
> > Doing so is necessary for compatibility with hardening measures that are
> > being rolled out on x86 PCs built to run Windows (i.e., the majority of
> > them). The EFI boot environment that the Linux EFI stub executes in is
> > especially sensitive to safety issues, given that a vulnerability in the
> > loader of one OS can be abused to attack another.
>
> This split is also useful for the work of kexecing the next kernel as an
> EFI application. With the current EFI stub I have to set the memory both
> writable and executable which results in W^X warnings with a default
> config.
>
> What made this more confusing was that the flags of the .text section in
> current EFI stub bzImages are set to
> IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section
> according to those flags the EFI stub will quickly run into issues.
>
> I assume current firmware on x86 machines does not set any restricted
> permissions on the memory. Can someone enlighten me on their behavior?
>

No current x86 firmware does not use restricted permissions at all.
All memory is mapped with both writable and executable permissions,
except maybe the stack.

The x86 Linux kernel has been depending on this behavior too, up until
recently (fixes are in -rc now for the v6.6 release). Before this, it
would copy its own executable image around in memory.

So EFI based kexec will need to support this behavior if it targets
older x86 kernels, although I am skeptical that this is a useful
design goal.

I have been experimenting with running the EFI stub code in user space
all the way until ExitBootServices(). The same might work for UKI if
it is layered cleanly on top of the EFI APIs (rather than poking into
system registers or page tables under the hood).

How this would work with signed images etc is TBD but I quite like the
idea of running everything in user space and having a minimal
purgatory (or none at all) if we can simply populate the entire
address space while running unprivileged, and just branch to it in the
kexec() syscall. I imagine this being something like a userspace
helper that is signed/trusted itself, and gets invoked by the kernel
to run EFI images that are trusted and tagged as being executable
unprivileged.
Jan Hendrik Farr Oct. 23, 2023, 5:35 p.m. UTC | #5
On 23 13:22:53, Ard Biesheuvel wrote:
> On Tue, 3 Oct 2023 at 04:03, Jan Hendrik Farr <kernel@jfarr.cc> wrote:
> >
> > On 12 09:00:51, Ard Biesheuvel wrote:
> > > From: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > Now that the EFI stub boot flow no longer relies on memory that is
> > > executable and writable at the same time, we can reorganize the PE/COFF
> > > view of the kernel image and expose the decompressor binary's code and
> > > r/o data as a .text section and data/bss as a .data section, using 4k
> > > alignment and limited permissions.
> > >
> > > Doing so is necessary for compatibility with hardening measures that are
> > > being rolled out on x86 PCs built to run Windows (i.e., the majority of
> > > them). The EFI boot environment that the Linux EFI stub executes in is
> > > especially sensitive to safety issues, given that a vulnerability in the
> > > loader of one OS can be abused to attack another.
> >
> > This split is also useful for the work of kexecing the next kernel as an
> > EFI application. With the current EFI stub I have to set the memory both
> > writable and executable which results in W^X warnings with a default
> > config.
> >
> > What made this more confusing was that the flags of the .text section in
> > current EFI stub bzImages are set to
> > IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section
> > according to those flags the EFI stub will quickly run into issues.
> >
> > I assume current firmware on x86 machines does not set any restricted
> > permissions on the memory. Can someone enlighten me on their behavior?
> >
> 
> No current x86 firmware does not use restricted permissions at all.
> All memory is mapped with both writable and executable permissions,
> except maybe the stack.
> 
> The x86 Linux kernel has been depending on this behavior too, up until
> recently (fixes are in -rc now for the v6.6 release). Before this, it
> would copy its own executable image around in memory.
> 
> So EFI based kexec will need to support this behavior if it targets
> older x86 kernels, although I am skeptical that this is a useful
> design goal.

I don't see this as an important goal either.

> I have been experimenting with running the EFI stub code in user space
> all the way until ExitBootServices(). The same might work for UKI if
> it is layered cleanly on top of the EFI APIs (rather than poking into
> system registers or page tables under the hood).
> 
> How this would work with signed images etc is TBD but I quite like the
> idea of running everything in user space and having a minimal
> purgatory (or none at all) if we can simply populate the entire
> address space while running unprivileged, and just branch to it in the
> kexec() syscall. I imagine this being something like a userspace
> helper that is signed/trusted itself, and gets invoked by the kernel
> to run EFI images that are trusted and tagged as being executable
> unprivileged.

I've been experimenting with running EFI apps inside kernel space instead.
This is the more natural approach for signed images. Sure, a malicious EFI
app could do arbitrary stuff in kernel mode, but they're signed. On the other
hand running this entirely in user space would at least guarantee that the
system can not crash due to a misbehaving EFI app (at least until
ExitBootServices()).

The question of whether or not to make this the job of a userspace helper that
is signed must have come up when kexec_file_load syscall was added. It would
have also been an option at the time to extend trust to a signed version of
the userspace kexec tool.

Why was kexec_file_load created instead of restricting kexec_load to a signed
version of the kexec userspace tool?
Dave Young Oct. 24, 2023, 8:21 a.m. UTC | #6
On Tue, 24 Oct 2023 at 01:37, Jan Hendrik Farr <kernel@jfarr.cc> wrote:
>
> On 23 13:22:53, Ard Biesheuvel wrote:
> > On Tue, 3 Oct 2023 at 04:03, Jan Hendrik Farr <kernel@jfarr.cc> wrote:
> > >
> > > On 12 09:00:51, Ard Biesheuvel wrote:
> > > > From: Ard Biesheuvel <ardb@kernel.org>
> > > >
> > > > Now that the EFI stub boot flow no longer relies on memory that is
> > > > executable and writable at the same time, we can reorganize the PE/COFF
> > > > view of the kernel image and expose the decompressor binary's code and
> > > > r/o data as a .text section and data/bss as a .data section, using 4k
> > > > alignment and limited permissions.
> > > >
> > > > Doing so is necessary for compatibility with hardening measures that are
> > > > being rolled out on x86 PCs built to run Windows (i.e., the majority of
> > > > them). The EFI boot environment that the Linux EFI stub executes in is
> > > > especially sensitive to safety issues, given that a vulnerability in the
> > > > loader of one OS can be abused to attack another.
> > >
> > > This split is also useful for the work of kexecing the next kernel as an
> > > EFI application. With the current EFI stub I have to set the memory both
> > > writable and executable which results in W^X warnings with a default
> > > config.
> > >
> > > What made this more confusing was that the flags of the .text section in
> > > current EFI stub bzImages are set to
> > > IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section
> > > according to those flags the EFI stub will quickly run into issues.
> > >
> > > I assume current firmware on x86 machines does not set any restricted
> > > permissions on the memory. Can someone enlighten me on their behavior?
> > >
> >
> > No current x86 firmware does not use restricted permissions at all.
> > All memory is mapped with both writable and executable permissions,
> > except maybe the stack.
> >
> > The x86 Linux kernel has been depending on this behavior too, up until
> > recently (fixes are in -rc now for the v6.6 release). Before this, it
> > would copy its own executable image around in memory.
> >
> > So EFI based kexec will need to support this behavior if it targets
> > older x86 kernels, although I am skeptical that this is a useful
> > design goal.
>
> I don't see this as an important goal either.
>
> > I have been experimenting with running the EFI stub code in user space
> > all the way until ExitBootServices(). The same might work for UKI if
> > it is layered cleanly on top of the EFI APIs (rather than poking into
> > system registers or page tables under the hood).
> >
> > How this would work with signed images etc is TBD but I quite like the
> > idea of running everything in user space and having a minimal
> > purgatory (or none at all) if we can simply populate the entire
> > address space while running unprivileged, and just branch to it in the
> > kexec() syscall. I imagine this being something like a userspace
> > helper that is signed/trusted itself, and gets invoked by the kernel
> > to run EFI images that are trusted and tagged as being executable
> > unprivileged.
>
> I've been experimenting with running EFI apps inside kernel space instead.
> This is the more natural approach for signed images. Sure, a malicious EFI
> app could do arbitrary stuff in kernel mode, but they're signed. On the other
> hand running this entirely in user space would at least guarantee that the
> system can not crash due to a misbehaving EFI app (at least until
> ExitBootServices()).
>
> The question of whether or not to make this the job of a userspace helper that
> is signed must have come up when kexec_file_load syscall was added. It would
> have also been an option at the time to extend trust to a signed version of
> the userspace kexec tool.
>
> Why was kexec_file_load created instead of restricting kexec_load to a signed
> version of the kexec userspace tool?

I think one of the reasons is that it is hard to handle dynamic linked
libraries, not only the kexec-tools binary.

Thanks
Dave
Dave Young Oct. 24, 2023, 8:31 a.m. UTC | #7
On Tue, 24 Oct 2023 at 16:21, Dave Young <dyoung@redhat.com> wrote:
>
> On Tue, 24 Oct 2023 at 01:37, Jan Hendrik Farr <kernel@jfarr.cc> wrote:
> >
> > On 23 13:22:53, Ard Biesheuvel wrote:
> > > On Tue, 3 Oct 2023 at 04:03, Jan Hendrik Farr <kernel@jfarr.cc> wrote:
> > > >
> > > > On 12 09:00:51, Ard Biesheuvel wrote:
> > > > > From: Ard Biesheuvel <ardb@kernel.org>
> > > > >
> > > > > Now that the EFI stub boot flow no longer relies on memory that is
> > > > > executable and writable at the same time, we can reorganize the PE/COFF
> > > > > view of the kernel image and expose the decompressor binary's code and
> > > > > r/o data as a .text section and data/bss as a .data section, using 4k
> > > > > alignment and limited permissions.
> > > > >
> > > > > Doing so is necessary for compatibility with hardening measures that are
> > > > > being rolled out on x86 PCs built to run Windows (i.e., the majority of
> > > > > them). The EFI boot environment that the Linux EFI stub executes in is
> > > > > especially sensitive to safety issues, given that a vulnerability in the
> > > > > loader of one OS can be abused to attack another.
> > > >
> > > > This split is also useful for the work of kexecing the next kernel as an
> > > > EFI application. With the current EFI stub I have to set the memory both
> > > > writable and executable which results in W^X warnings with a default
> > > > config.
> > > >
> > > > What made this more confusing was that the flags of the .text section in
> > > > current EFI stub bzImages are set to
> > > > IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section
> > > > according to those flags the EFI stub will quickly run into issues.
> > > >
> > > > I assume current firmware on x86 machines does not set any restricted
> > > > permissions on the memory. Can someone enlighten me on their behavior?
> > > >
> > >
> > > No current x86 firmware does not use restricted permissions at all.
> > > All memory is mapped with both writable and executable permissions,
> > > except maybe the stack.
> > >
> > > The x86 Linux kernel has been depending on this behavior too, up until
> > > recently (fixes are in -rc now for the v6.6 release). Before this, it
> > > would copy its own executable image around in memory.
> > >
> > > So EFI based kexec will need to support this behavior if it targets
> > > older x86 kernels, although I am skeptical that this is a useful
> > > design goal.
> >
> > I don't see this as an important goal either.
> >
> > > I have been experimenting with running the EFI stub code in user space
> > > all the way until ExitBootServices(). The same might work for UKI if
> > > it is layered cleanly on top of the EFI APIs (rather than poking into
> > > system registers or page tables under the hood).
> > >
> > > How this would work with signed images etc is TBD but I quite like the
> > > idea of running everything in user space and having a minimal
> > > purgatory (or none at all) if we can simply populate the entire
> > > address space while running unprivileged, and just branch to it in the
> > > kexec() syscall. I imagine this being something like a userspace
> > > helper that is signed/trusted itself, and gets invoked by the kernel
> > > to run EFI images that are trusted and tagged as being executable
> > > unprivileged.
> >
> > I've been experimenting with running EFI apps inside kernel space instead.
> > This is the more natural approach for signed images. Sure, a malicious EFI
> > app could do arbitrary stuff in kernel mode, but they're signed. On the other
> > hand running this entirely in user space would at least guarantee that the
> > system can not crash due to a misbehaving EFI app (at least until
> > ExitBootServices()).
> >
> > The question of whether or not to make this the job of a userspace helper that
> > is signed must have come up when kexec_file_load syscall was added. It would
> > have also been an option at the time to extend trust to a signed version of
> > the userspace kexec tool.
> >
> > Why was kexec_file_load created instead of restricting kexec_load to a signed
> > version of the kexec userspace tool?
>
> I think one of the reasons is that it is hard to handle dynamic linked
> libraries, not only the kexec-tools binary.

Hmm, another one is that ptrace needs to be disabled in some way,
anyway I think it is way too complicated, but I do not remember the
details, added Vivek in cc.
See this article: https://lwn.net/Articles/532778/

>
> Thanks
> Dave