Message ID | 20230912090051.4014114-17-ardb@google.com |
---|---|
Headers | show |
Series | x86/boot: Rework PE header generation | expand |
On Sat, 16 Sept 2023 at 11:11, Ingo Molnar <mingo@kernel.org> wrote: > > > * Ard Biesheuvel <ardb@kernel.org> wrote: > > > > > So, the first 8 patches broke the x86-64-defconfig-ish Qemu bzImage bootup, > > > > due to the 8th patch: > > > > > > > > 988b52b207a9fe74c3699bda8c2256714926b94b is the first bad commit > > > > commit 988b52b207a9fe74c3699bda8c2256714926b94b > > > > Author: Ard Biesheuvel <ardb@kernel.org> > > > > Date: Tue Sep 12 09:01:01 2023 +0000 > > > > > > > > x86/boot: Define setup size in linker script > > > > > > > > I've removed it for now - but this side effect was not expected. > > > > > > > > > > No, definitely not expected. I tested various combinations of i386 / > > > x86_64 built with GCC / Clang doing EFI or BIOS boot. > > > > > > I'll rebase the remaining stuff onto -tip and see if I can reproduce this. > > > > This is actually quite bizarre. x86_64_defconfig has > > CONFIG_EFI_MIXED=y and i tested that this change produces the exact > > same bzImage binary in that case. > > > > Could you send me the .config and the QEMU command line perhaps? > > So the patch below is the delta between v2 and v3 - that is expected > to fix the bzImage boot crash, right? > Yes. ld.bfd does something unexpected [to me] here, and the resulting value turns out not to be a multiple of 512 at all. With this tweak, my claim that this patch does not affect the binary bzImage at all actually holds for ld.bfd as well (provided that CONFIG_EFI_MIXED=y and CONFIG_LOCAL_VERSION_AUTO is disabled) So if this still does not work in your test, could you please disable CONFIG_LOCAL_VERSION_AUTO and compare the md5sums of the two builds? Thanks, > --- tip.orig/arch/x86/boot/setup.ld > +++ tip/arch/x86/boot/setup.ld > @@ -41,7 +41,7 @@ SECTIONS > LONG(0x5a5aaa55) > > /* Reserve some extra space for the reloc and compat sections */ > - setup_size = ABSOLUTE(ALIGN(. + 64, 512)); > + setup_size = ALIGN(ABSOLUTE(.) + 64, 512); > setup_sects = ABSOLUTE(setup_size / 512); > } > >
* Ard Biesheuvel <ardb@kernel.org> wrote: > On Sat, 16 Sept 2023 at 11:11, Ingo Molnar <mingo@kernel.org> wrote: > > > > > > * Ard Biesheuvel <ardb@kernel.org> wrote: > > > > > > > So, the first 8 patches broke the x86-64-defconfig-ish Qemu bzImage bootup, > > > > > due to the 8th patch: > > > > > > > > > > 988b52b207a9fe74c3699bda8c2256714926b94b is the first bad commit > > > > > commit 988b52b207a9fe74c3699bda8c2256714926b94b > > > > > Author: Ard Biesheuvel <ardb@kernel.org> > > > > > Date: Tue Sep 12 09:01:01 2023 +0000 > > > > > > > > > > x86/boot: Define setup size in linker script > > > > > > > > > > I've removed it for now - but this side effect was not expected. > > > > > > > > > > > > > No, definitely not expected. I tested various combinations of i386 / > > > > x86_64 built with GCC / Clang doing EFI or BIOS boot. > > > > > > > > I'll rebase the remaining stuff onto -tip and see if I can reproduce this. > > > > > > This is actually quite bizarre. x86_64_defconfig has > > > CONFIG_EFI_MIXED=y and i tested that this change produces the exact > > > same bzImage binary in that case. > > > > > > Could you send me the .config and the QEMU command line perhaps? > > > > So the patch below is the delta between v2 and v3 - that is expected > > to fix the bzImage boot crash, right? > > > > Yes. > > ld.bfd does something unexpected [to me] here, and the resulting value > turns out not to be a multiple of 512 at all. > > With this tweak, my claim that this patch does not affect the binary > bzImage at all actually holds for ld.bfd as well (provided that > CONFIG_EFI_MIXED=y and CONFIG_LOCAL_VERSION_AUTO is disabled) Ok - it boots & works fine for me too now, with the full series applied. Thanks, Ingo
On 12 09:00:51, Ard Biesheuvel wrote: > From: Ard Biesheuvel <ardb@kernel.org> > > Now that the EFI stub boot flow no longer relies on memory that is > executable and writable at the same time, we can reorganize the PE/COFF > view of the kernel image and expose the decompressor binary's code and > r/o data as a .text section and data/bss as a .data section, using 4k > alignment and limited permissions. > > Doing so is necessary for compatibility with hardening measures that are > being rolled out on x86 PCs built to run Windows (i.e., the majority of > them). The EFI boot environment that the Linux EFI stub executes in is > especially sensitive to safety issues, given that a vulnerability in the > loader of one OS can be abused to attack another. This split is also useful for the work of kexecing the next kernel as an EFI application. With the current EFI stub I have to set the memory both writable and executable which results in W^X warnings with a default config. What made this more confusing was that the flags of the .text section in current EFI stub bzImages are set to IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section according to those flags the EFI stub will quickly run into issues. I assume current firmware on x86 machines does not set any restricted permissions on the memory. Can someone enlighten me on their behavior? > In true x86 fashion, this is a lot more complicated than on other > architectures, which have implemented this code/data split with 4k > alignment from the beginning. The complicating factor here is that the > boot image consists of two different parts, which are stitched together > and fixed up using a special build tool. > > After this series is applied, the only remaining task performed by the > build tool is generating the CRC-32. Even though this checksum is > usually wrong (given that distro kernels are signed for secure boot in a > way that corrupts the CRC), this feature is retained as we cannot be > sure that nobody is relying on this. > > This supersedes the work proposed by Evgeniy last year, which did a > major rewrite of the build tool in order to clean it up, before updating > it to generate the new 4k aligned image layout. As this series proves, > the build tool is mostly unnecessary, and we have too many of those > already. > > Changes since v1: > - drop patch that removed the CRC and the build tool > - do not use fixed setup_size but derive it in the setup.ld linker > script > - reorganize the PE header so the .compat section only covers its > payload and the padding that follows it > - add hpa's ack to patch #4 > > Cc: Evgeniy Baskov <baskov@ispras.ru> > Cc: Borislav Petkov <bp@alien8.de> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Peter Jones <pjones@redhat.com> > Cc: Matthew Garrett <mjg59@srcf.ucam.org> > Cc: Gerd Hoffmann <kraxel@redhat.com> > Cc: Kees Cook <keescook@chromium.org> > Cc: "H. Peter Anvin" <hpa@zytor.com> > > Ard Biesheuvel (15): > x86/efi: Drop EFI stub .bss from .data section > x86/efi: Disregard setup header of loaded image > x86/efi: Drop alignment flags from PE section headers > x86/boot: Remove the 'bugger off' message > x86/boot: Omit compression buffer from PE/COFF image memory footprint > x86/boot: Drop redundant code setting the root device > x86/boot: Grab kernel_info offset from zoffset header directly > x86/boot: Drop references to startup_64 > x86/boot: Set EFI handover offset directly in header asm > x86/boot: Define setup size in linker script > x86/boot: Derive file size from _edata symbol > x86/boot: Construct PE/COFF .text section from assembler > x86/boot: Drop PE/COFF .reloc section > x86/boot: Split off PE/COFF .data section > x86/boot: Increase section and file alignment to 4k/512 > > arch/x86/boot/Makefile | 2 +- > arch/x86/boot/compressed/vmlinux.lds.S | 6 +- > arch/x86/boot/header.S | 213 ++++++--------- > arch/x86/boot/setup.ld | 14 +- > arch/x86/boot/tools/build.c | 273 +------------------- > drivers/firmware/efi/libstub/Makefile | 7 - > drivers/firmware/efi/libstub/x86-stub.c | 46 +--- > 7 files changed, 114 insertions(+), 447 deletions(-) > > -- > 2.42.0.283.g2d96d420d3-goog >
On Tue, 3 Oct 2023 at 04:03, Jan Hendrik Farr <kernel@jfarr.cc> wrote: > > On 12 09:00:51, Ard Biesheuvel wrote: > > From: Ard Biesheuvel <ardb@kernel.org> > > > > Now that the EFI stub boot flow no longer relies on memory that is > > executable and writable at the same time, we can reorganize the PE/COFF > > view of the kernel image and expose the decompressor binary's code and > > r/o data as a .text section and data/bss as a .data section, using 4k > > alignment and limited permissions. > > > > Doing so is necessary for compatibility with hardening measures that are > > being rolled out on x86 PCs built to run Windows (i.e., the majority of > > them). The EFI boot environment that the Linux EFI stub executes in is > > especially sensitive to safety issues, given that a vulnerability in the > > loader of one OS can be abused to attack another. > > This split is also useful for the work of kexecing the next kernel as an > EFI application. With the current EFI stub I have to set the memory both > writable and executable which results in W^X warnings with a default > config. > > What made this more confusing was that the flags of the .text section in > current EFI stub bzImages are set to > IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section > according to those flags the EFI stub will quickly run into issues. > > I assume current firmware on x86 machines does not set any restricted > permissions on the memory. Can someone enlighten me on their behavior? > No current x86 firmware does not use restricted permissions at all. All memory is mapped with both writable and executable permissions, except maybe the stack. The x86 Linux kernel has been depending on this behavior too, up until recently (fixes are in -rc now for the v6.6 release). Before this, it would copy its own executable image around in memory. So EFI based kexec will need to support this behavior if it targets older x86 kernels, although I am skeptical that this is a useful design goal. I have been experimenting with running the EFI stub code in user space all the way until ExitBootServices(). The same might work for UKI if it is layered cleanly on top of the EFI APIs (rather than poking into system registers or page tables under the hood). How this would work with signed images etc is TBD but I quite like the idea of running everything in user space and having a minimal purgatory (or none at all) if we can simply populate the entire address space while running unprivileged, and just branch to it in the kexec() syscall. I imagine this being something like a userspace helper that is signed/trusted itself, and gets invoked by the kernel to run EFI images that are trusted and tagged as being executable unprivileged.
On 23 13:22:53, Ard Biesheuvel wrote: > On Tue, 3 Oct 2023 at 04:03, Jan Hendrik Farr <kernel@jfarr.cc> wrote: > > > > On 12 09:00:51, Ard Biesheuvel wrote: > > > From: Ard Biesheuvel <ardb@kernel.org> > > > > > > Now that the EFI stub boot flow no longer relies on memory that is > > > executable and writable at the same time, we can reorganize the PE/COFF > > > view of the kernel image and expose the decompressor binary's code and > > > r/o data as a .text section and data/bss as a .data section, using 4k > > > alignment and limited permissions. > > > > > > Doing so is necessary for compatibility with hardening measures that are > > > being rolled out on x86 PCs built to run Windows (i.e., the majority of > > > them). The EFI boot environment that the Linux EFI stub executes in is > > > especially sensitive to safety issues, given that a vulnerability in the > > > loader of one OS can be abused to attack another. > > > > This split is also useful for the work of kexecing the next kernel as an > > EFI application. With the current EFI stub I have to set the memory both > > writable and executable which results in W^X warnings with a default > > config. > > > > What made this more confusing was that the flags of the .text section in > > current EFI stub bzImages are set to > > IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section > > according to those flags the EFI stub will quickly run into issues. > > > > I assume current firmware on x86 machines does not set any restricted > > permissions on the memory. Can someone enlighten me on their behavior? > > > > No current x86 firmware does not use restricted permissions at all. > All memory is mapped with both writable and executable permissions, > except maybe the stack. > > The x86 Linux kernel has been depending on this behavior too, up until > recently (fixes are in -rc now for the v6.6 release). Before this, it > would copy its own executable image around in memory. > > So EFI based kexec will need to support this behavior if it targets > older x86 kernels, although I am skeptical that this is a useful > design goal. I don't see this as an important goal either. > I have been experimenting with running the EFI stub code in user space > all the way until ExitBootServices(). The same might work for UKI if > it is layered cleanly on top of the EFI APIs (rather than poking into > system registers or page tables under the hood). > > How this would work with signed images etc is TBD but I quite like the > idea of running everything in user space and having a minimal > purgatory (or none at all) if we can simply populate the entire > address space while running unprivileged, and just branch to it in the > kexec() syscall. I imagine this being something like a userspace > helper that is signed/trusted itself, and gets invoked by the kernel > to run EFI images that are trusted and tagged as being executable > unprivileged. I've been experimenting with running EFI apps inside kernel space instead. This is the more natural approach for signed images. Sure, a malicious EFI app could do arbitrary stuff in kernel mode, but they're signed. On the other hand running this entirely in user space would at least guarantee that the system can not crash due to a misbehaving EFI app (at least until ExitBootServices()). The question of whether or not to make this the job of a userspace helper that is signed must have come up when kexec_file_load syscall was added. It would have also been an option at the time to extend trust to a signed version of the userspace kexec tool. Why was kexec_file_load created instead of restricting kexec_load to a signed version of the kexec userspace tool?
On Tue, 24 Oct 2023 at 01:37, Jan Hendrik Farr <kernel@jfarr.cc> wrote: > > On 23 13:22:53, Ard Biesheuvel wrote: > > On Tue, 3 Oct 2023 at 04:03, Jan Hendrik Farr <kernel@jfarr.cc> wrote: > > > > > > On 12 09:00:51, Ard Biesheuvel wrote: > > > > From: Ard Biesheuvel <ardb@kernel.org> > > > > > > > > Now that the EFI stub boot flow no longer relies on memory that is > > > > executable and writable at the same time, we can reorganize the PE/COFF > > > > view of the kernel image and expose the decompressor binary's code and > > > > r/o data as a .text section and data/bss as a .data section, using 4k > > > > alignment and limited permissions. > > > > > > > > Doing so is necessary for compatibility with hardening measures that are > > > > being rolled out on x86 PCs built to run Windows (i.e., the majority of > > > > them). The EFI boot environment that the Linux EFI stub executes in is > > > > especially sensitive to safety issues, given that a vulnerability in the > > > > loader of one OS can be abused to attack another. > > > > > > This split is also useful for the work of kexecing the next kernel as an > > > EFI application. With the current EFI stub I have to set the memory both > > > writable and executable which results in W^X warnings with a default > > > config. > > > > > > What made this more confusing was that the flags of the .text section in > > > current EFI stub bzImages are set to > > > IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section > > > according to those flags the EFI stub will quickly run into issues. > > > > > > I assume current firmware on x86 machines does not set any restricted > > > permissions on the memory. Can someone enlighten me on their behavior? > > > > > > > No current x86 firmware does not use restricted permissions at all. > > All memory is mapped with both writable and executable permissions, > > except maybe the stack. > > > > The x86 Linux kernel has been depending on this behavior too, up until > > recently (fixes are in -rc now for the v6.6 release). Before this, it > > would copy its own executable image around in memory. > > > > So EFI based kexec will need to support this behavior if it targets > > older x86 kernels, although I am skeptical that this is a useful > > design goal. > > I don't see this as an important goal either. > > > I have been experimenting with running the EFI stub code in user space > > all the way until ExitBootServices(). The same might work for UKI if > > it is layered cleanly on top of the EFI APIs (rather than poking into > > system registers or page tables under the hood). > > > > How this would work with signed images etc is TBD but I quite like the > > idea of running everything in user space and having a minimal > > purgatory (or none at all) if we can simply populate the entire > > address space while running unprivileged, and just branch to it in the > > kexec() syscall. I imagine this being something like a userspace > > helper that is signed/trusted itself, and gets invoked by the kernel > > to run EFI images that are trusted and tagged as being executable > > unprivileged. > > I've been experimenting with running EFI apps inside kernel space instead. > This is the more natural approach for signed images. Sure, a malicious EFI > app could do arbitrary stuff in kernel mode, but they're signed. On the other > hand running this entirely in user space would at least guarantee that the > system can not crash due to a misbehaving EFI app (at least until > ExitBootServices()). > > The question of whether or not to make this the job of a userspace helper that > is signed must have come up when kexec_file_load syscall was added. It would > have also been an option at the time to extend trust to a signed version of > the userspace kexec tool. > > Why was kexec_file_load created instead of restricting kexec_load to a signed > version of the kexec userspace tool? I think one of the reasons is that it is hard to handle dynamic linked libraries, not only the kexec-tools binary. Thanks Dave
On Tue, 24 Oct 2023 at 16:21, Dave Young <dyoung@redhat.com> wrote: > > On Tue, 24 Oct 2023 at 01:37, Jan Hendrik Farr <kernel@jfarr.cc> wrote: > > > > On 23 13:22:53, Ard Biesheuvel wrote: > > > On Tue, 3 Oct 2023 at 04:03, Jan Hendrik Farr <kernel@jfarr.cc> wrote: > > > > > > > > On 12 09:00:51, Ard Biesheuvel wrote: > > > > > From: Ard Biesheuvel <ardb@kernel.org> > > > > > > > > > > Now that the EFI stub boot flow no longer relies on memory that is > > > > > executable and writable at the same time, we can reorganize the PE/COFF > > > > > view of the kernel image and expose the decompressor binary's code and > > > > > r/o data as a .text section and data/bss as a .data section, using 4k > > > > > alignment and limited permissions. > > > > > > > > > > Doing so is necessary for compatibility with hardening measures that are > > > > > being rolled out on x86 PCs built to run Windows (i.e., the majority of > > > > > them). The EFI boot environment that the Linux EFI stub executes in is > > > > > especially sensitive to safety issues, given that a vulnerability in the > > > > > loader of one OS can be abused to attack another. > > > > > > > > This split is also useful for the work of kexecing the next kernel as an > > > > EFI application. With the current EFI stub I have to set the memory both > > > > writable and executable which results in W^X warnings with a default > > > > config. > > > > > > > > What made this more confusing was that the flags of the .text section in > > > > current EFI stub bzImages are set to > > > > IMAGE_SCN_MEM_EXECUTE | IMAGE_SCN_MEM_READ. So if you load that section > > > > according to those flags the EFI stub will quickly run into issues. > > > > > > > > I assume current firmware on x86 machines does not set any restricted > > > > permissions on the memory. Can someone enlighten me on their behavior? > > > > > > > > > > No current x86 firmware does not use restricted permissions at all. > > > All memory is mapped with both writable and executable permissions, > > > except maybe the stack. > > > > > > The x86 Linux kernel has been depending on this behavior too, up until > > > recently (fixes are in -rc now for the v6.6 release). Before this, it > > > would copy its own executable image around in memory. > > > > > > So EFI based kexec will need to support this behavior if it targets > > > older x86 kernels, although I am skeptical that this is a useful > > > design goal. > > > > I don't see this as an important goal either. > > > > > I have been experimenting with running the EFI stub code in user space > > > all the way until ExitBootServices(). The same might work for UKI if > > > it is layered cleanly on top of the EFI APIs (rather than poking into > > > system registers or page tables under the hood). > > > > > > How this would work with signed images etc is TBD but I quite like the > > > idea of running everything in user space and having a minimal > > > purgatory (or none at all) if we can simply populate the entire > > > address space while running unprivileged, and just branch to it in the > > > kexec() syscall. I imagine this being something like a userspace > > > helper that is signed/trusted itself, and gets invoked by the kernel > > > to run EFI images that are trusted and tagged as being executable > > > unprivileged. > > > > I've been experimenting with running EFI apps inside kernel space instead. > > This is the more natural approach for signed images. Sure, a malicious EFI > > app could do arbitrary stuff in kernel mode, but they're signed. On the other > > hand running this entirely in user space would at least guarantee that the > > system can not crash due to a misbehaving EFI app (at least until > > ExitBootServices()). > > > > The question of whether or not to make this the job of a userspace helper that > > is signed must have come up when kexec_file_load syscall was added. It would > > have also been an option at the time to extend trust to a signed version of > > the userspace kexec tool. > > > > Why was kexec_file_load created instead of restricting kexec_load to a signed > > version of the kexec userspace tool? > > I think one of the reasons is that it is hard to handle dynamic linked > libraries, not only the kexec-tools binary. Hmm, another one is that ptrace needs to be disabled in some way, anyway I think it is way too complicated, but I do not remember the details, added Vivek in cc. See this article: https://lwn.net/Articles/532778/ > > Thanks > Dave
From: Ard Biesheuvel <ardb@kernel.org> Now that the EFI stub boot flow no longer relies on memory that is executable and writable at the same time, we can reorganize the PE/COFF view of the kernel image and expose the decompressor binary's code and r/o data as a .text section and data/bss as a .data section, using 4k alignment and limited permissions. Doing so is necessary for compatibility with hardening measures that are being rolled out on x86 PCs built to run Windows (i.e., the majority of them). The EFI boot environment that the Linux EFI stub executes in is especially sensitive to safety issues, given that a vulnerability in the loader of one OS can be abused to attack another. In true x86 fashion, this is a lot more complicated than on other architectures, which have implemented this code/data split with 4k alignment from the beginning. The complicating factor here is that the boot image consists of two different parts, which are stitched together and fixed up using a special build tool. After this series is applied, the only remaining task performed by the build tool is generating the CRC-32. Even though this checksum is usually wrong (given that distro kernels are signed for secure boot in a way that corrupts the CRC), this feature is retained as we cannot be sure that nobody is relying on this. This supersedes the work proposed by Evgeniy last year, which did a major rewrite of the build tool in order to clean it up, before updating it to generate the new 4k aligned image layout. As this series proves, the build tool is mostly unnecessary, and we have too many of those already. Changes since v1: - drop patch that removed the CRC and the build tool - do not use fixed setup_size but derive it in the setup.ld linker script - reorganize the PE header so the .compat section only covers its payload and the padding that follows it - add hpa's ack to patch #4 Cc: Evgeniy Baskov <baskov@ispras.ru> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Jones <pjones@redhat.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Ard Biesheuvel (15): x86/efi: Drop EFI stub .bss from .data section x86/efi: Disregard setup header of loaded image x86/efi: Drop alignment flags from PE section headers x86/boot: Remove the 'bugger off' message x86/boot: Omit compression buffer from PE/COFF image memory footprint x86/boot: Drop redundant code setting the root device x86/boot: Grab kernel_info offset from zoffset header directly x86/boot: Drop references to startup_64 x86/boot: Set EFI handover offset directly in header asm x86/boot: Define setup size in linker script x86/boot: Derive file size from _edata symbol x86/boot: Construct PE/COFF .text section from assembler x86/boot: Drop PE/COFF .reloc section x86/boot: Split off PE/COFF .data section x86/boot: Increase section and file alignment to 4k/512 arch/x86/boot/Makefile | 2 +- arch/x86/boot/compressed/vmlinux.lds.S | 6 +- arch/x86/boot/header.S | 213 ++++++--------- arch/x86/boot/setup.ld | 14 +- arch/x86/boot/tools/build.c | 273 +------------------- drivers/firmware/efi/libstub/Makefile | 7 - drivers/firmware/efi/libstub/x86-stub.c | 46 +--- 7 files changed, 114 insertions(+), 447 deletions(-)