mbox series

[RFT,v3,00/21] x86: strict separation of startup code

Message ID 20250512190834.332684-23-ardb+git@google.com
Headers show
Series x86: strict separation of startup code | expand

Message

Ard Biesheuvel May 12, 2025, 7:08 p.m. UTC
From: Ard Biesheuvel <ardb@kernel.org>

!!! Boot tested on non-SEV guest ONLY !!!!

This RFT series implements a strict separation between startup code and
ordinary code, where startup code is built in a way that tolerates being
invoked from the initial 1:1 mapping of memory.

The existsing approach of emitting this code into .head.text and
checking for absolute relocations in that section is not 100% safe, and
produces diagnostics that are sometimes difficult to interpret. [0]

Instead, rely on symbol prefixes, similar to how this is implemented for
the EFI stub and for the startup code in the arm64 port. This ensures
that startup code can only call other startup code, unless a special
symbol alias is emitted that exposes a non-startup routine to the
startup code.

This is somewhat intrusive, as there are many data objects that are
referenced both by startup code and by ordinary code, and an alias needs
to be emitted for each of those. If startup code references anything
that has not been made available to it explicitly, a build time link
error will occur.

This ultimately allows the .head.text section to be dropped entirely, as
it no longer has a special significance. Instead, code that only
executes at boot is emitted into .init.text as it should.

The majority of changes is around early SEV code. The main issue is that
the use of GHCB pages and SVSM calling areas in code that may run from
both the 1:1 mapping and the kernel virtual mapping is problematic as it
relies on __pa() to perform VA to PA translations, which are ambiguous
in this context. Also, __pa() pulls in non-trivial instrumented code
when CONFIG_DEBUG_VIRTUAL=y and so it is better to avoid VA to PA
translations altogether in the startup code.

Change since RFT/v2:
- Rebase onto tip/x86/boot and drop the patches from the previous
  revision that have been applied in the meantime.
- Omit the pgtable_l5_enabled() changes for now, and just expose PIC
  aliases for the variables in question - this can be sorted later.
- Don't use the boot SVSM calling area in snp_kexec_finish(), but pass
  down the correct per-CPU one to the early page state API.
- Rename arch/x86/coco/sev/sev-noinstr.o to arch/x86/coco/sev/noinstr.o
- Further reduce the amount of SEV code that needs to be constructed in
  a special way.

Change since RFC/v1:
- Include a major disentanglement/refactor of the SEV-SNP startup code,
  so that only code that really needs to run from the 1:1 mapping is
  included in the startup/ code

- Incorporate some early notes from Ingo

Build tested defconfig and allmodconfig

!!! Boot tested on non-SEV guest ONLY !!!!

Again, I will need to lean on Tom to determine whether this breaks
SEV-SNP guest boot. As I mentioned before, I am still waiting for
SEV-SNP capable hardware to be delivered.

Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>

[0] https://lore.kernel.org/all/CAHk-=wj7k9nvJn6cpa3-5Ciwn2RGyE605BMkjWE4MqnvC9E92A@mail.gmail.com/

Ard Biesheuvel (21):
  x86/sev: Separate MSR and GHCB based snp_cpuid() via a callback
  x86/sev: Use MSR protocol for remapping SVSM calling area
  x86/sev: Use MSR protocol only for early SVSM PVALIDATE call
  x86/sev: Run RMPADJUST on SVSM calling area page to test VMPL
  x86/sev: Move GHCB page based HV communication out of startup code
  x86/sev: Avoid global variable to store virtual address of SVSM area
  x86/sev: Move MSR save/restore out of early page state change helper
  x86/sev: Share implementation of MSR-based page state change
  x86/sev: Pass SVSM calling area down to early page state change API
  x86/sev: Use boot SVSM CA for all startup and init code
  x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check
  x86/sev: Unify SEV-SNP hypervisor feature check
  x86/sev: Provide PIC aliases for SEV related data objects
  x86/boot: Provide PIC aliases for 5-level paging related constants
  x86/sev: Move __sev_[get|put]_ghcb() into separate noinstr object
  x86/sev: Export startup routines for later use
  x86/boot: Create a confined code area for startup code
  x86/boot: Move startup code out of __head section
  x86/boot: Disallow absolute symbol references in startup code
  x86/boot: Revert "Reject absolute references in .head.text"
  x86/boot: Get rid of the .head.text section

 arch/x86/boot/compressed/sev-handle-vc.c   |   3 +
 arch/x86/boot/compressed/sev.c             | 132 ++-----
 arch/x86/boot/startup/Makefile             |  21 ++
 arch/x86/boot/startup/exports.h            |  12 +
 arch/x86/boot/startup/gdt_idt.c            |   4 +-
 arch/x86/boot/startup/map_kernel.c         |   4 +-
 arch/x86/boot/startup/sev-shared.c         | 361 +++++---------------
 arch/x86/boot/startup/sev-startup.c        | 190 ++---------
 arch/x86/boot/startup/sme.c                |  29 +-
 arch/x86/coco/sev/Makefile                 |   6 +-
 arch/x86/coco/sev/core.c                   | 179 +++++++---
 arch/x86/coco/sev/{sev-nmi.c => noinstr.c} |  74 ++++
 arch/x86/coco/sev/vc-handle.c              |   3 +-
 arch/x86/coco/sev/vc-shared.c              | 143 +++++++-
 arch/x86/include/asm/init.h                |   6 -
 arch/x86/include/asm/setup.h               |   1 +
 arch/x86/include/asm/sev-internal.h        |  29 +-
 arch/x86/include/asm/sev.h                 |  67 +++-
 arch/x86/kernel/head64.c                   |   5 +-
 arch/x86/kernel/head_32.S                  |   2 +-
 arch/x86/kernel/head_64.S                  |  10 +-
 arch/x86/kernel/vmlinux.lds.S              |   7 +-
 arch/x86/mm/mem_encrypt_amd.c              |   6 -
 arch/x86/mm/mem_encrypt_boot.S             |   6 +-
 arch/x86/platform/pvh/head.S               |   2 +-
 arch/x86/tools/relocs.c                    |   8 +-
 26 files changed, 621 insertions(+), 689 deletions(-)
 create mode 100644 arch/x86/boot/startup/exports.h
 rename arch/x86/coco/sev/{sev-nmi.c => noinstr.c} (61%)


base-commit: ed4d95d033e359f9445e85bf5a768a5859a5830b

Comments

Borislav Petkov May 12, 2025, 7:17 p.m. UTC | #1
On Mon, May 12, 2025 at 09:08:35PM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> !!! Boot tested on non-SEV guest ONLY !!!!

...

> !!! Boot tested on non-SEV guest ONLY !!!!
> 
> Again, I will need to lean on Tom to determine whether this breaks
> SEV-SNP guest boot. As I mentioned before, I am still waiting for
> SEV-SNP capable hardware to be delivered.

Ingo, please do not rush this stuff in before Tom and I have tested it
successfully with SEV* guests.

Thanks!
Ingo Molnar May 13, 2025, 10:02 a.m. UTC | #2
* Borislav Petkov <bp@alien8.de> wrote:

> On Mon, May 12, 2025 at 09:08:35PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> > 
> > !!! Boot tested on non-SEV guest ONLY !!!!
> 
> ...
> 
> > !!! Boot tested on non-SEV guest ONLY !!!!
> > 
> > Again, I will need to lean on Tom to determine whether this breaks
> > SEV-SNP guest boot. As I mentioned before, I am still waiting for
> > SEV-SNP capable hardware to be delivered.
> 
> Ingo, please do not rush this stuff in before Tom and I have tested it
> successfully with SEV* guests.
> 
> Thanks!

I don't intend to rush it, but note that AMD's SEV-SNP testing is 
lagging a *lot* at the moment: Ard asked for testing the -v2 series on 
May 4:

    https://lore.kernel.org/r/20250504095230.2932860-25-ardb+git@google.com

That request for testing was ignored AFAICS. It's May 13 and still 
crickets.

We also had SEV-SNP boot bugs pending since August 2024, that nobody 
but (eventually) AMD triggered. Ie. very few people outside of the 
vendor are testing SEV-SNP AFAICS, and even vendor testing is sporadic 
...

Please ask AMD internally to get SEV-SNP tested more reliably. Testing 
this -v3 series would be a good start. Hint, hint. ;-)

Thanks!

	Ingo
Ingo Molnar May 13, 2025, 11:22 a.m. UTC | #3
* Borislav Petkov <bp@alien8.de> wrote:

> On Tue, May 13, 2025 at 12:02:22PM +0200, Ingo Molnar wrote:
> > I don't intend to rush it,
> 
> Thanks.
> 
> > That request for testing was ignored AFAICS. It's May 13 and still 
> > crickets.
> 
> Not ignored - Tom and I are testing but we're busy as hell too.

Yeah, so the problem is that SEV* is hardware that basically no active 
tester outside of the vendor (AMD) owns and is testing against 
development trees AFAICS.

> > We also had SEV-SNP boot bugs pending since August 2024, that 
> > nobody but (eventually) AMD triggered.
> 
> Where?

I did a quick Git search, and here are a few examples:

For example, this commit from last summer:

  6c3211796326 ("x86/sev: Add SNP-specific unaccepted memory support")

... was only fixed recently:

  d54d610243a4 ("x86/boot/sev: Avoid shared GHCB page for early memory acceptance")

Or this commit from June 2024:

  34ff65901735 ("x86/sev: Use kernel provided SVSM Calling Areas")

... was only fixed a few days ago:

  f7387eff4bad ("x86/sev: Fix operator precedence in GHCB_MSR_VMPL_REQ_LEVEL macro")

Or this commit from June 2024:

  fcd042e86422 ("x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0")

... was fixed a few weeks ago:

  8ed12ab1319b ("x86/boot/sev: Support memory acceptance in the EFI stub under SVSM")

Ie. bugfix latencies here were 10+ months.

Note that two of those fixes were from Ard who is working on further 
robustifying the startup code - a much needed change.

Ie. when Ard is asking for SEV-SNP testing for WIP series, which he did 
10+ days ago, you should not ignore it ... or if you do ignore his 
request for testing, you should not complain about the changes being 
merged eventually, once they pass review & testing on non-SEV 
platforms.

> > Ie. very few people outside of the vendor are testing SEV-SNP 
> > AFAICS, and even vendor testing is sporadic ...
> 
> Not true - SEV* testing happens on a daily basis.

If you didn't have time to personally test Ard's -v2 series since May 
2, that's OK: I can merge these proposed changes in an RFT branch so 
that it gets tested in the daily testing flow. [see further below for 
the Git link]

In other words: please no "gatekeeping". Please don't force Ard into a 
catch-22 situation where he cannot test the patches on SEV-SNP, but you 
are blocking these x86 startup code changes on the grounds that they 
weren't tested on SEV-SNP ...

> > Please ask AMD internally to get SEV-SNP tested more reliably. 
> > Testing this -v3 series would be a good start. Hint, hint. ;-)
> 
> We test everything that goes into linux-next. We haven't started 
> testing unreviewed patchsets yet because we don't do that - that 
> stuff is moving.
>
> So if you want to merge something, just ping me or Tom and we'll test 
> it.

Here's Ard's request from May 2:

    https://lore.kernel.org/r/20250504095230.2932860-25-ardb+git@google.com

    "Again, I will need to lean on Tom to determine whether this breaks 
     SEV-SNP guest boot. As I mentioned before, I am still waiting for 
     SEV-SNP capable hardware to be delivered."

This request for testing was ignored AFAICS.

> But you have to give us ample time to do so - you can't merge 
> something which Ard sent *on the same day*.

Sure: -v2 was sent more than 10 days ago, and the testing request was 
ignored AFAICS. Do 10 days count as 'ample time'?

Anyway, to make it even lower-overhead to test these changes, I've put 
the -v3 series into the WIP.x86/boot tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/tip.git WIP.x86/boot

Only lightly tested. Just a "boots/doesn't boot" kind of quick feedback 
would be much appreciated.

Note that naturally this tree is still subject to rebasing, as review 
feedback is incorporated.

Thanks!

	Ingo
Ard Biesheuvel May 13, 2025, 1:58 p.m. UTC | #4
On Tue, 13 May 2025 at 14:55, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Mon, 12 May 2025 at 20:11, Ard Biesheuvel <ardb+git@google.com> wrote:
> >
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > The early page state change API is mostly only used very early, when
> > only the boot time SVSM calling area is in use. However, this API is
> > also called by the kexec finishing code, which runs very late, and
> > potentially from a different CPU (which uses a different calling area).
> >
> > To avoid pulling the per-CPU SVSM calling area pointers and related SEV
> > state into the startup code, refactor the page state change API so the
> > SVSM calling area virtual and physical addresses can be provided by the
> > caller.
> >
> > No functional change intended.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>
> This patch is broken - I'll send a followup fix asap.

... or actually, the one before it.
Ard Biesheuvel May 13, 2025, 3:01 p.m. UTC | #5
On Tue, 13 May 2025 at 15:17, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, May 13, 2025 at 01:22:16PM +0200, Ingo Molnar wrote:
...
> > Note that two of those fixes were from Ard who is working on further
> > robustifying the startup code - a much needed change.
>
> Really? Much needed huh?
>
> Please do explain why is it much needed?
>
> Because the reason Ard is doing it is a different one but maybe
> I misunderstood him...
>

I will refrain from inserting myself into the intra-tip review and
testing policy debate, but let me at least provide a quick recap of
what I am doing here and why.

Since commit

   c88d71508e36 x86/boot/64: Rewrite startup_64() in C

dated Jun 6 2017, we have been using C code on the boot path in a way
that is not supported by the toolchain, i.e., to execute non-PIC C
code from a mapping of memory that is different from the one provided
to the linker. It should have been obvious at the time that this was a
bad idea, given the need to sprinkle fixup_pointer() calls left and
right to manipulate global variables (including non-pointer variables)
without crashing.

This C startup code has been expanding, and in particular, the SEV-SNP
startup code has been expanding over the past couple of years, and
grown many of these warts, where the C code needs to use special
annotations or helpers to access global objects.

Google uses Clang internally, and as expected, it does not behave
quite like GCC in this regard either. The result is that the SEV-SNP
boot tended to break in cryptic ways with Clang built kernels, due to
absolute references in the startup code that runs before those
absolute references are mapped.

I've done a preliminary pass upstream with RIP_REL_REF() and
rip_rel_ptr() and the use of the .head.text section for startup code
to ensure that we detect such issues at build time, and it has already
resulted in a number of true positives where the code in question
would have failed at boot time. At this point, I'm not aware of any
issues caused by absolute references going undetected.

However, Linus kindly indicated that the resulting diagnostics
produced by the relocs tool do not meet his high standards, and so I
proposed another approach, which I am implementing now (see cover
letter for details). Note that this approach is also much more robust,
as annotating things as __head by hand to get it emitted into the
section to which the diagnostics are applied is obviously not
foolproof.

Fixing the existing 5-level paging and kernel mapping code was rather
straight-forward. However, splitting up the SEV-SNP code has been
rather challenging due to the way it was put together, i.e., as a
single source file used everywhere, and to which additional
functionality has been added piecemeal (i.e., the SVSM support).

It is obvious that these changes should be tested before being merged,
hence the RFT in the subject. And I have been struggling a bit to get
access to usable hardware. (I do have access to internal development
systems, but those do not fit the 'usable' description by any measure,
given that I have to go through the cloud VM orchestration APIs to
test boot a simple kernel image).

What Boris might allude to is the fact that some of these changes also
form a prerequisite for being able to construct a generic EFI zboot
image for x86, which is a long-term objective that I am working on in
the background. But this is not the main reason.

In any case, there is no urgency wrt these changes as far as I am
concerned, and given that I already found an issue myself with v3,
perhaps it is better if we disregard it for the time being, and we can
come back to it for the next cycle. In the mean time, I can compare
notes with Boris and Tom directly to ensure that this is in the right
shape, and perhaps we could at least fix the pgtable_l5_enabled() mess
as well (for which I sent out a RFC/v3 today).
Ard Biesheuvel May 13, 2025, 9:31 p.m. UTC | #6
On Tue, 13 May 2025 at 17:44, Borislav Petkov <bp@alien8.de> wrote:
>
...
> > I've done a preliminary pass upstream with RIP_REL_REF() and
> > rip_rel_ptr() and the use of the .head.text section for startup code
> > to ensure that we detect such issues at build time, and it has already
> > resulted in a number of true positives where the code in question
> > would have failed at boot time. At this point, I'm not aware of any
> > issues caused by absolute references going undetected.
> >
> > However, Linus kindly indicated that the resulting diagnostics
> > produced by the relocs tool do not meet his high standards, and so I
> > proposed another approach, which I am implementing now (see cover
> > letter for details). Note that this approach is also much more robust,
> > as annotating things as __head by hand to get it emitted into the
> > section to which the diagnostics are applied is obviously not
> > foolproof.
>
> AFAIR, this is what you've done for arm64 too, right?
>

The new approach proposed in this series is what I implemented before
for arm64, yes.