diff mbox series

[v2] hw/i386: place setup_data at fixed place in memory

Message ID 20220804004411.1343158-1-Jason@zx2c4.com
State New
Headers show
Series [v2] hw/i386: place setup_data at fixed place in memory | expand

Commit Message

Jason A. Donenfeld Aug. 4, 2022, 12:44 a.m. UTC
The boot parameter header refers to setup_data at an absolute address,
and each setup_data refers to the next setup_data at an absolute address
too. Currently QEMU simply puts the setup_datas right after the kernel
image, and since the kernel_image is loaded at prot_addr -- a fixed
address knowable to QEMU apriori -- the setup_data absolute address
winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.

This mostly works fine, so long as the kernel image really is loaded at
prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
generally EFI doesn't give a good way of predicting where it's going to
load the kernel. So when it loads it at some address != prot_addr, the
absolute addresses in setup_data now point somewhere bogus, causing
crashes when EFI stub tries to follow the next link.

Fix this by placing setup_data at some fixed place in memory, relative
to real_addr, not as part of the kernel image, and then pointing the
setup_data absolute address to that fixed place in memory. This way,
even if OVMF or other chains relocate the kernel image, the boot
parameter still points to the correct absolute address.

Fixes: 3cbeb52467 ("hw/i386: add device tree support")
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 hw/i386/x86.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

Comments

Daniel P. Berrangé Aug. 4, 2022, 9:25 a.m. UTC | #1
On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote:
> On 08/04/22 09:03, Michael S. Tsirkin wrote:
> > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> >> The boot parameter header refers to setup_data at an absolute address,
> >> and each setup_data refers to the next setup_data at an absolute address
> >> too. Currently QEMU simply puts the setup_datas right after the kernel
> >> image, and since the kernel_image is loaded at prot_addr -- a fixed
> >> address knowable to QEMU apriori -- the setup_data absolute address
> >> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> >>
> >> This mostly works fine, so long as the kernel image really is loaded at
> >> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> >> generally EFI doesn't give a good way of predicting where it's going to
> >> load the kernel. So when it loads it at some address != prot_addr, the
> >> absolute addresses in setup_data now point somewhere bogus, causing
> >> crashes when EFI stub tries to follow the next link.
> >>
> >> Fix this by placing setup_data at some fixed place in memory, relative
> >> to real_addr, not as part of the kernel image, and then pointing the
> >> setup_data absolute address to that fixed place in memory. This way,
> >> even if OVMF or other chains relocate the kernel image, the boot
> >> parameter still points to the correct absolute address.
> >>
> >> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> >> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> >> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >> Cc: Richard Henderson <richard.henderson@linaro.org>
> >> Cc: Peter Maydell <peter.maydell@linaro.org>
> >> Cc: Michael S. Tsirkin <mst@redhat.com>
> >> Cc: Daniel P. Berrangé <berrange@redhat.com>
> >> Cc: Gerd Hoffmann <kraxel@redhat.com>
> >> Cc: Ard Biesheuvel <ardb@kernel.org>
> >> Cc: linux-efi@vger.kernel.org
> >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> > 
> > Didn't read the patch yet.
> > Adding Laszlo.
> 
> As I said in
> <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
> I don't believe that the setup_data chaining described in
> <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
> for UEFI guests at all, with QEMU pre-populating the links with GPAs.
> 
> However, rather than introducing a new info channel, or reusing an
> existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
> for the sake of this feature, I suggest simply disabling this feature
> for UEFI guests. setup_data chaining has not been necessary for UEFI
> guests for years (this is the first time I've heard about it in more
> than a decade), and the particular use case (provide guests with good
> random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.
> 
> ... Now, here's my problem: microvm, and Xen.
> 
> As far as I can tell, QEMU can determine (it already does determine)
> whether the guest uses UEFI or not, for the "pc" and "q35" machine
> types. But not for microvm or Xen!
> 
> Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
> derive that
> 
>   pflash_cfi01_get_blk(pcms->flash[0])
> 
> returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
> is only valid after the inital part of pc_system_firmware_init() has run
> ("Map legacy -drive if=pflash to machine properties"), but that is not a
> problem, given the following call tree:

I don't beleve that's a valid check anymore since Gerd introduced the
ability to load UEFI via -bios, for UEFI builds without persistent
variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 )


> Which is a big problem for my idea, because QEMU has no way of
> identifying whether microvm is going to boot a custom SeaBIOS binary
> (where the current setup_data chaining is OK) or a custom OVMF binary
> (where the current setup_data chaining crashes the guest kernel).
> 
> So I thought that for pc and q35, I should be able to propose a fix,
> based on:
> 
>   pflash_cfi01_get_blk(pcms->flash[0])
> 
> but it turns out I don't know what to do about Xen; and worse, for
> MicroVM, it's currently *impossible* for QEMU to tell apart UEFI from
> other guest firmwares.

Yep, and ultimately the inability to distinguish UEFI vs other firmware
is arguably correct by design, as the QEMU <-> firmware interface is
supposed to be arbitrarily pluggable for any firmware implementation
not  limited to merely UEFI + seabios.

> For now I suggest either reverting the original patch, or at least not
> enabling the knob by default for any machine types. In particular, when
> using MicroVM, the user must leave the knob disabled when direct booting
> a kernel on OVMF, and the user may or may not enable the knob when
> direct booting a kernel on SeaBIOS.

Having it opt-in via a knob would defeat Jason's goal of having the seed
available automatically.



With regards,
Daniel
Jason A. Donenfeld Aug. 4, 2022, 12:03 p.m. UTC | #2
Hi Daniel,

On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> Yep, and ultimately the inability to distinguish UEFI vs other firmware
> is arguably correct by design, as the QEMU <-> firmware interface is
> supposed to be arbitrarily pluggable for any firmware implementation
> not  limited to merely UEFI + seabios.

Indeed, I agree with this.

> 
> > For now I suggest either reverting the original patch, or at least not
> > enabling the knob by default for any machine types. In particular, when
> > using MicroVM, the user must leave the knob disabled when direct booting
> > a kernel on OVMF, and the user may or may not enable the knob when
> > direct booting a kernel on SeaBIOS.
> 
> Having it opt-in via a knob would defeat Jason's goal of having the seed
> available automatically.

Yes, adding a knob is absolutely out of the question.

It also doesn't actually solve the problem: this triggers when QEMU
passes a DTB too. It's not just for the new RNG seed thing. This bug
isn't new.

Jason
Daniel P. Berrangé Aug. 4, 2022, 12:11 p.m. UTC | #3
On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> Hi Daniel,
> 
> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > is arguably correct by design, as the QEMU <-> firmware interface is
> > supposed to be arbitrarily pluggable for any firmware implementation
> > not  limited to merely UEFI + seabios.
> 
> Indeed, I agree with this.
> 
> > 
> > > For now I suggest either reverting the original patch, or at least not
> > > enabling the knob by default for any machine types. In particular, when
> > > using MicroVM, the user must leave the knob disabled when direct booting
> > > a kernel on OVMF, and the user may or may not enable the knob when
> > > direct booting a kernel on SeaBIOS.
> > 
> > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > available automatically.
> 
> Yes, adding a knob is absolutely out of the question.
> 
> It also doesn't actually solve the problem: this triggers when QEMU
> passes a DTB too. It's not just for the new RNG seed thing. This bug
> isn't new.

In the other thread I also mentioned that this RNG Seed addition has
caused a bug with AMD SEV too, making boot measurement attestation
fail because the kernel blob passed to the firmware no longer matches
what the tenant expects, due to the injected seed.

With regards,
Daniel
Jason A. Donenfeld Aug. 4, 2022, 12:17 p.m. UTC | #4
Hi Ard,

On Thu, Aug 4, 2022 at 2:16 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> > > Hi Daniel,
> > >
> > > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > > > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > > > is arguably correct by design, as the QEMU <-> firmware interface is
> > > > supposed to be arbitrarily pluggable for any firmware implementation
> > > > not  limited to merely UEFI + seabios.
> > >
> > > Indeed, I agree with this.
> > >
> > > >
> > > > > For now I suggest either reverting the original patch, or at least not
> > > > > enabling the knob by default for any machine types. In particular, when
> > > > > using MicroVM, the user must leave the knob disabled when direct booting
> > > > > a kernel on OVMF, and the user may or may not enable the knob when
> > > > > direct booting a kernel on SeaBIOS.
> > > >
> > > > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > > > available automatically.
> > >
> > > Yes, adding a knob is absolutely out of the question.
> > >
> > > It also doesn't actually solve the problem: this triggers when QEMU
> > > passes a DTB too. It's not just for the new RNG seed thing. This bug
> > > isn't new.
> >
> > In the other thread I also mentioned that this RNG Seed addition has
> > caused a bug with AMD SEV too, making boot measurement attestation
> > fail because the kernel blob passed to the firmware no longer matches
> > what the tenant expects, due to the injected seed.
> >
>
> I was actually expecting this to be an issue in the
> signing/attestation department as well, and you just confirmed my
> suspicion.
>
> But does this mean that populating the setup_data pointer is out of
> the question altogether? Or only that putting the setup_data linked
> list nodes inside the image is a problem?

If you look at the v2 patch, populating boot_param->setup_data winds
up being a fixed value. So even if that part was a problem (though I
don't think it is), it won't be with the v2 patch, since it's always
the same.

Jason
Jason A. Donenfeld Aug. 4, 2022, 12:47 p.m. UTC | #5
On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Laszlo,
>
> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
> > None of the existing info passing methods seem early enough, generic
> > enough, and secure enough (at the same time)...
>
> Can you look at the v2 patch? It seems to work on every configuration I
> throw at it. Keep in mind that setup_data is only used very, very early.
> I can think of a few other places to put it too, looking at the x86
> memory map, that will survive long enough.
>
> I think this might actually be a straightforwardly solvable problem if
> you think about it more basically.

And just to put things in perspective here... We only need like 48
bytes or something at some easy fixed address. That's not much. That's
*got* to be a fairly tractable problem. If v2 has issues, I can't see
why there wouldn't be a different easy place to put a meger 48 bytes
of stuff that then is allowed to be wiped out after early boot.

Jason
Jason A. Donenfeld Aug. 4, 2022, 1:07 p.m. UTC | #6
Hi Daniel,

On Thu, Aug 4, 2022 at 2:54 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> > The boot parameter header refers to setup_data at an absolute address,
> > and each setup_data refers to the next setup_data at an absolute address
> > too. Currently QEMU simply puts the setup_datas right after the kernel
> > image, and since the kernel_image is loaded at prot_addr -- a fixed
> > address knowable to QEMU apriori -- the setup_data absolute address
> > winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> >
> > This mostly works fine, so long as the kernel image really is loaded at
> > prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> > generally EFI doesn't give a good way of predicting where it's going to
> > load the kernel. So when it loads it at some address != prot_addr, the
> > absolute addresses in setup_data now point somewhere bogus, causing
> > crashes when EFI stub tries to follow the next link.
> >
> > Fix this by placing setup_data at some fixed place in memory, relative
> > to real_addr, not as part of the kernel image, and then pointing the
> > setup_data absolute address to that fixed place in memory. This way,
> > even if OVMF or other chains relocate the kernel image, the boot
> > parameter still points to the correct absolute address.
> >
> > Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> > Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <richard.henderson@linaro.org>
> > Cc: Peter Maydell <peter.maydell@linaro.org>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>
> > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Cc: linux-efi@vger.kernel.org
> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> > ---
> >  hw/i386/x86.c | 38 ++++++++++++++++++++------------------
> >  1 file changed, 20 insertions(+), 18 deletions(-)
> >
> > diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> > index 050eedc0c8..8b853abf38 100644
> > --- a/hw/i386/x86.c
> > +++ b/hw/i386/x86.c
>
>
> >      if (!legacy_no_rng_seed) {
> > -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> > -        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
> > -        kernel = g_realloc(kernel, kernel_size);
> > -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> > +        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
> > +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> > +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
> >          setup_data->next = cpu_to_le64(first_setup_data);
> > -        first_setup_data = prot_addr + setup_data_offset;
> > +        first_setup_data = setup_data_base + setup_data_total_len;
> > +        setup_data_total_len += setup_data_item_len;
> >          setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
> >          setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
> >          qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
> >      }
> >
> > -    /* Offset 0x250 is a pointer to the first setup_data link. */
> > -    stq_p(header + 0x250, first_setup_data);
> > +    if (first_setup_data) {
> > +            /* Offset 0x250 is a pointer to the first setup_data link. */
> > +            stq_p(header + 0x250, first_setup_data);
> > +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> > +                         setup_data_base, NULL, NULL, NULL, NULL, false);
> > +    }
>
> The boot measurements with AMD SEV now succeed, but I'm a little
> worried about the implications of adding this ROM, when a few lines
> later here we're discarding the 'header' changes for AMD SEV. Is
> this still going to operate correctly in the guest OS if we've
> discarded the header changes below ?

I'll add a !sev_enabled() condition to that block too, so it also
skips adding the ROM, for v3.

Jason
Jason A. Donenfeld Aug. 4, 2022, 1:28 p.m. UTC | #7
On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote:
>
> On 08/04/22 14:16, Ard Biesheuvel wrote:
> > On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>
> >> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> >>> Hi Daniel,
> >>>
> >>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> >>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
> >>>> is arguably correct by design, as the QEMU <-> firmware interface is
> >>>> supposed to be arbitrarily pluggable for any firmware implementation
> >>>> not  limited to merely UEFI + seabios.
> >>>
> >>> Indeed, I agree with this.
> >>>
> >>>>
> >>>>> For now I suggest either reverting the original patch, or at least not
> >>>>> enabling the knob by default for any machine types. In particular, when
> >>>>> using MicroVM, the user must leave the knob disabled when direct booting
> >>>>> a kernel on OVMF, and the user may or may not enable the knob when
> >>>>> direct booting a kernel on SeaBIOS.
> >>>>
> >>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
> >>>> available automatically.
> >>>
> >>> Yes, adding a knob is absolutely out of the question.
> >>>
> >>> It also doesn't actually solve the problem: this triggers when QEMU
> >>> passes a DTB too. It's not just for the new RNG seed thing. This bug
> >>> isn't new.
> >>
> >> In the other thread I also mentioned that this RNG Seed addition has
> >> caused a bug with AMD SEV too, making boot measurement attestation
> >> fail because the kernel blob passed to the firmware no longer matches
> >> what the tenant expects, due to the injected seed.
> >>
> >
> > I was actually expecting this to be an issue in the
> > signing/attestation department as well, and you just confirmed my
> > suspicion.
> >
> > But does this mean that populating the setup_data pointer is out of
> > the question altogether? Or only that putting the setup_data linked
> > list nodes inside the image is a problem?
>
> QEMU already has to inject a whole bunch of stuff into confidential
> computing guests. The way it's done (IIRC) is that the non-compressed,
> trailing part of pflash (basically where the reset vector code lives
> too) is populated at OVMF build time with a chain of GUID-ed structures,
> and fields of those structures are filled in (at OVMF build time) from
> various fixed PCDs. The fixed PCDs in turn are populated from the FD
> files, using various MEMFD regions. When QEMU launches the guest, it can
> parse the GPAs from the on-disk pflash image (traversing the list of
> GUID-ed structs), at which addresses the guest firmware will then expect
> the various crypto artifacts to be injected.
>
> The point is that "who's in control" is reversed. The firmware exposes
> (at build time) at what GPAs it can accept data injections, and QEMU
> follows that. Of course the firmware ensures that nothing else in the
> firmware will try to own those GPAs.
>
> The only thing that needed to be hard-coded when this feature was
> introduced was the "entry point", that is, the flash offset at which
> QEMU starts the GUID-ed structure traversal.
>
> AMD and IBM developers can likely much better describe this mechanism,
> as I've not dealt with it in over a year. The QEMU side code is in
> "hw/i386/pc_sysfw_ovmf.c".
>
> We can make setup_data chaining work with OVMF, but the whole chain
> should be located in a GPA range that OVMF dictates.

It sounds like what you describe is pretty OVMF-specific though,
right? Do we want to tie things together so tightly like that?

Given we only need 48 bytes or so, isn't there a more subtle place we
could just throw this in ram that doesn't need such complex
coordination?

Jason
Jason A. Donenfeld Aug. 4, 2022, 10:56 p.m. UTC | #8
Hey Laszlo,

On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote:
> - do we want setup_data chaining work generally?
> 
> - or do we want only the random seed injection to stop crashing OVMF guests?

Preferably the first - generally. Which brings us to your point:
 
> > Given we only need 48 bytes or so, isn't there a more subtle place we
> > could just throw this in ram that doesn't need such complex
> > coordination?
> 
> These tricks add up and go wrong after a while. The pedantic
> reservations in the firmware have proved necessary.
> 
> IIUC, with v2, the setup_data_base address would (most frequently) be 96
> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
> not reserve away the area, UefiCpuPkg or other drivers might allocate an
> overlapping chunk, even if only temporarily. That might not break the
> firmware, but it could overwrite the random seed. 

Yea, so we don't want an address that something else might overwrite. So
my question is: isn't there some 48 bytes or so available in some low
address (or maybe a high one?) that is traditionally reserved for some
hardware function, and so software doesn't use it, but it turns out QEMU
doesn't use it for anything either, so we can get away placing it at
that address? It seems like there *ought* to be something like that. I
just don't (yet) know what it is...

Jason
Gerd Hoffmann Aug. 16, 2022, 8:55 a.m. UTC | #9
Hi,

> > We can make setup_data chaining work with OVMF, but the whole chain
> > should be located in a GPA range that OVMF dictates.
> 
> It sounds like what you describe is pretty OVMF-specific though,
> right? Do we want to tie things together so tightly like that?
> 
> Given we only need 48 bytes or so, isn't there a more subtle place we
> could just throw this in ram that doesn't need such complex
> coordination?

Joining the party late (and still catching up the thread).  Given we
don't need that anyway with EFI, only with legacy BIOS:  Can't that just
be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
those 48 bytes random seed?

take care,
  Gerd
Jason A. Donenfeld Aug. 18, 2022, 3:38 p.m. UTC | #10
Hey Gerd,

On Tue, Aug 16, 2022 at 10:55:11AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > > We can make setup_data chaining work with OVMF, but the whole chain
> > > should be located in a GPA range that OVMF dictates.
> > 
> > It sounds like what you describe is pretty OVMF-specific though,
> > right? Do we want to tie things together so tightly like that?
> > 
> > Given we only need 48 bytes or so, isn't there a more subtle place we
> > could just throw this in ram that doesn't need such complex
> > coordination?
> 
> Joining the party late (and still catching up the thread).  Given we
> don't need that anyway with EFI, only with legacy BIOS:  Can't that just
> be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
> those 48 bytes random seed?

Actually, I want this to work with EFI, very much so.

If our objective was to just not break EFI, the solution would be
simple: in the kernel we can have EFISTUB ignore the setup_data field
from the image, and then bump the boot header protocol number. If QEMU
sees the boot protocol number is below this one, then it won't set
setup_data. Done, fixed.

Except I think there's value in passing seeds even through with EFI.

Your option ROM idea is interesting; somebody mentioned that elsewhere
too I think. I'm wondering, though: do option ROMs still run when
EFI/OVMF is being used?

Jason
Gerd Hoffmann Aug. 19, 2022, 6:40 a.m. UTC | #11
On Thu, Aug 18, 2022 at 05:38:37PM +0200, Jason A. Donenfeld wrote:
> Hey Gerd,
> 
> > Joining the party late (and still catching up the thread).  Given we
> > don't need that anyway with EFI, only with legacy BIOS:  Can't that just
> > be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
> > those 48 bytes random seed?
> 
> Actually, I want this to work with EFI, very much so.

With EFI the kernel stub gets some random seed via EFI_RNG_PROTOCOL.
I can't see any good reason to derive from that.  It works no matter
how the kernel gets loaded.

OVMF ships with a driver for the virtio-rng device.  So just add that
to your virtual machine and seeding works fine ...

> If our objective was to just not break EFI, the solution would be
> simple: in the kernel we can have EFISTUB ignore the setup_data field
> from the image, and then bump the boot header protocol number. If QEMU
> sees the boot protocol number is below this one, then it won't set
> setup_data. Done, fixed.

As mentioned elsewhere in the thread patching in physical addresses on
qemu side isn't going to fly due to the different load methods we have.

> Your option ROM idea is interesting; somebody mentioned that elsewhere
> too I think.

Doing the setup_data patching in the option rom has the advantage that
it'll only happen with that specific load method being used.  Also the
option rom knows where it places stuff in memory so it is in a much
better position to find a good & non-conflicting place for the random
seed.  Also reserve/allocate memory if needed etc.

> I'm wondering, though: do option ROMs still run when
> EFI/OVMF is being used?

No, they are not used with EFI.  OVMF has a completely independent
implementation for direct kernel boot.

The options I see for EFI are:

  (1) Do nothing and continue to depend on virtio-rng.
  (2) Implement an efi driver which gets those 48 seed bytes from
      qemu by whatever means we'll define and hands them out via
      EFI_RNG_PROTOCOL.

take care,
  Gerd
Ard Biesheuvel Aug. 19, 2022, 7:16 a.m. UTC | #12
On Fri, 19 Aug 2022 at 08:41, Gerd Hoffmann <kraxel@redhat.com> wrote:
>
> On Thu, Aug 18, 2022 at 05:38:37PM +0200, Jason A. Donenfeld wrote:
> > Hey Gerd,
> >
> > > Joining the party late (and still catching up the thread).  Given we
> > > don't need that anyway with EFI, only with legacy BIOS:  Can't that just
> > > be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
> > > those 48 bytes random seed?
> >
> > Actually, I want this to work with EFI, very much so.

Even if we wire this up for EFI in some way, it will only affect
direct kernel boot using -kernel/-initrd etc, which is a niche use
case at best (AIUI libvirt uses it for the initial boot only)

I personally rely on it heavily for development, and I imagine others
might too, but that is hardly relevant here.

> With EFI the kernel stub gets some random seed via EFI_RNG_PROTOCOL.
> I can't see any good reason to derive from that.  It works no matter
> how the kernel gets loaded.
>
> OVMF ships with a driver for the virtio-rng device.  So just add that
> to your virtual machine and seeding works fine ...
>

... or we find other ways for the platform to speak to the OS, using
EFI protocols or other EFI methods.

Currently, the 'pure EFI' boot code is arch agnostic - it can be built
and run on any architecture that supports EFI boot. Adding Linux+x86
specific hacks to it is out of the question.

So that means that setup_data provided by QEMU will be consumed
directly by the kernel (when doing direct kernel boot only), using an
out of band channel that EFI/OVMF are completely unaware of, putting
it outside the scope of secure boot, measure boot, etc.

This is not acceptable to me.

> > If our objective was to just not break EFI, the solution would be
> > simple: in the kernel we can have EFISTUB ignore the setup_data field
> > from the image, and then bump the boot header protocol number. If QEMU
> > sees the boot protocol number is below this one, then it won't set
> > setup_data. Done, fixed.
>
> As mentioned elsewhere in the thread patching in physical addresses on
> qemu side isn't going to fly due to the different load methods we have.
>

And it conflates the file representation with the in-memory
representation, which I object to fundamentally - setup_data is part
of the file image, and becomes part of the in-memory representation
when it gets staged in memory for booting, which only happens in the
EFI stub when using pure EFI boot.

Using setup_data as a hidden comms channel between QEMU and the core
kernel breaks that distinction.

> > Your option ROM idea is interesting; somebody mentioned that elsewhere
> > too I think.
>
> Doing the setup_data patching in the option rom has the advantage that
> it'll only happen with that specific load method being used.  Also the
> option rom knows where it places stuff in memory so it is in a much
> better position to find a good & non-conflicting place for the random
> seed.  Also reserve/allocate memory if needed etc.
>

Exactly. This is the only sensible place to do this.

> > I'm wondering, though: do option ROMs still run when
> > EFI/OVMF is being used?
>
> No, they are not used with EFI.  OVMF has a completely independent
> implementation for direct kernel boot.
>
> The options I see for EFI are:
>
>   (1) Do nothing and continue to depend on virtio-rng.
>   (2) Implement an efi driver which gets those 48 seed bytes from
>       qemu by whatever means we'll define and hands them out via
>       EFI_RNG_PROTOCOL.
>

We could explore other options, but SETUP_RNG_SEED is fundamentally
incompatible with EFI boot (or any other boot method where the image
is treated as an opaque file by the firmware/loader), so that is not
an acceptable approach to me.
diff mbox series

Patch

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 050eedc0c8..8b853abf38 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -760,36 +760,36 @@  static bool load_elfboot(const char *kernel_filename,
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ENTRY, pvh_start_addr);
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, mh_load_addr);
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, elf_kernel_size);
 
     return true;
 }
 
 void x86_load_linux(X86MachineState *x86ms,
                     FWCfgState *fw_cfg,
                     int acpi_data_size,
                     bool pvh_enabled,
                     bool legacy_no_rng_seed)
 {
     bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
     uint16_t protocol;
     int setup_size, kernel_size, cmdline_size;
-    int dtb_size, setup_data_offset;
+    int dtb_size, setup_data_item_len, setup_data_total_len = 0;
     uint32_t initrd_max;
-    uint8_t header[8192], *setup, *kernel;
-    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
+    uint8_t header[8192], *setup, *kernel, *setup_datas = NULL;
+    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0, setup_data_base;
     FILE *f;
     char *vmode;
     MachineState *machine = MACHINE(x86ms);
     struct setup_data *setup_data;
     const char *kernel_filename = machine->kernel_filename;
     const char *initrd_filename = machine->initrd_filename;
     const char *dtb_filename = machine->dtb;
     const char *kernel_cmdline = machine->kernel_cmdline;
     SevKernelLoaderContext sev_load_ctx = {};
     enum { RNG_SEED_LENGTH = 32 };
 
     /* Align to 16 bytes as a paranoia measure */
     cmdline_size = (strlen(kernel_cmdline) + 16) & ~15;
 
     /* load the kernel header */
     f = fopen(kernel_filename, "rb");
@@ -886,32 +886,33 @@  void x86_load_linux(X86MachineState *x86ms,
     if (protocol < 0x200 || !(header[0x211] & 0x01)) {
         /* Low kernel */
         real_addr    = 0x90000;
         cmdline_addr = 0x9a000 - cmdline_size;
         prot_addr    = 0x10000;
     } else if (protocol < 0x202) {
         /* High but ancient kernel */
         real_addr    = 0x90000;
         cmdline_addr = 0x9a000 - cmdline_size;
         prot_addr    = 0x100000;
     } else {
         /* High and recent kernel */
         real_addr    = 0x10000;
         cmdline_addr = 0x20000;
         prot_addr    = 0x100000;
     }
+    setup_data_base = real_addr + 0x8000;
 
     /* highest address for loading the initrd */
     if (protocol >= 0x20c &&
         lduw_p(header + 0x236) & XLF_CAN_BE_LOADED_ABOVE_4G) {
         /*
          * Linux has supported initrd up to 4 GB for a very long time (2007,
          * long before XLF_CAN_BE_LOADED_ABOVE_4G which was added in 2013),
          * though it only sets initrd_max to 2 GB to "work around bootloader
          * bugs". Luckily, QEMU firmware(which does something like bootloader)
          * has supported this.
          *
          * It's believed that if XLF_CAN_BE_LOADED_ABOVE_4G is set, initrd can
          * be loaded into any address.
          *
          * In addition, initrd_max is uint32_t simply because QEMU doesn't
          * support the 64-bit boot protocol (specifically the ext_ramdisk_image
@@ -1049,60 +1050,61 @@  void x86_load_linux(X86MachineState *x86ms,
     fclose(f);
 
     /* append dtb to kernel */
     if (dtb_filename) {
         if (protocol < 0x209) {
             fprintf(stderr, "qemu: Linux kernel too old to load a dtb\n");
             exit(1);
         }
 
         dtb_size = get_image_size(dtb_filename);
         if (dtb_size <= 0) {
             fprintf(stderr, "qemu: error reading dtb %s: %s\n",
                     dtb_filename, strerror(errno));
             exit(1);
         }
 
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size;
-        kernel = g_realloc(kernel, kernel_size);
-
-
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + dtb_size;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = setup_data_base + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_DTB);
         setup_data->len = cpu_to_le32(dtb_size);
-
         load_image_size(dtb_filename, setup_data->data, dtb_size);
     }
 
     if (!legacy_no_rng_seed) {
-        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
-        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
-        kernel = g_realloc(kernel, kernel_size);
-        setup_data = (struct setup_data *)(kernel + setup_data_offset);
+        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
+        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
+        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
         setup_data->next = cpu_to_le64(first_setup_data);
-        first_setup_data = prot_addr + setup_data_offset;
+        first_setup_data = setup_data_base + setup_data_total_len;
+        setup_data_total_len += setup_data_item_len;
         setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
         setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
         qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
     }
 
-    /* Offset 0x250 is a pointer to the first setup_data link. */
-    stq_p(header + 0x250, first_setup_data);
+    if (first_setup_data) {
+            /* Offset 0x250 is a pointer to the first setup_data link. */
+            stq_p(header + 0x250, first_setup_data);
+            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
+                         setup_data_base, NULL, NULL, NULL, NULL, false);
+    }
 
     /*
      * If we're starting an encrypted VM, it will be OVMF based, which uses the
      * efi stub for booting and doesn't require any values to be placed in the
      * kernel header.  We therefore don't update the header so the hash of the
      * kernel on the other side of the fw_cfg interface matches the hash of the
      * file the user passed in.
      */
     if (!sev_enabled()) {
         memcpy(setup, header, MIN(sizeof(header), setup_size));
     }
 
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
     fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
     fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
     sev_load_ctx.kernel_data = (char *)kernel;