
[PATCHv7,10/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory

Message ID 20220614120231.48165-11-kirill.shutemov@linux.intel.com
State Superseded
Series mm, x86/cc: Implement support for unaccepted memory

Commit Message

Kirill A. Shutemov June 14, 2022, 12:02 p.m. UTC
load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
The unwanted loads are typically harmless. But, they might be made to
totally unrelated or even unmapped memory. load_unaligned_zeropad()
relies on exception fixup (#PF, #GP and now #VE) to recover from these
unwanted loads.
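
For reference, a rough conceptual sketch of the helper (the real
implementation in arch/x86/include/asm/word-at-a-time.h uses inline
assembly with an exception-table fixup rather than a plain C load):

	/* Conceptual sketch only, not the actual kernel code. */
	static inline unsigned long load_unaligned_zeropad(const void *addr)
	{
		unsigned long ret;

		/*
		 * Load a full word even when 'addr' sits within
		 * sizeof(long) bytes of a page boundary. If the access
		 * crosses into an unmapped page and faults, the fixup
		 * path rebuilds the word from the mapped bytes and
		 * zero-fills the rest -- hence "zeropad".
		 */
		ret = *(unsigned long *)addr;	/* may cross a page boundary */
		return ret;
	}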

But, this approach does not work for unaccepted memory. For TDX, a load
from unaccepted memory will not lead to a recoverable exception within
the guest. The guest will exit to the VMM where the only recourse is to
terminate the guest.

There are three parts to fix this issue and comprehensively avoid access
to unaccepted memory. Together these ensure that an extra “guard” page
is accepted in addition to the memory that needs to be used.

1. Implicitly extend the range_contains_unaccepted_memory(start, end)
   checks up to end+2M if ‘end’ is aligned on a 2M boundary.
2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
   aligned on a 2M boundary.
3. Set PageUnaccepted() on both memory that itself needs to be accepted
   *and* memory where the next page needs to be accepted. Essentially,
   make PageUnaccepted(page) a marker for whether work needs to be done
   to make ‘page’ usable. That work might include accepting pages in
   addition to ‘page’ itself.

Side note: This leads to something strange. Pages which were accepted
	   at boot, marked by the firmware as accepted and will never
	   _need_ to be accepted might have PageUnaccepted() set on
	   them. PageUnaccepted(page) is a cue to ensure that the next
	   page is accepted before ‘page’ can be used.
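
Put differently, the side note boils down to this consumer-side check (a
minimal sketch; PageUnaccepted() and accept_page() are introduced
elsewhere in this series, shown here for illustration only):

	/* On first use of a page handed out by the allocator: */
	if (PageUnaccepted(page))
		accept_page(page, order);	/* may also accept the *next* 2M chunk */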

This is an actual, real-world problem which was discovered during TDX
testing.
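
To make the boundary arithmetic concrete, here is a standalone userspace
sketch (not kernel code; PMD_SIZE is 2M on x86-64). Acceptance is tracked
in 2M chunks, so an over-read of a few bytes past 'end' is already
covered by the final chunk unless 'end' sits exactly on a chunk boundary
-- the only case where the next chunk must be pulled in:

	#include <stdio.h>

	#define PMD_SIZE (2UL << 20)	/* 2M: one bitmap bit per chunk */

	/* Mirrors the extension done in accept_memory() and
	 * range_contains_unaccepted_memory() in this patch. */
	static unsigned long extend_for_zeropad(unsigned long end)
	{
		if (!(end % PMD_SIZE))
			end += PMD_SIZE;
		return end;
	}

	int main(void)
	{
		/* 2M-aligned end: the guard chunk is added (prints 5) */
		printf("%lu\n", extend_for_zeropad(4 * PMD_SIZE) / PMD_SIZE);
		/* Unaligned end: left as-is (prints 4) */
		printf("%lu\n", extend_for_zeropad(4 * PMD_SIZE + 4096) / PMD_SIZE);
		return 0;
	}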

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/mm/unaccepted_memory.c         | 36 +++++++++++++++++++++++++
 drivers/firmware/efi/libstub/x86-stub.c |  7 +++++
 2 files changed, 43 insertions(+)

Comments

Dave Hansen June 23, 2022, 5:19 p.m. UTC | #1
On 6/14/22 05:02, Kirill A. Shutemov wrote:
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
> 
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.
> 
> There are three parts to fix this issue and comprehensively avoid access
> to unaccepted memory. Together these ensure that an extra “guard” page
> is accepted in addition to the memory that needs to be used.
> 
> 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
>    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
> 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
>    aligned on a 2M boundary.
> 3. Set PageUnaccepted() on both memory that itself needs to be accepted
>    *and* memory where the next page needs to be accepted. Essentially,
>    make PageUnaccepted(page) a marker for whether work needs to be done
>    to make ‘page’ usable. That work might include accepting pages in
>    addition to ‘page’ itself.
...

That all looks pretty good.

> diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
> index 1df918b21469..bcd56fe82b9e 100644
> --- a/arch/x86/mm/unaccepted_memory.c
> +++ b/arch/x86/mm/unaccepted_memory.c
> @@ -23,6 +23,38 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>  	bitmap = __va(boot_params.unaccepted_memory);
>  	range_start = start / PMD_SIZE;
>  
> +	/*
> +	 * load_unaligned_zeropad() can lead to unwanted loads across page
> +	 * boundaries. The unwanted loads are typically harmless. But, they
> +	 * might be made to totally unrelated or even unmapped memory.
> +	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
> +	 * #VE) to recover from these unwanted loads.
> +	 *
> +	 * But, this approach does not work for unaccepted memory. For TDX, a
> +	 * load from unaccepted memory will not lead to a recoverable exception
> +	 * within the guest. The guest will exit to the VMM where the only
> +	 * recourse is to terminate the guest.
> +	 *
> +	 * There are three parts to fix this issue and comprehensively avoid
> +	 * access to unaccepted memory. Together these ensure that an extra
> +	 * “guard” page is accepted in addition to the memory that needs to be
> +	 * used:
> +	 *
> +	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> +	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
> +	 *
> +	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
> +	 *    aligned on a 2M boundary.
> +	 *
> +	 * 3. Set PageUnaccepted() on both memory that itself needs to be
> +	 *    accepted *and* memory where the next page needs to be accepted.
> +	 *    Essentially, make PageUnaccepted(page) a marker for whether work
> +	 *    needs to be done to make ‘page’ usable. That work might include
> +	 *    accepting pages in addition to ‘page’ itself.
> +	 */

One nit with this: I'd much rather add one sentence to these to help tie
the code implementing it to this comment.  Maybe something like:

 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
 *    aligned on a 2M boundary. (immediately following this comment)


> +	if (!(end % PMD_SIZE))
> +		end += PMD_SIZE;
> +
>  	spin_lock_irqsave(&unaccepted_memory_lock, flags);
>  	for_each_set_bitrange_from(range_start, range_end, bitmap,
>  				   DIV_ROUND_UP(end, PMD_SIZE)) {
> @@ -46,6 +78,10 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
>  
>  	bitmap = __va(boot_params.unaccepted_memory);
>  
> +	/* See comment on load_unaligned_zeropad() in accept_memory() */
> +	if (!(end % PMD_SIZE))
> +		end += PMD_SIZE;

It's a wee bit hard to follow this back to the comment that it
references, even with them sitting next to each other in this diff.  How
about adding:

	/*
	 * Also consider the unaccepted state of the *next* page.  See
	 * fix #1 in the comment on load_unaligned_zeropad() in
	 * accept_memory().
	 */

>  	spin_lock_irqsave(&unaccepted_memory_lock, flags);
>  	while (start < end) {
>  		if (test_bit(start / PMD_SIZE, bitmap)) {
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index b91c89100b2d..bc1110509de4 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -709,6 +709,13 @@ static efi_status_t allocate_unaccepted_memory(struct boot_params *params,
>  		return EFI_SUCCESS;
>  	}
>  
> +	/*
> +	 * range_contains_unaccepted_memory() may need to check one 2M chunk
> +	 * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
> +	 * sure that the bitmap is large enough to handle it.
> +	 */
> +	max_addr += PMD_SIZE;

I guess the alternative to this would have been to record 'max_addr',
then special case 'max_addr'+2M in the bitmap checks.  I agree this is
probably nicer.

Also, the changelog needs to at least *mention* this little tidbit.  It
was a bit of a surprise when I got here.
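
For illustration, the sizing arithmetic the bump implies (a sketch
assuming one bitmap bit per 2M chunk, as the rest of the series does):

	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

	static unsigned long bitmap_bytes(unsigned long max_addr,
					  unsigned long pmd_size)
	{
		/* The += PMD_SIZE bump buys exactly one extra bit: the
		 * guard chunk the range check may look at. */
		unsigned long bits = DIV_ROUND_UP(max_addr + pmd_size, pmd_size);

		return DIV_ROUND_UP(bits, 8);
	}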

With those fixed:

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Borislav Petkov July 26, 2022, 10:21 a.m. UTC | #2
On Tue, Jun 14, 2022 at 03:02:27PM +0300, Kirill A. Shutemov wrote:
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.

FTR, this random-memory-access-to-unaccepted-memory-is-deadly thing is
really silly. We should be able to handle such cases - because they do
happen often - in a more resilient way. Just look at the complex dance
this patch needs to do just to avoid this.

IOW, this part of the coco technology needs improvement.

Just sayin...
Borislav Petkov July 26, 2022, 5:25 p.m. UTC | #3
On Tue, Jun 14, 2022 at 03:02:27PM +0300, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
> index 1df918b21469..bcd56fe82b9e 100644
> --- a/arch/x86/mm/unaccepted_memory.c
> +++ b/arch/x86/mm/unaccepted_memory.c
> @@ -23,6 +23,38 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>  	bitmap = __va(boot_params.unaccepted_memory);
>  	range_start = start / PMD_SIZE;
>  
> +	/*
> +	 * load_unaligned_zeropad() can lead to unwanted loads across page
> +	 * boundaries. The unwanted loads are typically harmless. But, they
> +	 * might be made to totally unrelated or even unmapped memory.
> +	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
> +	 * #VE) to recover from these unwanted loads.
> +	 *
> +	 * But, this approach does not work for unaccepted memory. For TDX, a
> +	 * load from unaccepted memory will not lead to a recoverable exception
> +	 * within the guest. The guest will exit to the VMM where the only
> +	 * recourse is to terminate the guest.
> +	 *
> +	 * There are three parts to fix this issue and comprehensively avoid
> +	 * access to unaccepted memory. Together these ensure that an extra
> +	 * “guard” page is accepted in addition to the memory that needs to be
> +	 * used:
> +	 *
> +	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> +	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
> +	 *
> +	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
> +	 *    aligned on a 2M boundary.

Why do we need those unicode quotes and backticks in there?

verify_diff: Warning: Unicode char [“] (0x8220 in line: +	 * “guard” page is accepted in addition to the memory that needs to be
verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    needs to be done to make ‘page’ usable. That work might include
verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    accepting pages in addition to ‘page’ itself.
Dave Hansen July 26, 2022, 5:46 p.m. UTC | #4
On 7/26/22 10:25, Borislav Petkov wrote:
> Why do we need those unicode quotes and backticks in there?
> 
> verify_diff: Warning: Unicode char [“] (0x8220 in line: +	 * “guard” page is accepted in addition to the memory that needs to be
> verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
> verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
> verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    needs to be done to make ‘page’ usable. That work might include
> verify_diff: Warning: Unicode char [‘] (0x8216 in line: +	 *    accepting pages in addition to ‘page’ itself.

I've been encouraging folks to stick their changelogs in a Google Doc
(or even Word) when they're writing them.  This gives them better
spelling and grammar checking than is available in most editors and also
makes it easier for folks to improve it collaboratively.  I find it a
lot more efficient than sending 10 copies back and forth in email.

The downside is that those fancy programs insert unicode willy-nilly for
stuff like this.  You usually need to catch it with scripts because it's
hard to spot visually.

It might make a good checkpatch addition, if it's not already there.
Andy Lutomirski July 26, 2022, 8:17 p.m. UTC | #5
On Tue, Jun 14, 2022, at 5:02 AM, Kirill A. Shutemov wrote:
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
>
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.

Why is unaccepted memory marked present in the direct map in the first place?

Having kernel code assume that every valid address is followed by several bytes of memory that may be read without side effects other than #PF also seems like a mistake, but I probably won’t win that fight. But sticking guard pages in front of definitely-not-logically present pages seems silly to me.  Let’s just not map it.

(What if MMIO memory is mapped next to regular memory?  Doing random unaligned reads that cross into MMIO seems unwise.)
Dave Hansen Aug. 2, 2022, 11:46 p.m. UTC | #6
On 7/26/22 03:21, Borislav Petkov wrote:
> On Tue, Jun 14, 2022 at 03:02:27PM +0300, Kirill A. Shutemov wrote:
>> But, this approach does not work for unaccepted memory. For TDX, a load
>> from unaccepted memory will not lead to a recoverable exception within
>> the guest. The guest will exit to the VMM where the only recourse is to
>> terminate the guest.
> FTR, this random-memory-access-to-unaccepted-memory-is-deadly thing is
> really silly. We should be able to handle such cases - because they do
> happen often - in a more resilient way. Just look at the complex dance
> this patch needs to do just to avoid this.
> 
> IOW, this part of the coco technology needs improvement.

This particular wound is self-inflicted.  The hardware can *today*
generate a #VE for these accesses.  But, to make writing the #VE code
more straightforward, we asked that the hardware not even bother
delivering the exception.  At the time, nobody could come up with a case
where there would ever be a legitimate, non-buggy access to unaccepted memory.

We learned about load_unaligned_zeropad() the hard way.  I never ran
into it and never knew it was there.  Dangit.

We _could_ go back to the way it was originally.  We could add
load_unaligned_zeropad() support to the #VE handler, and there's little
risk of load_unaligned_zeropad() itself being used in the
interrupts-disabled window early in the #VE handler.  That would get rid
of all the nasty adjacent page handling in the unaccepted memory code.

But, that would mean that we can land in the #VE handler from more
contexts.  Any normal, non-buggy use of load_unaligned_zeropad() can end
up there, obviously.  We would, for instance, need to be more careful
about #VE recursion.  We'd also have to make sure that _bugs_ that land
in the #VE handler can still be handled in a sane way.

To sum it all up, I'm not happy with the complexity of the page
acceptance code either but I'm not sure that it's a bad tradeoff compared
to greater #VE complexity or fragility.

Does anyone think we should go back and really reconsider this?
Dave Hansen Aug. 3, 2022, 2:02 p.m. UTC | #7
On 8/2/22 16:46, Dave Hansen wrote:
> To sum it all up, I'm not happy with the complexity of the page
> acceptance code either but I'm not sure that it's a bad tradeoff compared
> to greater #VE complexity or fragility.
> 
> Does anyone think we should go back and really reconsider this?

One other thing I remembered as I re-read my write up on this.

In the "new" mode, guests never get #VE's for unaccepted memory.  They
just exit to the host and can never be reentered.  They must be killed.

In the "old" mode, I _believe_ that the guest always gets a #VE for
non-EPT-present memory.  The #VE is basically the same no matter if the
page is unaccepted or if the host goes out and makes a
previously-accepted page non-present.

One really nasty implication of this "old" mode is that the host can
remove *accepted* pages that are used in the syscall gap.  That means
that the #VE handler would need to be of the paranoid variety which
opens up all kinds of other fun.

 * "Old" - #VE's can happen in the syscall gap
 * "New" - #VE's happen at better-defined times.  Unexpected ones are
   fatal.

There's a third option which I proposed but doesn't yet exist.  The TDX
module _could_ separate the behavior of unaccepted memory #VE's and
host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
impunity and handle it in the #VE handler.  At the same time, the host
would not be allowed to remove accepted memory and cause problems in the
syscall gap.  Kinda the best of both worlds.

But, I'm not sure how valuable that would be now that we have the
(admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.
Kirill A. Shutemov Aug. 9, 2022, 11:38 a.m. UTC | #8
On Tue, Jul 26, 2022 at 01:17:13PM -0700, Andy Lutomirski wrote:
> 
> 
> On Tue, Jun 14, 2022, at 5:02 AM, Kirill A. Shutemov wrote:
> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> > The unwanted loads are typically harmless. But, they might be made to
> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
> > unwanted loads.
> >
> > But, this approach does not work for unaccepted memory. For TDX, a load
> > from unaccepted memory will not lead to a recoverable exception within
> > the guest. The guest will exit to the VMM where the only recourse is to
> > terminate the guest.
> 
> Why is unaccepted memory marked present in the direct map in the first place?
> 
> Having kernel code assume that every valid address is followed by
> several bytes of memory that may be read without side effects other than
> #PF also seems like a mistake, but I probably won’t win that fight. But
> sticking guard pages in front of definitely-not-logically present pages
> seems silly to me.  Let’s just not map it.

It would mean no 1G pages in the direct mapping for TDX as we accept 2M
at a time.

> (What if MMIO memory is mapped next to regular memory?  Doing random
> unaligned reads that cross into MMIO seems unwise.)

MMIO is shared, not unaccepted private. We already handle the situation.
See 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad() page-cross to
a shared page").
Borislav Petkov Aug. 11, 2022, 11:26 a.m. UTC | #9
On Wed, Aug 03, 2022 at 07:02:31AM -0700, Dave Hansen wrote:
> One other thing I remembered as I re-read my write up on this.
> 
> In the "new" mode, guests never get #VE's for unaccepted memory.  They
> just exit to the host and can never be reentered.  They must be killed.

Yeah, this is the part which I think is really silly.

OSes, in their execution lifetime, can - erroneously or not, but it
happens pretty often in real life - touch some unrelated memory. And this
has never been a big deal - #PF, that's it.

But now they don't even get a chance to correct their mistake - VMEXIT,
die.

load_unaligned_zeropad() is just one case.

Imagine the user loads some buggy driver in the guest and that driver
starts doing stray memory accesses through a wild pointer into the
fields. Guest dies immediately.

Dunno, but it all feels a bit too harsh and unfriendly to me.

Sure, if that user is really unlucky, those stray accesses can kill
his OS on baremetal too. So maybe you could argue here that such stray
accesses are actually a good thing. :)

All I know is, there should be a more resilient way to handle those.

> In the "old" mode, I _believe_ that the guest always gets a #VE for
> non-EPT-present memory.  The #VE is basically the same no matter if the
> page is unaccepted or if the host goes out and makes a
> previously-accepted page non-present.
> 
> One really nasty implication of this "old" mode is that the host can
> remove *accepted* pages that are used in the syscall gap.  That means
> that the #VE handler would need to be of the paranoid variety which
> opens up all kinds of other fun.

Yeah, I believe this needs to be dealt with anyway, for SNP at least.
But on AMD it would simply cause an exception and it'll be handled in
the #VC thing. And there's some ugly code to deal with the gap too.

>  * "Old" - #VE's can happen in the syscall gap
>  * "New" - #VE's happen at better-defined times.  Unexpected ones are
>    fatal.
> 
> There's a third option which I proposed but doesn't yet exist.  The TDX
> module _could_ separate the behavior of unaccepted memory #VE's and
> host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
> impunity and handle it in the #VE handler.  At the same time, the host
> would not be allowed to remove accepted memory and cause problems in the
> syscall gap.  Kinda the best of both worlds.

I like that. This should've been the default from the get-go. Oh well,
what's it called in English, hindsight is 20/20...?

> But, I'm not sure how valuable that would be now that we have the
> (admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.

I think you should push for the bestest solution and one day we can kill
those ugly workarounds.

Thx.
Andy Lutomirski Aug. 13, 2022, 4:03 p.m. UTC | #10
On Tue, Aug 9, 2022, at 4:38 AM, Kirill A. Shutemov wrote:
> On Tue, Jul 26, 2022 at 01:17:13PM -0700, Andy Lutomirski wrote:
>> 
>> 
>> On Tue, Jun 14, 2022, at 5:02 AM, Kirill A. Shutemov wrote:
>> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
>> > The unwanted loads are typically harmless. But, they might be made to
>> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
>> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
>> > unwanted loads.
>> >
>> > But, this approach does not work for unaccepted memory. For TDX, a load
>> > from unaccepted memory will not lead to a recoverable exception within
>> > the guest. The guest will exit to the VMM where the only recourse is to
>> > terminate the guest.
>> 
>> Why is unaccepted memory marked present in the direct map in the first place?
>> 
>> Having kernel code assume that every valid address is followed by
>> several bytes of memory that may be read without side effects other than
>> #PF also seems like a mistake, but I probably won’t win that fight. But
>> sticking guard pages in front of definitely-not-logically present pages
>> seems silly to me.  Let’s just not map it.
>
> It would mean no 1G pages in the direct mapping for TDX as we accept
> 2M at a time.
>
>> (What if MMIO memory is mapped next to regular memory?  Doing random
>> unaligned reads that cross into MMIO seems unwise.)
>
> MMIO is shared, not unaccepted private. We already handle the situation.
> See 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad() page-cross to
> a shared page").
>

I don’t mean in a confidential guest — I mean generally. This whole model of “overrun the buffer — no big deal” is just fragile.

Andy Lutomirski Aug. 13, 2022, 4:04 p.m. UTC | #11
On Wed, Aug 3, 2022, at 7:02 AM, Dave Hansen wrote:
> On 8/2/22 16:46, Dave Hansen wrote:
>> To sum it all up, I'm not happy with the complexity of the page
>> acceptance code either but I'm not sure that it's a bad tradeoff compared
>> to greater #VE complexity or fragility.
>> 
>> Does anyone think we should go back and really reconsider this?
>
> One other thing I remembered as I re-read my write up on this.
>
> In the "new" mode, guests never get #VE's for unaccepted memory.  They
> just exit to the host and can never be reentered.  They must be killed.
>
> In the "old" mode, I _believe_ that the guest always gets a #VE for
> non-EPT-present memory.  The #VE is basically the same no matter if the
> page is unaccepted or if the host goes out and makes a
> previously-accepted page non-present.
>
> One really nasty implication of this "old" mode is that the host can
> remove *accepted* pages that are used in the syscall gap.  That means
> that the #VE handler would need to be of the paranoid variety which
> opens up all kinds of other fun.
>
>  * "Old" - #VE's can happen in the syscall gap
>  * "New" - #VE's happen at better-defined times.  Unexpected ones are
>    fatal.
>
> There's a third option which I proposed but doesn't yet exist.  The TDX
> module _could_ separate the behavior of unaccepted memory #VE's and
> host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
> impunity and handle it in the #VE handler.  At the same time, the host
> would not be allowed to remove accepted memory and cause problems in the
> syscall gap.  Kinda the best of both worlds.
>
> But, I'm not sure how valuable that would be now that we have the
> (admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.

How would that be implemented?  It would need to track which GPAs *were* accepted across a host-induced unmap/remap cycle. This would involve preventing the host from ever completely removing a secure EPT table without the guest’s help, right?

Admittedly this would IMO be better behavior. Is it practical to implement?
Andy Lutomirski Aug. 13, 2022, 4:11 p.m. UTC | #12
On Thu, Aug 11, 2022, at 4:26 AM, Borislav Petkov wrote:
> On Wed, Aug 03, 2022 at 07:02:31AM -0700, Dave Hansen wrote:
>> One other thing I remembered as I re-read my write up on this.
>> 
>> In the "new" mode, guests never get #VE's for unaccepted memory.  They
>> just exit to the host and can never be reentered.  They must be killed.
>
> Yeah, this is the part which I think is really silly.
>
> OSes, in their execution lifetime, can - erroneously or not, but it
> happens pretty often in real life - touch some unrelated memory. And this
> has never been a big deal - #PF, that's it.
>
> But now they don't even get a chance to correct their mistake - VMEXIT,
> die.
>
> load_unaligned_zeropad() is just one case.
>
> Imagine the user loads some buggy driver in the guest and that driver
> starts doing stray memory accesses through a wild pointer into the
> fields. Guest dies immediately.
>
> Dunno, but it all feels a bit too harsh and unfriendly to me.
>
> Sure, if that user is really unlucky, those stray accesses can kill
> his OS on baremetal too. So maybe you could argue here that such stray
> accesses are actually a good thing. :)
>
> All I know is, there should be a more resilient way to handle those.
>
>> In the "old" mode, I _believe_ that the guest always gets a #VE for
>> non-EPT-present memory.  The #VE is basically the same no matter if the
>> page is unaccepted or if the host goes out and makes a
>> previously-accepted page non-present.
>> 
>> One really nasty implication of this "old" mode is that the host can
>> remove *accepted* pages that are used in the syscall gap.  That means
>> that the #VE handler would need to be of the paranoid variety which
>> opens up all kinds of other fun.
>
> Yeah, I believe this needs to be dealt with anyway, for SNP at least.
> But on AMD it would simply cause an exception and it'll be handled in
> the #VC thing. And there's some ugly code to deal with the gap too.

I do not believe for a second that the “ugly” code in question is correct. Let’s please not go there for TDX.  The whole point of this thing is security — I would rather see a somewhat fragile situation than an exploit.

Now if the TD module could deliver an unrecoverable #MC instead of an impossible-to-handle #VE, maybe we could at least get a nice debug trace out?  Of course it’s not so easy to do anything with a debug trace that doesn’t break confidentiality.

So, Dave (and other Intel folks), here’s a different feature I think we will want: a secure way to get debug logs out of a TDX guest. Maybe the guest could have a way to send syslog lines to the TD module (via hypercall or a ring buffer) and the TD module could encrypt them against a secret that is only accessible to future boots of the same guest or to an attested outside key.  TDX already has an exceedingly flexible attestation mechanism — there is surely room for the guest config to include a public key that is used for encryption of exported logs.

(Yes, this is sort of doable in software, but the part where any kind of buffer actually gets saved anywhere if the guest goes kaboom requires either help from the TD module or a defined protocol with key negotiation involving the untrusted host.)

The model where the guest leaves a panic on the screen or serial console won’t work in TDX…

>
>>  * "Old" - #VE's can happen in the syscall gap
>>  * "New" - #VE's happen at better-defined times.  Unexpected ones are
>>    fatal.
>> 
>> There's a third option which I proposed but doesn't yet exist.  The TDX
>> module _could_ separate the behavior of unaccepted memory #VE's and
>> host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
>> impunity and handle it in the #VE handler.  At the same time, the host
>> would not be allowed to remove accepted memory and cause problems in the
>> syscall gap.  Kinda the best of both worlds.
>
> I like that. This should've been the default from the get-go. Oh well,
> what's it called in English, hindsight is 20/20...?
>
>> But, I'm not sure how valuable that would be now that we have the
>> (admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.
>
> I think you should push for the bestest solution and one day we can kill
> those ugly workarounds.
>
> Thx.
Kirill A. Shutemov Aug. 13, 2022, 8:58 p.m. UTC | #13
On Sat, Aug 13, 2022 at 09:04:58AM -0700, Andy Lutomirski wrote:
> 
> 
> On Wed, Aug 3, 2022, at 7:02 AM, Dave Hansen wrote:
> > On 8/2/22 16:46, Dave Hansen wrote:
> >> To sum it all up, I'm not happy with the complexity of the page
> >> acceptance code either but I'm not sure that it's a bad tradeoff compared
> >> to greater #VE complexity or fragility.
> >> 
> >> Does anyone think we should go back and really reconsider this?
> >
> > One other thing I remembered as I re-read my write up on this.
> >
> > In the "new" mode, guests never get #VE's for unaccepted memory.  They
> > just exit to the host and can never be reentered.  They must be killed.
> >
> > In the "old" mode, I _believe_ that the guest always gets a #VE for
> > non-EPT-present memory.  The #VE is basically the same no matter if the
> > page is unaccepted or if the host goes out and makes a
> > previously-accepted page non-present.
> >
> > One really nasty implication of this "old" mode is that the host can
> > remove *accepted* pages that are used in the syscall gap.  That means
> > that the #VE handler would need to be of the paranoid variety which
> > opens up all kinds of other fun.
> >
> >  * "Old" - #VE's can happen in the syscall gap
> >  * "New" - #VE's happen at better-defined times.  Unexpected ones are
> >    fatal.
> >
> > There's a third option which I proposed but doesn't yet exist.  The TDX
> > module _could_ separate the behavior of unaccepted memory #VE's and
> > host-induced #VEs.  This way, we could use load_unaligned_zeropad() with
> > impunity and handle it in the #VE handler.  At the same time, the host
> > would not be allowed to remove accepted memory and cause problems in the
> > syscall gap.  Kinda the best of both worlds.
> >
> > But, I'm not sure how valuable that would be now that we have the
> > (admittedly squirrelly) code to avoid load_unaligned_zeropad() #VE's.
> 
> How would that be implemented?  It would need to track which GPAs *were*
> accepted across a host-induced unmap/remap cycle. This would involve
> preventing the host from ever completely removing a secure EPT table
> without the guest’s help, right?
> 
> Admittedly this would IMO be better behavior. Is it practical to implement?

I don't think it is better if you look at it from the host's POV. It owns
the resources of the machine and it has to have a way to pull memory from
an uncooperative TD at any point.

It also would require a more complicated private->shared conversion, as
the guest has to notify the TDX module about the change, not only the
host as we do now.
Kirill A. Shutemov Aug. 13, 2022, 9:02 p.m. UTC | #14
On Sat, Aug 13, 2022 at 09:03:13AM -0700, Andy Lutomirski wrote:
> 
> 
> On Tue, Aug 9, 2022, at 4:38 AM, Kirill A. Shutemov wrote:
> > On Tue, Jul 26, 2022 at 01:17:13PM -0700, Andy Lutomirski wrote:
> >> 
> >> 
> >> On Tue, Jun 14, 2022, at 5:02 AM, Kirill A. Shutemov wrote:
> >> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> >> > The unwanted loads are typically harmless. But, they might be made to
> >> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
> >> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
> >> > unwanted loads.
> >> >
> >> > But, this approach does not work for unaccepted memory. For TDX, a load
> >> > from unaccepted memory will not lead to a recoverable exception within
> >> > the guest. The guest will exit to the VMM where the only recourse is to
> >> > terminate the guest.
> >> 
> >> Why is unaccepted memory marked present in the direct map in the first place?
> >> 
> >> Having kernel code assume that every valid address is followed by
> >> several bytes of memory that may be read without side effects other than
> >> #PF also seems like a mistake, but I probably won’t win that fight. But
> >> sticking guard pages in front of definitely-not-logically present pages
> >> seems silly to me.  Let’s just not map it.
> >
> > It would mean no 1G pages in the direct mapping for TDX as we accept
> > 2M at a time.

As of now, we don't have a way to recover the direct mapping from
fragmentation. So once we split 1G to 2M, it stays this way.

> >> (What if MMIO memory is mapped next to regular memory?  Doing random
> >> unaligned reads that cross into MMIO seems unwise.)
> >
> > MMIO is shared, not unaccepted private. We already handle the situation.
> > See 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad() page-cross to
> > a shared page").
> >
> 
> I don’t mean in a confidential guest — I mean generally. This whole
> model of “overrun the buffer — no big deal” is just fragile.

If you want to remove load_unaligned_zeropad(), I would not object. It can
make life easier.

I presumed that the optimization it brings has a measurable benefit
(otherwise, why bother).
Kirill A. Shutemov Aug. 13, 2022, 9:13 p.m. UTC | #15
On Sat, Aug 13, 2022 at 09:11:52AM -0700, Andy Lutomirski wrote:
> Now if the TD module could deliver an unrecoverable #MC instead of an
> impossible-to-handle #VE, maybe we could at least get a nice debug trace
> out?  Of course it’s not so easy to do anything with a debug trace that
> doesn’t break confidentiality.

It is not an impossible-to-handle #VE; there is no #VE for the guest, just
an exit to the host that cannot be recovered. Yes, it is not friendly for
debugging.

Our plan was to allow SEPT_VE_DISABLE=0 for a debug TD. It helps with
debugging stepping on unaccepted memory, as it allows #VE in the guest,
which leads to panic() and a nice traceback.

Would it be enough?

Patch

diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 1df918b21469..bcd56fe82b9e 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -23,6 +23,38 @@  void accept_memory(phys_addr_t start, phys_addr_t end)
 	bitmap = __va(boot_params.unaccepted_memory);
 	range_start = start / PMD_SIZE;
 
+	/*
+	 * load_unaligned_zeropad() can lead to unwanted loads across page
+	 * boundaries. The unwanted loads are typically harmless. But, they
+	 * might be made to totally unrelated or even unmapped memory.
+	 * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
+	 * #VE) to recover from these unwanted loads.
+	 *
+	 * But, this approach does not work for unaccepted memory. For TDX, a
+	 * load from unaccepted memory will not lead to a recoverable exception
+	 * within the guest. The guest will exit to the VMM where the only
+	 * recourse is to terminate the guest.
+	 *
+	 * There are three parts to fix this issue and comprehensively avoid
+	 * access to unaccepted memory. Together these ensure that an extra
+	 * “guard” page is accepted in addition to the memory that needs to be
+	 * used:
+	 *
+	 * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
+	 *    checks up to end+2M if ‘end’ is aligned on a 2M boundary.
+	 *
+	 * 2. Implicitly extend accept_memory(start, end) to end+2M if ‘end’ is
+	 *    aligned on a 2M boundary.
+	 *
+	 * 3. Set PageUnaccepted() on both memory that itself needs to be
+	 *    accepted *and* memory where the next page needs to be accepted.
+	 *    Essentially, make PageUnaccepted(page) a marker for whether work
+	 *    needs to be done to make ‘page’ usable. That work might include
+	 *    accepting pages in addition to ‘page’ itself.
+	 */
+	if (!(end % PMD_SIZE))
+		end += PMD_SIZE;
+
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
 	for_each_set_bitrange_from(range_start, range_end, bitmap,
 				   DIV_ROUND_UP(end, PMD_SIZE)) {
@@ -46,6 +78,10 @@  bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
 
 	bitmap = __va(boot_params.unaccepted_memory);
 
+	/* See comment on load_unaligned_zeropad() in accept_memory() */
+	if (!(end % PMD_SIZE))
+		end += PMD_SIZE;
+
 	spin_lock_irqsave(&unaccepted_memory_lock, flags);
 	while (start < end) {
 		if (test_bit(start / PMD_SIZE, bitmap)) {
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index b91c89100b2d..bc1110509de4 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -709,6 +709,13 @@  static efi_status_t allocate_unaccepted_memory(struct boot_params *params,
 		return EFI_SUCCESS;
 	}
 
+	/*
+	 * range_contains_unaccepted_memory() may need to check one 2M chunk
+	 * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
+	 * sure that the bitmap is large enough to handle it.
+	 */
+	max_addr += PMD_SIZE;
+
 	/*
 	 * If unaccepted memory is present allocate a bitmap to track what
 	 * memory has to be accepted before access.