mbox series

[PATCHv7,00/14] mm, x86/cc: Implement support for unaccepted memory

Message ID 20220614120231.48165-1-kirill.shutemov@linux.intel.com
Headers show
Series mm, x86/cc: Implement support for unaccepted memory | expand

Message

Kirill A. Shutemov June 14, 2022, 12:02 p.m. UTC
UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, requiring memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific for the Virtual
Machine platform.

Accepting memory is costly and it makes VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until memory is needed. It lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. It leads
to table fragmentation, but there's a limited number of entries in the
e820 table

Another option is to mark such memory as usable in e820 and track if the
range has been accepted in a bitmap. One bit in the bitmap represents
2MiB in the address space: one 4k page is enough to track 64GiB or
physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- It needs 256MiB to handle 4PiB of the address
space.

Any unaccepted memory that is not aligned to 2M gets accepted upfront.

The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for 4G TDX VM and ~4x faster for 64G.

TDX-specific code isolated from the core of unaccepted memory support. It
supposed to help to plug-in different implementation of unaccepted memory
such as SEV-SNP.

The tree can be found here:

https://github.com/intel/tdx.git guest-unaccepted-memory

v7:
 - Rework meminfo counter to use PageUnaccepted() and move to generic code;
 - Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
 - Add Reviewed-by from David;
v6:
 - Fix load_unaligned_zeropad() on machine with unaccepted memory;
 - Clear PageUnaccepted() on merged pages, leaving it only on head;
 - Clarify error handling in allocate_e820();
 - Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
 - Disable kexec at boottime instead of build conflict;
 - Rebased to tip/master;
 - Spelling fixes;
 - Add Reviewed-by from Mike and David;
v5:
 - Updates comments and commit messages;
   + Explain options for unaccepted memory handling;
 - Expose amount of unaccepted memory in /proc/meminfo
 - Adjust check in page_expected_state();
 - Fix error code handling in allocate_e820();
 - Centralize __pa()/__va() definitions in the boot stub;
 - Avoid includes from the main kernel in the boot stub;
 - Use an existing hole in boot_param for unaccepted_memory, instead of adding
   to the end of the structure;
 - Extract allocate_unaccepted_memory() form allocate_e820();
 - Complain if there's unaccepted memory, but kernel does not support it;
 - Fix vmstat counter;
 - Split up few preparatory patches;
 - Random readability adjustments;
v4:
 - PageBuddyUnaccepted() -> PageUnaccepted;
 - Use separate page_type, not shared with offline;
 - Rework interface between core-mm and arch code;
 - Adjust commit messages;
 - Ack from Mike;

Kirill A. Shutemov (14):
  x86/boot: Centralize __pa()/__va() definitions
  mm: Add support for unaccepted memory
  mm: Report unaccepted memory in meminfo
  efi/x86: Get full memory map in allocate_e820()
  x86/boot: Add infrastructure required for unaccepted memory support
  efi/x86: Implement support for unaccepted memory
  x86/boot/compressed: Handle unaccepted memory
  x86/mm: Reserve unaccepted memory bitmap
  x86/mm: Provide helpers for unaccepted memory
  x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  x86: Disable kexec if system has unaccepted memory
  x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
    boot stub
  x86/tdx: Refactor try_accept_one()
  x86/tdx: Add unaccepted memory support

 Documentation/x86/zero-page.rst          |   1 +
 arch/x86/Kconfig                         |   1 +
 arch/x86/boot/bitops.h                   |  40 ++++++++
 arch/x86/boot/compressed/Makefile        |   1 +
 arch/x86/boot/compressed/align.h         |  14 +++
 arch/x86/boot/compressed/bitmap.c        |  43 ++++++++
 arch/x86/boot/compressed/bitmap.h        |  49 +++++++++
 arch/x86/boot/compressed/bits.h          |  36 +++++++
 arch/x86/boot/compressed/compiler.h      |   9 ++
 arch/x86/boot/compressed/efi.h           |   1 +
 arch/x86/boot/compressed/find.c          |  54 ++++++++++
 arch/x86/boot/compressed/find.h          |  80 +++++++++++++++
 arch/x86/boot/compressed/ident_map_64.c  |   8 --
 arch/x86/boot/compressed/kaslr.c         |  35 ++++---
 arch/x86/boot/compressed/math.h          |  37 +++++++
 arch/x86/boot/compressed/mem.c           | 111 ++++++++++++++++++++
 arch/x86/boot/compressed/minmax.h        |  61 +++++++++++
 arch/x86/boot/compressed/misc.c          |   6 ++
 arch/x86/boot/compressed/misc.h          |  15 +++
 arch/x86/boot/compressed/pgtable_types.h |  25 +++++
 arch/x86/boot/compressed/sev.c           |   2 -
 arch/x86/boot/compressed/tdx.c           |  78 ++++++++++++++
 arch/x86/coco/tdx/tdx.c                  |  94 ++++++++---------
 arch/x86/include/asm/page.h              |   3 +
 arch/x86/include/asm/shared/tdx.h        |  47 +++++++++
 arch/x86/include/asm/tdx.h               |  19 ----
 arch/x86/include/asm/unaccepted_memory.h |  16 +++
 arch/x86/include/uapi/asm/bootparam.h    |   2 +-
 arch/x86/kernel/e820.c                   |  10 ++
 arch/x86/mm/Makefile                     |   2 +
 arch/x86/mm/unaccepted_memory.c          | 123 +++++++++++++++++++++++
 drivers/base/node.c                      |   7 ++
 drivers/firmware/efi/Kconfig             |  14 +++
 drivers/firmware/efi/efi.c               |   1 +
 drivers/firmware/efi/libstub/x86-stub.c  | 103 ++++++++++++++++---
 fs/proc/meminfo.c                        |   5 +
 include/linux/efi.h                      |   3 +-
 include/linux/mmzone.h                   |   1 +
 include/linux/page-flags.h               |  31 ++++++
 mm/internal.h                            |  12 +++
 mm/memblock.c                            |   9 ++
 mm/page_alloc.c                          |  96 +++++++++++++++++-
 mm/vmstat.c                              |   1 +
 43 files changed, 1191 insertions(+), 115 deletions(-)
 create mode 100644 arch/x86/boot/compressed/align.h
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/bitmap.h
 create mode 100644 arch/x86/boot/compressed/bits.h
 create mode 100644 arch/x86/boot/compressed/compiler.h
 create mode 100644 arch/x86/boot/compressed/find.c
 create mode 100644 arch/x86/boot/compressed/find.h
 create mode 100644 arch/x86/boot/compressed/math.h
 create mode 100644 arch/x86/boot/compressed/mem.c
 create mode 100644 arch/x86/boot/compressed/minmax.h
 create mode 100644 arch/x86/boot/compressed/pgtable_types.h
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h
 create mode 100644 arch/x86/mm/unaccepted_memory.c

Comments

Gupta, Pankaj June 14, 2022, 12:57 p.m. UTC | #1
On 6/14/2022 2:02 PM, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> There are several ways kernel can deal with unaccepted memory:
> 
>   1. Accept all the memory during the boot. It is easy to implement and
>      it doesn't have runtime cost once the system is booted. The downside
>      is very long boot time.
> 
>      Accept can be parallelized to multiple CPUs to keep it manageable
>      (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>      memory bandwidth and does not scale beyond the point.
> 
>   2. Accept a block of memory on the first use. It requires more
>      infrastructure and changes in page allocator to make it work, but
>      it provides good boot time.
> 
>      On-demand memory accept means latency spikes every time kernel steps
>      onto a new memory block. The spikes will go away once workload data
>      set size gets stabilized or all memory gets accepted.
> 
>   3. Accept all memory in background. Introduce a thread (or multiple)
>      that gets memory accepted proactively. It will minimize time the
>      system experience latency spikes on memory allocation while keeping
>      low boot time.
> 
>      This approach cannot function on its own. It is an extension of #2:
>      background memory acceptance requires functional scheduler, but the
>      page allocator may need to tap into unaccepted memory before that.
> 
>      The downside of the approach is that these threads also steal CPU
>      cycles and memory bandwidth from the user's workload and may hurt
>      user experience.
> 
> Implement #2 for now. It is a reasonable default. Some workloads may
> want to use #1 or #3 and they can be implemented later based on user's
> demands.
> 
> Support of unaccepted memory requires a few changes in core-mm code:
> 
>    - memblock has to accept memory on allocation;
> 
>    - page allocator has to accept memory on the first allocation of the
>      page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages on the first allocation.
> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
> used to indicate that the page requires acceptance.
> 
> Architecture has to provide two helpers if it wants to support
> unaccepted memory:
> 
>   - accept_memory() makes a range of physical addresses accepted.
> 
>   - range_contains_unaccepted_memory() checks anything within the range
>     of physical addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>   include/linux/page-flags.h | 31 +++++++++++++
>   mm/internal.h              | 12 +++++
>   mm/memblock.c              |  9 ++++
>   mm/page_alloc.c            | 89 +++++++++++++++++++++++++++++++++++++-
>   4 files changed, 139 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e66f7aa3191d..444ba8f4bfa0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -942,6 +942,14 @@ static inline bool is_page_hwpoison(struct page *page)
>   #define PG_offline	0x00000100
>   #define PG_table	0x00000200
>   #define PG_guard	0x00000400
> +#define PG_unaccepted	0x00000800
> +
> +/*
> + * Page types allowed at page_expected_state()
> + *
> + * PageUnaccepted() will get cleared in post_alloc_hook().
> + */
> +#define PAGE_TYPES_EXPECTED	PG_unaccepted
>   
>   #define PageType(page, flag)						\
>   	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -967,6 +975,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
>   	page->page_type |= PG_##lname;					\
>   }
>   
> +#define PAGE_TYPE_OPS_FALSE(uname)					\
> +static __always_inline int Page##uname(struct page *page)		\
> +{									\
> +	return false;							\
> +}									\
> +static __always_inline void __SetPage##uname(struct page *page)		\
> +{									\
> +}									\
> +static __always_inline void __ClearPage##uname(struct page *page)	\
> +{									\
> +}
> +
>   /*
>    * PageBuddy() indicates that the page is free and in the buddy system
>    * (see mm/page_alloc.c).
> @@ -997,6 +1017,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
>    */
>   PAGE_TYPE_OPS(Offline, offline)
>   
> +/*
> + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> + * be read or written. The page allocator must call accept_page() before
> + * touching the page or returning it to the caller.
> + */
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> +#else
> +PAGE_TYPE_OPS_FALSE(Unaccepted)
> +#endif
> +
>   extern void page_offline_freeze(void);
>   extern void page_offline_thaw(void);
>   extern void page_offline_begin(void);
> diff --git a/mm/internal.h b/mm/internal.h
> index c0f8fbe0445b..7c1a653edfc8 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -861,4 +861,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
>   
>   DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>   
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> +						    phys_addr_t end)
> +{
> +	return false;
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}
> +#endif
> +
>   #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index e4f03a6e8e56..a1f7f8b304d5 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>   		 */
>   		kmemleak_alloc_phys(found, size, 0, 0);
>   
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);
> +
>   	return found;
>   }
>   
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e008a3df0485..279c2746aaa8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -122,6 +122,12 @@ typedef int __bitwise fpi_t;
>    */
>   #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
>   
> +/*
> + * Check if the page needs to be marked as PageUnaccepted().
> + * Used for the new pages added to the buddy allocator for the first time.
> + */
> +#define FPI_UNACCEPTED_SLOWPATH	((__force fpi_t)BIT(3))
> +
>   /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>   static DEFINE_MUTEX(pcp_batch_high_lock);
>   #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -993,6 +999,35 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>   			NULL) != NULL;
>   }
>   
> +/*
> + * Page acceptance can be very slow. Do not call under critical locks.
> + */
> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	int i;
> +
> +	if (!PageUnaccepted(page))
> +		return;
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +	__ClearPageUnaccepted(page);
> +
> +	/* Assert that there is no PageUnaccepted() on tail pages */
> +	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> +		for (i = 1; i < (1 << order); i++)
> +			VM_BUG_ON_PAGE(PageUnaccepted(page + i), page + i);
> +	}
> +}
> +
> +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	phys_addr_t end = start + (PAGE_SIZE << order);
> +
> +	return range_contains_unaccepted_memory(start, end);
> +}
> +
>   /*
>    * Freeing function for a buddy system allocator.
>    *
> @@ -1027,6 +1062,7 @@ static inline void __free_one_page(struct page *page,
>   	unsigned long combined_pfn;
>   	struct page *buddy;
>   	bool to_tail;
> +	bool page_needs_acceptance = false;
>   
>   	VM_BUG_ON(!zone_is_initialized(zone));
>   	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1038,6 +1074,11 @@ static inline void __free_one_page(struct page *page,
>   	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>   	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>   
> +	if (PageUnaccepted(page)) {
> +		page_needs_acceptance = true;
> +		__ClearPageUnaccepted(page);
> +	}
> +
>   	while (order < MAX_ORDER - 1) {
>   		if (compaction_capture(capc, page, order, migratetype)) {
>   			__mod_zone_freepage_state(zone, -(1 << order),
> @@ -1072,6 +1113,13 @@ static inline void __free_one_page(struct page *page,
>   			clear_page_guard(zone, buddy, order, migratetype);
>   		else
>   			del_page_from_free_list(buddy, zone, order);
> +
> +		/* Mark page unaccepted if any of merged pages were unaccepted */
> +		if (PageUnaccepted(buddy)) {
> +			page_needs_acceptance = true;
> +			__ClearPageUnaccepted(buddy);
> +		}
> +
>   		combined_pfn = buddy_pfn & pfn;
>   		page = page + (combined_pfn - pfn);
>   		pfn = combined_pfn;
> @@ -1081,6 +1129,23 @@ static inline void __free_one_page(struct page *page,
>   done_merging:
>   	set_buddy_order(page, order);
>   
> +	/*
> +	 * The page gets marked as PageUnaccepted() if any of merged-in pages
> +	 * is PageUnaccepted().
> +	 *
> +	 * New pages, just being added to buddy allocator, do not have
> +	 * PageUnaccepted() set. FPI_UNACCEPTED_SLOWPATH indicates that the
> +	 * page is new and page_contains_unaccepted() check is required to
> +	 * determinate if acceptance is required.
> +	 *
> +	 * Avoid calling page_contains_unaccepted() if it is known that the page
> +	 * needs acceptance. It can be costly.
> +	 */
> +	if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH))
> +		page_needs_acceptance = page_contains_unaccepted(page, order);
> +	if (page_needs_acceptance)
> +		__SetPageUnaccepted(page);
> +
>   	if (fpi_flags & FPI_TO_TAIL)
>   		to_tail = true;
>   	else if (is_shuffle_order(order))
> @@ -1164,7 +1229,13 @@ int split_free_page(struct page *free_page,
>   static inline bool page_expected_state(struct page *page,
>   					unsigned long check_flags)
>   {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	/*
> +	 * The page must not be mapped to userspace and must not have
> +	 * a PageType other than listed in PAGE_TYPES_EXPECTED.
> +	 *
> +	 * Note, bit cleared means the page type is set.
> +	 */
> +	if (unlikely((atomic_read(&page->_mapcount) | PAGE_TYPES_EXPECTED) != -1))
>   		return false;
>   
>   	if (unlikely((unsigned long)page->mapping |
> @@ -1669,7 +1740,9 @@ void __free_pages_core(struct page *page, unsigned int order)
>   	 * Bypass PCP and place fresh pages right to the tail, primarily
>   	 * relevant for memory onlining.
>   	 */
> -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +	__free_pages_ok(page, order,
> +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON |
> +			FPI_UNACCEPTED_SLOWPATH);
>   }
>   
>   #ifdef CONFIG_NUMA
> @@ -1822,6 +1895,9 @@ static void __init deferred_free_range(unsigned long pfn,
>   		return;
>   	}
>   
> +	/* Accept chunks smaller than page-block upfront */
> +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> +
>   	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>   		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>   			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -2281,6 +2357,13 @@ static inline void expand(struct zone *zone, struct page *page,
>   		if (set_page_guard(zone, &page[size], high, migratetype))
>   			continue;
>   
> +		/*
> +		 * Transfer PageUnaccepted() to the newly split pages so
> +		 * they can be accepted after dropping the zone lock.
> +		 */
> +		if (PageUnaccepted(page))
> +			__SetPageUnaccepted(&page[size]);
> +
>   		add_to_free_list(&page[size], zone, high, migratetype);
>   		set_buddy_order(&page[size], high);
>   	}
> @@ -2411,6 +2494,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>   	 */
>   	kernel_unpoison_pages(page, 1 << order);
>   
> +	accept_page(page, order);
> +
>   	/*
>   	 * As memory initialization might be integrated into KASAN,
>   	 * KASAN unpoisoning and memory initializion code must be

Reviewed previous version(v6) of this patch. Seems no change in this 
version except '!PageUnaccepted' check addition in function 
'accept_page' and rename of function 'page_contains_unaccepted'.

Acked-by: Pankaj Gupta <pankaj.gupta@amd.com>
Peter Zijlstra June 15, 2022, 10:19 a.m. UTC | #2
On Tue, Jun 14, 2022 at 03:02:22PM +0300, Kirill A. Shutemov wrote:
> Pull functionality from the main kernel headers and lib/ that is
> required for unaccepted memory support.
> 
> This is preparatory patch. The users for the functionality will come in
> following patches.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/boot/bitops.h                   | 40 ++++++++++++
>  arch/x86/boot/compressed/align.h         | 14 +++++
>  arch/x86/boot/compressed/bitmap.c        | 43 +++++++++++++
>  arch/x86/boot/compressed/bitmap.h        | 49 +++++++++++++++
>  arch/x86/boot/compressed/bits.h          | 36 +++++++++++
>  arch/x86/boot/compressed/compiler.h      |  9 +++
>  arch/x86/boot/compressed/find.c          | 54 ++++++++++++++++
>  arch/x86/boot/compressed/find.h          | 80 ++++++++++++++++++++++++
>  arch/x86/boot/compressed/math.h          | 37 +++++++++++
>  arch/x86/boot/compressed/minmax.h        | 61 ++++++++++++++++++
>  arch/x86/boot/compressed/pgtable_types.h | 25 ++++++++

That's quite a lot of duplicated code; is there really no way so share
this?
Kirill A. Shutemov June 15, 2022, 3:05 p.m. UTC | #3
On Wed, Jun 15, 2022 at 12:19:45PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 14, 2022 at 03:02:22PM +0300, Kirill A. Shutemov wrote:
> > Pull functionality from the main kernel headers and lib/ that is
> > required for unaccepted memory support.
> > 
> > This is preparatory patch. The users for the functionality will come in
> > following patches.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  arch/x86/boot/bitops.h                   | 40 ++++++++++++
> >  arch/x86/boot/compressed/align.h         | 14 +++++
> >  arch/x86/boot/compressed/bitmap.c        | 43 +++++++++++++
> >  arch/x86/boot/compressed/bitmap.h        | 49 +++++++++++++++
> >  arch/x86/boot/compressed/bits.h          | 36 +++++++++++
> >  arch/x86/boot/compressed/compiler.h      |  9 +++
> >  arch/x86/boot/compressed/find.c          | 54 ++++++++++++++++
> >  arch/x86/boot/compressed/find.h          | 80 ++++++++++++++++++++++++
> >  arch/x86/boot/compressed/math.h          | 37 +++++++++++
> >  arch/x86/boot/compressed/minmax.h        | 61 ++++++++++++++++++
> >  arch/x86/boot/compressed/pgtable_types.h | 25 ++++++++
> 
> That's quite a lot of duplicated code; is there really no way so share
> this?

Code duplication also make me uncomfortable. But that what Borislav wanted
to see. efi.h in the boot stub which copies bulk of <linux/efi.h> also
sets the trend in the direction.

Alternative is creating a subset of headers that can be used in both in
main kernel and boot stub. It is more complex and doesn't allow for short
cuts that can be made on copy if you know the context it is used in.

It also sounds painfully similar to uapi/ project. I'm not sure we want to
go this path.
Peter Gonda June 24, 2022, 4:37 p.m. UTC | #4
On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, requiring memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific for the Virtual
> Machine platform.
>
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
>
> The kernel needs to know what memory has been accepted. Firmware
> communicates this information via memory map: a new memory type --
> EFI_UNACCEPTED_MEMORY -- indicates such memory.
>
> Range-based tracking works fine for firmware, but it gets bulky for
> the kernel: e820 has to be modified on every page acceptance. It leads
> to table fragmentation, but there's a limited number of entries in the
> e820 table
>
> Another option is to mark such memory as usable in e820 and track if the
> range has been accepted in a bitmap. One bit in the bitmap represents
> 2MiB in the address space: one 4k page is enough to track 64GiB or
> physical address space.
>
> In the worst-case scenario -- a huge hole in the middle of the
> address space -- It needs 256MiB to handle 4PiB of the address
> space.
>
> Any unaccepted memory that is not aligned to 2M gets accepted upfront.
>
> The approach lowers boot time substantially. Boot to shell is ~2.5x
> faster for 4G TDX VM and ~4x faster for 64G.
>
> TDX-specific code isolated from the core of unaccepted memory support. It
> supposed to help to plug-in different implementation of unaccepted memory
> such as SEV-SNP.
>
> The tree can be found here:
>
> https://github.com/intel/tdx.git guest-unaccepted-memory

Hi Kirill,

I have a couple questions about this feature mainly about how cloud
customers can use this, I assume since this is a confidential compute
feature a large number of the users of these patches will be cloud
customers using TDX and SNP. One issue I see with these patches is how
do we as a cloud provider know whether a customer's linux image
supports this feature, if the image doesn't have these patches UEFI
needs to fully validate the memory, if the image does we can use this
new protocol. In GCE we supply our VMs with a version of the EDK2 FW
and the customer doesn't input into which UEFI we run, as far as I can
tell from the Azure SNP VM documentation it seems very similar. We
need to somehow tell our UEFI in the VM what to do based on the image.
The current way I can see to solve this issue would be to have our
customers give us metadata about their VM's image but this seems kinda
burdensome on our customers (I assume we'll have more features which
both UEFI and kernel need to both support inorder to be turned on like
this one) and error-prone, if a customer incorrectly labels their
image it may fail to boot.. Has there been any discussion about how to
solve this? My naive thoughts were what if UEFI and Kernel had some
sort of feature negotiation. Maybe that could happen via an extension
to exit boot services or a UEFI runtime driver, I'm not sure what's
best here just some ideas.



>
> v7:
>  - Rework meminfo counter to use PageUnaccepted() and move to generic code;
>  - Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
>  - Add Reviewed-by from David;
> v6:
>  - Fix load_unaligned_zeropad() on machine with unaccepted memory;
>  - Clear PageUnaccepted() on merged pages, leaving it only on head;
>  - Clarify error handling in allocate_e820();
>  - Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
>  - Disable kexec at boottime instead of build conflict;
>  - Rebased to tip/master;
>  - Spelling fixes;
>  - Add Reviewed-by from Mike and David;
> v5:
>  - Updates comments and commit messages;
>    + Explain options for unaccepted memory handling;
>  - Expose amount of unaccepted memory in /proc/meminfo
>  - Adjust check in page_expected_state();
>  - Fix error code handling in allocate_e820();
>  - Centralize __pa()/__va() definitions in the boot stub;
>  - Avoid includes from the main kernel in the boot stub;
>  - Use an existing hole in boot_param for unaccepted_memory, instead of adding
>    to the end of the structure;
>  - Extract allocate_unaccepted_memory() form allocate_e820();
>  - Complain if there's unaccepted memory, but kernel does not support it;
>  - Fix vmstat counter;
>  - Split up few preparatory patches;
>  - Random readability adjustments;
> v4:
>  - PageBuddyUnaccepted() -> PageUnaccepted;
>  - Use separate page_type, not shared with offline;
>  - Rework interface between core-mm and arch code;
>  - Adjust commit messages;
>  - Ack from Mike;
>
> Kirill A. Shutemov (14):
>   x86/boot: Centralize __pa()/__va() definitions
>   mm: Add support for unaccepted memory
>   mm: Report unaccepted memory in meminfo
>   efi/x86: Get full memory map in allocate_e820()
>   x86/boot: Add infrastructure required for unaccepted memory support
>   efi/x86: Implement support for unaccepted memory
>   x86/boot/compressed: Handle unaccepted memory
>   x86/mm: Reserve unaccepted memory bitmap
>   x86/mm: Provide helpers for unaccepted memory
>   x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
>   x86: Disable kexec if system has unaccepted memory
>   x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
>     boot stub
>   x86/tdx: Refactor try_accept_one()
>   x86/tdx: Add unaccepted memory support
>
>  Documentation/x86/zero-page.rst          |   1 +
>  arch/x86/Kconfig                         |   1 +
>  arch/x86/boot/bitops.h                   |  40 ++++++++
>  arch/x86/boot/compressed/Makefile        |   1 +
>  arch/x86/boot/compressed/align.h         |  14 +++
>  arch/x86/boot/compressed/bitmap.c        |  43 ++++++++
>  arch/x86/boot/compressed/bitmap.h        |  49 +++++++++
>  arch/x86/boot/compressed/bits.h          |  36 +++++++
>  arch/x86/boot/compressed/compiler.h      |   9 ++
>  arch/x86/boot/compressed/efi.h           |   1 +
>  arch/x86/boot/compressed/find.c          |  54 ++++++++++
>  arch/x86/boot/compressed/find.h          |  80 +++++++++++++++
>  arch/x86/boot/compressed/ident_map_64.c  |   8 --
>  arch/x86/boot/compressed/kaslr.c         |  35 ++++---
>  arch/x86/boot/compressed/math.h          |  37 +++++++
>  arch/x86/boot/compressed/mem.c           | 111 ++++++++++++++++++++
>  arch/x86/boot/compressed/minmax.h        |  61 +++++++++++
>  arch/x86/boot/compressed/misc.c          |   6 ++
>  arch/x86/boot/compressed/misc.h          |  15 +++
>  arch/x86/boot/compressed/pgtable_types.h |  25 +++++
>  arch/x86/boot/compressed/sev.c           |   2 -
>  arch/x86/boot/compressed/tdx.c           |  78 ++++++++++++++
>  arch/x86/coco/tdx/tdx.c                  |  94 ++++++++---------
>  arch/x86/include/asm/page.h              |   3 +
>  arch/x86/include/asm/shared/tdx.h        |  47 +++++++++
>  arch/x86/include/asm/tdx.h               |  19 ----
>  arch/x86/include/asm/unaccepted_memory.h |  16 +++
>  arch/x86/include/uapi/asm/bootparam.h    |   2 +-
>  arch/x86/kernel/e820.c                   |  10 ++
>  arch/x86/mm/Makefile                     |   2 +
>  arch/x86/mm/unaccepted_memory.c          | 123 +++++++++++++++++++++++
>  drivers/base/node.c                      |   7 ++
>  drivers/firmware/efi/Kconfig             |  14 +++
>  drivers/firmware/efi/efi.c               |   1 +
>  drivers/firmware/efi/libstub/x86-stub.c  | 103 ++++++++++++++++---
>  fs/proc/meminfo.c                        |   5 +
>  include/linux/efi.h                      |   3 +-
>  include/linux/mmzone.h                   |   1 +
>  include/linux/page-flags.h               |  31 ++++++
>  mm/internal.h                            |  12 +++
>  mm/memblock.c                            |   9 ++
>  mm/page_alloc.c                          |  96 +++++++++++++++++-
>  mm/vmstat.c                              |   1 +
>  43 files changed, 1191 insertions(+), 115 deletions(-)
>  create mode 100644 arch/x86/boot/compressed/align.h
>  create mode 100644 arch/x86/boot/compressed/bitmap.c
>  create mode 100644 arch/x86/boot/compressed/bitmap.h
>  create mode 100644 arch/x86/boot/compressed/bits.h
>  create mode 100644 arch/x86/boot/compressed/compiler.h
>  create mode 100644 arch/x86/boot/compressed/find.c
>  create mode 100644 arch/x86/boot/compressed/find.h
>  create mode 100644 arch/x86/boot/compressed/math.h
>  create mode 100644 arch/x86/boot/compressed/mem.c
>  create mode 100644 arch/x86/boot/compressed/minmax.h
>  create mode 100644 arch/x86/boot/compressed/pgtable_types.h
>  create mode 100644 arch/x86/include/asm/unaccepted_memory.h
>  create mode 100644 arch/x86/mm/unaccepted_memory.c
>
> --
> 2.35.1
>
>
Dave Hansen June 24, 2022, 4:57 p.m. UTC | #5
Peter, is your enter key broken?  You seem to be typing all your text in
a single unreadable paragraph.

On 6/24/22 09:37, Peter Gonda wrote:
> if a customer incorrectly labels their image it may fail to boot..

You're saying that firmware basically has two choices:
1. Accept all the memory up front and boot slowly, but reliably
2. Use thus "unaccepted memory" mechanism, boot fast, but risk that the
   VM loses a bunch of memory.

If the guest can't even boot because of a lack of memory, then the
pre-accepted chunk is probably too small in the first place.

If the customer screws up, they lose a bunch of the RAM they paid for.
That seems like a rather self-correcting problem to me.
Marc Orr June 24, 2022, 5:06 p.m. UTC | #6
On Fri, Jun 24, 2022 at 9:57 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> Peter, is your enter key broken?  You seem to be typing all your text in
> a single unreadable paragraph.
>
> On 6/24/22 09:37, Peter Gonda wrote:
> > if a customer incorrectly labels their image it may fail to boot..
>
> You're saying that firmware basically has two choices:
> 1. Accept all the memory up front and boot slowly, but reliably
> 2. Use thus "unaccepted memory" mechanism, boot fast, but risk that the
>    VM loses a bunch of memory.
>
> If the guest can't even boot because of a lack of memory, then the
> pre-accepted chunk is probably too small in the first place.
>
> If the customer screws up, they lose a bunch of the RAM they paid for.
> That seems like a rather self-correcting problem to me.

I think Peter's point is a little more nuanced than that. Once lazy
accept goes into the guest firmware -- without the feature negotiation
that Peter is suggesting -- cloud providers now have a bookkeeping
problem. Which images have kernels that can boot from a guest firmware
that doesn't pre-validate all the guest memory?

The way we've been solving similar bookkeeping problems up to now
(e.g., Which guest can run with CVM features like TDX/SEV enabled?
Which SEV guests can live migrate?) is as follows. We tag images with
feature tags. But this is sort of a hack. And not a great one. It's
confusing to customers, hard for the cloud service provider to
support, and easy to mess up.

It would be better if the guest FW knew whether or not the kernel it
was going to launch supported lazy accept.

That being said, this does seem like a difficult problem to solve,
since it's sort of backward from how things work, in that when the
guest firmware wants to switch between pre-validating all memory vs.
minimizing what it pre-validates, the guest kernel is not running yet!
But if there is some way to do this, it would be a huge improvement
over the current status quo of pushing the feature negotiation up to
the cloud service provider and ultimately the cloud customer.
Dave Hansen June 24, 2022, 5:09 p.m. UTC | #7
On 6/24/22 10:06, Marc Orr wrote:
> I think Peter's point is a little more nuanced than that. Once lazy
> accept goes into the guest firmware -- without the feature negotiation
> that Peter is suggesting -- cloud providers now have a bookkeeping
> problem. Which images have kernels that can boot from a guest firmware
> that doesn't pre-validate all the guest memory?

Hold on a sec though...

Is this a matter of

	can boot from a guest firmware that doesn't pre-validate all the
	guest memory?

or

	can boot from a guest firmware that doesn't pre-validate all the
	guest memory ... with access to all of that guest's RAM?

In other words, are we talking about "fails to boot" or "can't see all
the RAM"?
Peter Gonda June 24, 2022, 5:15 p.m. UTC | #8
>> > Peter, is your enter key broken?  You seem to be typing all your text in
>> > a single unreadable paragraph.

Sorry I will try to format better in the future.

>> > You're saying that firmware basically has two choices:
>> > 1. Accept all the memory up front and boot slowly, but reliably
>> > 2. Use thus "unaccepted memory" mechanism, boot fast, but risk that the
>> >    VM loses a bunch of memory.

That's right. Given that the first round of SNP guest patches are in
but this work to support unaccepted memory for SNP is not we assume we
will have distros that support SNP without this "unaccepted memory"
feature.

On Fri, Jun 24, 2022 at 11:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/24/22 10:06, Marc Orr wrote:
> > I think Peter's point is a little more nuanced than that. Once lazy
> > accept goes into the guest firmware -- without the feature negotiation
> > that Peter is suggesting -- cloud providers now have a bookkeeping
> > problem. Which images have kernels that can boot from a guest firmware
> > that doesn't pre-validate all the guest memory?
>
> Hold on a sec though...
>
> Is this a matter of
>
>         can boot from a guest firmware that doesn't pre-validate all the
>         guest memory?
>
> or
>
>         can boot from a guest firmware that doesn't pre-validate all the
>         guest memory ... with access to all of that guest's RAM?
>
> In other words, are we talking about "fails to boot" or "can't see all
> the RAM"?
>

Yes, I'm sorry I was mistaken. If FW uses unaccepted memory but the
kernel doesn't support it the VM should still boot but will fail to
utilize all of its given RAM.

>> > If the customer screws up, they lose a bunch of the RAM they paid for.
>> > That seems like a rather self-correcting problem to me.

Providing customers with an easy to use product is a problem for us
the cloud provider, encoding foot-guns doesn't sound like what's best
for the user here. I wanted to bring this up here since it seems like
a problem most vendors/users of SNP and TDX would run into. We can of
course figure this out internally if no one else sees this as an
issue.
Marc Orr June 24, 2022, 5:19 p.m. UTC | #9
On Fri, Jun 24, 2022 at 10:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/24/22 10:06, Marc Orr wrote:
> > I think Peter's point is a little more nuanced than that. Once lazy
> > accept goes into the guest firmware -- without the feature negotiation
> > that Peter is suggesting -- cloud providers now have a bookkeeping
> > problem. Which images have kernels that can boot from a guest firmware
> > that doesn't pre-validate all the guest memory?
>
> Hold on a sec though...
>
> Is this a matter of
>
>         can boot from a guest firmware that doesn't pre-validate all the
>         guest memory?
>
> or
>
>         can boot from a guest firmware that doesn't pre-validate all the
>         guest memory ... with access to all of that guest's RAM?
>
> In other words, are we talking about "fails to boot" or "can't see all
> the RAM"?

Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
won't have access to all of the memory that the customer is paying
for. But that's still bad. If the customer buys a 96 GB VM and can
only see 4GB because they're kernel doesn't have these patches they're
going to be confused and frustrated.
Peter Gonda June 24, 2022, 5:21 p.m. UTC | #10
On Fri, Jun 24, 2022 at 11:19 AM Marc Orr <marcorr@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 6/24/22 10:06, Marc Orr wrote:
> > > I think Peter's point is a little more nuanced than that. Once lazy
> > > accept goes into the guest firmware -- without the feature negotiation
> > > that Peter is suggesting -- cloud providers now have a bookkeeping
> > > problem. Which images have kernels that can boot from a guest firmware
> > > that doesn't pre-validate all the guest memory?
> >
> > Hold on a sec though...
> >
> > Is this a matter of
> >
> >         can boot from a guest firmware that doesn't pre-validate all the
> >         guest memory?
> >
> > or
> >
> >         can boot from a guest firmware that doesn't pre-validate all the
> >         guest memory ... with access to all of that guest's RAM?
> >
> > In other words, are we talking about "fails to boot" or "can't see all
> > the RAM"?
>
> Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
> won't have access to all of the memory that the customer is paying
> for. But that's still bad. If the customer buys a 96 GB VM and can
> only see 4GB because they're kernel doesn't have these patches they're
> going to be confused and frustrated.

The other error case which might be more confusing to the customer is
their kernel does have these patches, there is some misconfiguration
and their VM boots slowly because the FW uses the accept all memory
approach.
Michael Roth June 24, 2022, 5:40 p.m. UTC | #11
On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, requiring memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific for the Virtual
> > Machine platform.
> >
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> >
> > The kernel needs to know what memory has been accepted. Firmware
> > communicates this information via memory map: a new memory type --
> > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> >
> > Range-based tracking works fine for firmware, but it gets bulky for
> > the kernel: e820 has to be modified on every page acceptance. It leads
> > to table fragmentation, but there's a limited number of entries in the
> > e820 table
> >
> > Another option is to mark such memory as usable in e820 and track if the
> > range has been accepted in a bitmap. One bit in the bitmap represents
> > 2MiB in the address space: one 4k page is enough to track 64GiB or
> > physical address space.
> >
> > In the worst-case scenario -- a huge hole in the middle of the
> > address space -- It needs 256MiB to handle 4PiB of the address
> > space.
> >
> > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> >
> > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > faster for 4G TDX VM and ~4x faster for 64G.
> >
> > TDX-specific code isolated from the core of unaccepted memory support. It
> > supposed to help to plug-in different implementation of unaccepted memory
> > such as SEV-SNP.
> >
> > The tree can be found here:
> >
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fintel%2Ftdx.git&amp;data=05%7C01%7Cmichael.roth%40amd.com%7C73bacba017c84291482a08da55ffd481%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637916854542432349%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=P%2FUJOL305xo85NLXGxGouQVGHgzLJpmBdNyZ7Re5%2FB0%3D&amp;reserved=0 guest-unaccepted-memory
> 
> Hi Kirill,
> 
> I have a couple questions about this feature mainly about how cloud
> customers can use this, I assume since this is a confidential compute
> feature a large number of the users of these patches will be cloud
> customers using TDX and SNP. One issue I see with these patches is how
> do we as a cloud provider know whether a customer's linux image
> supports this feature, if the image doesn't have these patches UEFI
> needs to fully validate the memory, if the image does we can use this
> new protocol. In GCE we supply our VMs with a version of the EDK2 FW
> and the customer doesn't input into which UEFI we run, as far as I can
> tell from the Azure SNP VM documentation it seems very similar. We
> need to somehow tell our UEFI in the VM what to do based on the image.
> The current way I can see to solve this issue would be to have our
> customers give us metadata about their VM's image but this seems kinda
> burdensome on our customers (I assume we'll have more features which
> both UEFI and kernel need to both support inorder to be turned on like
> this one) and error-prone, if a customer incorrectly labels their

> image it may fail to boot.. Has there been any discussion about how to
> solve this? My naive thoughts were what if UEFI and Kernel had some
> sort of feature negotiation. Maybe that could happen via an extension
> to exit boot services or a UEFI runtime driver, I'm not sure what's
> best here just some ideas.

Not sure if you've seen this thread or not, but there's also been some
discussion around this in the context of the UEFI support:

  https://patchew.org/EDK2/cover.1654420875.git.min.m.xu@intel.com/cce5ea2aaaeddd9ce9df6fa7ac1ef52976c5c7e6.1654420876.git.min.m.xu@intel.com/#20220608061805.vvsjiqt55rqnl3fw@sirius.home.kraxel.org

2 things being discussed there really, which I think roughly boil down
to:

 1) how to configure OVMF to enable/disable lazy acceptance
    - compile time option most likely: accept-all/accept-minimum/accept-1GB

 2) how to introduce an automatic mode in the future where OVMF does the
    right thing based on what the guest supports. Gerd floated the idea of
    tying it to ExitBootServices as well, but not sure there's a solid
    plan on what to do here yet.

If that's accurate, it seems like the only 'safe' option is to disable it via
#1 (accept-all), and then when #2 comes along, compile OVMF to just Do The
Right Thing.

Users who know their VMs implement lazy acceptance can force it on via
accept-all OVMF compile option.

-Mike
Dave Hansen June 24, 2022, 5:47 p.m. UTC | #12
On 6/24/22 10:19, Marc Orr wrote:
>> Is this a matter of
>>
>>         can boot from a guest firmware that doesn't pre-validate all the
>>         guest memory?
>>
>> or
>>
>>         can boot from a guest firmware that doesn't pre-validate all the
>>         guest memory ... with access to all of that guest's RAM?
>>
>> In other words, are we talking about "fails to boot" or "can't see all
>> the RAM"?
> Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
> won't have access to all of the memory that the customer is paying
> for. But that's still bad. If the customer buys a 96 GB VM and can
> only see 4GB because they're kernel doesn't have these patches they're
> going to be confused and frustrated.

They'll at least be a _bit_ less angry and frustrated than if they were
staring at a blank screen. ;)  But, yeah, I totally get the point.

How big is the window going to be where we have guests that can have
unaccepted memory, but don't have acceptance support?  For TDX, it's
looking like it'll probably _just_ be 5.19.  Is TDX on 5.19 in shape
that cloud providers can deploy it?  Or, is stuff like lack of
attestation a deal breaker?
Michael Roth June 24, 2022, 5:58 p.m. UTC | #13
On Fri, Jun 24, 2022 at 12:40:57PM -0500, Michael Roth wrote:
> 
>  1) how to configure OVMF to enable/disable lazy acceptance
>     - compile time option most likely: accept-all/accept-minimum/accept-1GB
> 
>  2) how to introduce an automatic mode in the future where OVMF does the
>     right thing based on what the guest supports. Gerd floated the idea of
>     tying it to ExitBootServices as well, but not sure there's a solid
>     plan on what to do here yet.
> 
> If that's accurate, it seems like the only 'safe' option is to disable it via
> #1 (accept-all), and then when #2 comes along, compile OVMF to just Do The
> Right Thing.
> 
> Users who know their VMs implement lazy acceptance can force it on via
> accept-all OVMF compile option.

accept-min / accept-X I mean.
Peter Gonda June 24, 2022, 6:05 p.m. UTC | #14
On Fri, Jun 24, 2022 at 11:41 AM Michael Roth <michael.roth@amd.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > UEFI Specification version 2.9 introduces the concept of memory
> > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > SEV-SNP, requiring memory to be accepted before it can be used by the
> > > guest. Accepting happens via a protocol specific for the Virtual
> > > Machine platform.
> > >
> > > Accepting memory is costly and it makes VMM allocate memory for the
> > > accepted guest physical address range. It's better to postpone memory
> > > acceptance until memory is needed. It lowers boot time and reduces
> > > memory overhead.
> > >
> > > The kernel needs to know what memory has been accepted. Firmware
> > > communicates this information via memory map: a new memory type --
> > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > >
> > > Range-based tracking works fine for firmware, but it gets bulky for
> > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > to table fragmentation, but there's a limited number of entries in the
> > > e820 table
> > >
> > > Another option is to mark such memory as usable in e820 and track if the
> > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > 2MiB in the address space: one 4k page is enough to track 64GiB or
> > > physical address space.
> > >
> > > In the worst-case scenario -- a huge hole in the middle of the
> > > address space -- It needs 256MiB to handle 4PiB of the address
> > > space.
> > >
> > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > >
> > > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > > faster for 4G TDX VM and ~4x faster for 64G.
> > >
> > > TDX-specific code isolated from the core of unaccepted memory support. It
> > > supposed to help to plug-in different implementation of unaccepted memory
> > > such as SEV-SNP.
> > >
> > > The tree can be found here:
> > >
> > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fintel%2Ftdx.git&amp;data=05%7C01%7Cmichael.roth%40amd.com%7C73bacba017c84291482a08da55ffd481%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637916854542432349%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=P%2FUJOL305xo85NLXGxGouQVGHgzLJpmBdNyZ7Re5%2FB0%3D&amp;reserved=0 guest-unaccepted-memory
> >
> > Hi Kirill,
> >
> > I have a couple questions about this feature mainly about how cloud
> > customers can use this, I assume since this is a confidential compute
> > feature a large number of the users of these patches will be cloud
> > customers using TDX and SNP. One issue I see with these patches is how
> > do we as a cloud provider know whether a customer's linux image
> > supports this feature, if the image doesn't have these patches UEFI
> > needs to fully validate the memory, if the image does we can use this
> > new protocol. In GCE we supply our VMs with a version of the EDK2 FW
> > and the customer doesn't input into which UEFI we run, as far as I can
> > tell from the Azure SNP VM documentation it seems very similar. We
> > need to somehow tell our UEFI in the VM what to do based on the image.
> > The current way I can see to solve this issue would be to have our
> > customers give us metadata about their VM's image but this seems kinda
> > burdensome on our customers (I assume we'll have more features which
> > both UEFI and kernel need to both support inorder to be turned on like
> > this one) and error-prone, if a customer incorrectly labels their
>
> > image it may fail to boot.. Has there been any discussion about how to
> > solve this? My naive thoughts were what if UEFI and Kernel had some
> > sort of feature negotiation. Maybe that could happen via an extension
> > to exit boot services or a UEFI runtime driver, I'm not sure what's
> > best here just some ideas.
>
> Not sure if you've seen this thread or not, but there's also been some
> discussion around this in the context of the UEFI support:
>
>   https://patchew.org/EDK2/cover.1654420875.git.min.m.xu@intel.com/cce5ea2aaaeddd9ce9df6fa7ac1ef52976c5c7e6.1654420876.git.min.m.xu@intel.com/#20220608061805.vvsjiqt55rqnl3fw@sirius.home.kraxel.org
>
> 2 things being discussed there really, which I think roughly boil down
> to:
>
>  1) how to configure OVMF to enable/disable lazy acceptance
>     - compile time option most likely: accept-all/accept-minimum/accept-1GB
>
>  2) how to introduce an automatic mode in the future where OVMF does the
>     right thing based on what the guest supports. Gerd floated the idea of
>     tying it to ExitBootServices as well, but not sure there's a solid
>     plan on what to do here yet.
>
> If that's accurate, it seems like the only 'safe' option is to disable it via
> #1 (accept-all), and then when #2 comes along, compile OVMF to just Do The
> Right Thing.
>
> Users who know their VMs implement lazy acceptance can force it on via
> accept-all OVMF compile option.

Thanks for this Mike! I will bring this to the EDK2 community.

The issue for us is our users use a GCE built EDK2 not their own
compiled version so they don't have the choice. Reading the Azure docs
it seems the same for them, and for AWS so I don't know how often
customers actually get to bring their own firmware.

>
> -Mike
Peter Gonda June 24, 2022, 6:10 p.m. UTC | #15
On Fri, Jun 24, 2022 at 11:47 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/24/22 10:19, Marc Orr wrote:
> >> Is this a matter of
> >>
> >>         can boot from a guest firmware that doesn't pre-validate all the
> >>         guest memory?
> >>
> >> or
> >>
> >>         can boot from a guest firmware that doesn't pre-validate all the
> >>         guest memory ... with access to all of that guest's RAM?
> >>
> >> In other words, are we talking about "fails to boot" or "can't see all
> >> the RAM"?
> > Ah... yeah, you're right, Dave -- I guess it's the latter. The guest
> > won't have access to all of the memory that the customer is paying
> > for. But that's still bad. If the customer buys a 96 GB VM and can
> > only see 4GB because they're kernel doesn't have these patches they're
> > going to be confused and frustrated.
>
> They'll at least be a _bit_ less angry and frustrated than if they were
> staring at a blank screen. ;)  But, yeah, I totally get the point.

Ha! Well we do have that issue in some cases. If you try to run an SEV
VM with an image that doesn't support SEV you will just get a blank
serial screen. If we had something like this back then the FW could
have surfaced a nice error to the user but that's history now.

>
> How big is the window going to be where we have guests that can have
> unaccepted memory, but don't have acceptance support?  For TDX, it's
> looking like it'll probably _just_ be 5.19.  Is TDX on 5.19 in shape
> that cloud providers can deploy it?  Or, is stuff like lack of
> attestation a deal breaker?

This is complicated because distros don't run upstream linux versions.
If I understand correctly (I see some distro emails on here so please
correct me) distros normally maintain forks which they backport things
into. So I cannot answer this question. It is possible that a
hypothetical distro backports only the SNP/TDX initial patches and
doesn't take these for many releases.

I am more familiar with SNP and it does have some attestation support
in the first patch sets.

Also I should have been more clear. I don't want to try and hold up
this feature but instead discuss a future usability add-on feature.

>
>
Dave Hansen June 24, 2022, 6:13 p.m. UTC | #16
On 6/24/22 11:10, Peter Gonda wrote:
>> How big is the window going to be where we have guests that can have
>> unaccepted memory, but don't have acceptance support?  For TDX, it's
>> looking like it'll probably _just_ be 5.19.  Is TDX on 5.19 in shape
>> that cloud providers can deploy it?  Or, is stuff like lack of
>> attestation a deal breaker?
> This is complicated because distros don't run upstream linux versions.
> If I understand correctly (I see some distro emails on here so please
> correct me) distros normally maintain forks which they backport things
> into. So I cannot answer this question. It is possible that a
> hypothetical distro backports only the SNP/TDX initial patches and
> doesn't take these for many releases.

Distros could also backport a bare-bones version of this set that
doesn't do anything fancy and just synchronously accepts the memory at
boot.  No bitmap, no page allocator changes.  It'll slow boot down, but
is better than having no RAM.
Kirill A. Shutemov June 27, 2022, 11:30 a.m. UTC | #17
On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, requiring memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific for the Virtual
> > Machine platform.
> >
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> >
> > The kernel needs to know what memory has been accepted. Firmware
> > communicates this information via memory map: a new memory type --
> > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> >
> > Range-based tracking works fine for firmware, but it gets bulky for
> > the kernel: e820 has to be modified on every page acceptance. It leads
> > to table fragmentation, but there's a limited number of entries in the
> > e820 table
> >
> > Another option is to mark such memory as usable in e820 and track if the
> > range has been accepted in a bitmap. One bit in the bitmap represents
> > 2MiB in the address space: one 4k page is enough to track 64GiB or
> > physical address space.
> >
> > In the worst-case scenario -- a huge hole in the middle of the
> > address space -- It needs 256MiB to handle 4PiB of the address
> > space.
> >
> > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> >
> > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > faster for 4G TDX VM and ~4x faster for 64G.
> >
> > TDX-specific code isolated from the core of unaccepted memory support. It
> > supposed to help to plug-in different implementation of unaccepted memory
> > such as SEV-SNP.
> >
> > The tree can be found here:
> >
> > https://github.com/intel/tdx.git guest-unaccepted-memory
> 
> Hi Kirill,
> 
> I have a couple questions about this feature mainly about how cloud
> customers can use this, I assume since this is a confidential compute
> feature a large number of the users of these patches will be cloud
> customers using TDX and SNP. One issue I see with these patches is how
> do we as a cloud provider know whether a customer's linux image
> supports this feature, if the image doesn't have these patches UEFI
> needs to fully validate the memory, if the image does we can use this
> new protocol. In GCE we supply our VMs with a version of the EDK2 FW
> and the customer doesn't input into which UEFI we run, as far as I can
> tell from the Azure SNP VM documentation it seems very similar. We
> need to somehow tell our UEFI in the VM what to do based on the image.
> The current way I can see to solve this issue would be to have our
> customers give us metadata about their VM's image but this seems kinda
> burdensome on our customers (I assume we'll have more features which
> both UEFI and kernel need to both support inorder to be turned on like
> this one) and error-prone, if a customer incorrectly labels their
> image it may fail to boot.. Has there been any discussion about how to
> solve this? My naive thoughts were what if UEFI and Kernel had some
> sort of feature negotiation. Maybe that could happen via an extension
> to exit boot services or a UEFI runtime driver, I'm not sure what's
> best here just some ideas.

Just as an idea, we can put info into UTS_VERSION which can be read from
the built bzImage. We have info on SMP and preeption there already.

Patch below does this:

$ file arch/x86/boot/bzImage
arch/x86/boot/bzImage: Linux kernel x86 boot executable bzImage, version 5.19.0-rc3-00016-g2f6aa48e28d9-dirty (kas@box) #2300 SMP PREEMPT_DYNAMIC UNACCEPTED_MEMORY Mon Jun 27 14:23:04 , RO-rootFS, swap_dev 0XC, Normal VGA

Note UNACCEPTED_MEMORY in the output.

Probably we want to have there info on which flavour of unaccepted memory
is supported (TDX/SNP/whatever). It is a bit more tricky.

Any opinion?

diff --git a/init/Makefile b/init/Makefile
index d82623d7fc8e..6688ea43e6bf 100644
--- a/init/Makefile
+++ b/init/Makefile
@@ -32,7 +32,7 @@ quiet_cmd_compile.h = CHK     $@
 	$(CONFIG_SHELL) $(srctree)/scripts/mkcompile_h $@	\
 	"$(UTS_MACHINE)" "$(CONFIG_SMP)" "$(CONFIG_PREEMPT_BUILD)"	\
 	"$(CONFIG_PREEMPT_DYNAMIC)" "$(CONFIG_PREEMPT_RT)" \
-	"$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
+	"$(CONFIG_UNACCEPTED_MEMORY)" "$(CONFIG_CC_VERSION_TEXT)" "$(LD)"

 include/generated/compile.h: FORCE
 	$(call cmd,compile.h)
diff --git a/scripts/mkcompile_h b/scripts/mkcompile_h
index ca40a5258c87..efacfecad699 100755
--- a/scripts/mkcompile_h
+++ b/scripts/mkcompile_h
@@ -7,8 +7,9 @@ SMP=$3
 PREEMPT=$4
 PREEMPT_DYNAMIC=$5
 PREEMPT_RT=$6
-CC_VERSION="$7"
-LD=$8
+UNACCEPTED_MEMORY=$7
+CC_VERSION="$8"
+LD=$9

 # Do not expand names
 set -f
@@ -51,6 +52,10 @@ elif [ -n "$PREEMPT" ] ; then
 	CONFIG_FLAGS="$CONFIG_FLAGS PREEMPT"
 fi

+if [ -n "$UNACCEPTED_MEMORY" ] ; then
+	CONFIG_FLAGS="$CONFIG_FLAGS UNACCEPTED_MEMORY"
+fi
+
 # Truncate to maximum length
 UTS_LEN=64
 UTS_VERSION="$(echo $UTS_VERSION $CONFIG_FLAGS $TIMESTAMP | cut -b -$UTS_LEN)"
Ard Biesheuvel June 27, 2022, 11:54 a.m. UTC | #18
On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > UEFI Specification version 2.9 introduces the concept of memory
> > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > SEV-SNP, requiring memory to be accepted before it can be used by the
> > > guest. Accepting happens via a protocol specific for the Virtual
> > > Machine platform.
> > >
> > > Accepting memory is costly and it makes VMM allocate memory for the
> > > accepted guest physical address range. It's better to postpone memory
> > > acceptance until memory is needed. It lowers boot time and reduces
> > > memory overhead.
> > >
> > > The kernel needs to know what memory has been accepted. Firmware
> > > communicates this information via memory map: a new memory type --
> > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > >
> > > Range-based tracking works fine for firmware, but it gets bulky for
> > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > to table fragmentation, but there's a limited number of entries in the
> > > e820 table
> > >
> > > Another option is to mark such memory as usable in e820 and track if the
> > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > 2MiB in the address space: one 4k page is enough to track 64GiB or
> > > physical address space.
> > >
> > > In the worst-case scenario -- a huge hole in the middle of the
> > > address space -- It needs 256MiB to handle 4PiB of the address
> > > space.
> > >
> > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > >
> > > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > > faster for 4G TDX VM and ~4x faster for 64G.
> > >
> > > TDX-specific code isolated from the core of unaccepted memory support. It
> > > supposed to help to plug-in different implementation of unaccepted memory
> > > such as SEV-SNP.
> > >
> > > The tree can be found here:
> > >
> > > https://github.com/intel/tdx.git guest-unaccepted-memory
> >
> > Hi Kirill,
> >
> > I have a couple questions about this feature mainly about how cloud
> > customers can use this, I assume since this is a confidential compute
> > feature a large number of the users of these patches will be cloud
> > customers using TDX and SNP. One issue I see with these patches is how
> > do we as a cloud provider know whether a customer's linux image
> > supports this feature, if the image doesn't have these patches UEFI
> > needs to fully validate the memory, if the image does we can use this
> > new protocol. In GCE we supply our VMs with a version of the EDK2 FW
> > and the customer doesn't input into which UEFI we run, as far as I can
> > tell from the Azure SNP VM documentation it seems very similar. We
> > need to somehow tell our UEFI in the VM what to do based on the image.
> > The current way I can see to solve this issue would be to have our
> > customers give us metadata about their VM's image but this seems kinda
> > burdensome on our customers (I assume we'll have more features which
> > both UEFI and kernel need to both support inorder to be turned on like
> > this one) and error-prone, if a customer incorrectly labels their
> > image it may fail to boot.. Has there been any discussion about how to
> > solve this? My naive thoughts were what if UEFI and Kernel had some
> > sort of feature negotiation. Maybe that could happen via an extension
> > to exit boot services or a UEFI runtime driver, I'm not sure what's
> > best here just some ideas.
>
> Just as an idea, we can put info into UTS_VERSION which can be read from
> the built bzImage. We have info on SMP and preeption there already.
>

Instead of hacking this into the binary, couldn't we define a protocol
that the kernel will call from the EFI stub (before EBS()) to identify
itself as an image that understands unaccepted memory, and knows how
to deal with it?

That way, the firmware can accept all the memory on behalf of the OS
at ExitBootServices() time, unless the OS has indicated there is no
need to do so.




> Patch below does this:
>
> $ file arch/x86/boot/bzImage
> arch/x86/boot/bzImage: Linux kernel x86 boot executable bzImage, version 5.19.0-rc3-00016-g2f6aa48e28d9-dirty (kas@box) #2300 SMP PREEMPT_DYNAMIC UNACCEPTED_MEMORY Mon Jun 27 14:23:04 , RO-rootFS, swap_dev 0XC, Normal VGA
>
> Note UNACCEPTED_MEMORY in the output.
>
> Probably we want to have there info on which flavour of unaccepted memory
> is supported (TDX/SNP/whatever). It is a bit more tricky.
>
> Any opinion?
>
> diff --git a/init/Makefile b/init/Makefile
> index d82623d7fc8e..6688ea43e6bf 100644
> --- a/init/Makefile
> +++ b/init/Makefile
> @@ -32,7 +32,7 @@ quiet_cmd_compile.h = CHK     $@
>         $(CONFIG_SHELL) $(srctree)/scripts/mkcompile_h $@       \
>         "$(UTS_MACHINE)" "$(CONFIG_SMP)" "$(CONFIG_PREEMPT_BUILD)"      \
>         "$(CONFIG_PREEMPT_DYNAMIC)" "$(CONFIG_PREEMPT_RT)" \
> -       "$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
> +       "$(CONFIG_UNACCEPTED_MEMORY)" "$(CONFIG_CC_VERSION_TEXT)" "$(LD)"
>
>  include/generated/compile.h: FORCE
>         $(call cmd,compile.h)
> diff --git a/scripts/mkcompile_h b/scripts/mkcompile_h
> index ca40a5258c87..efacfecad699 100755
> --- a/scripts/mkcompile_h
> +++ b/scripts/mkcompile_h
> @@ -7,8 +7,9 @@ SMP=$3
>  PREEMPT=$4
>  PREEMPT_DYNAMIC=$5
>  PREEMPT_RT=$6
> -CC_VERSION="$7"
> -LD=$8
> +UNACCEPTED_MEMORY=$7
> +CC_VERSION="$8"
> +LD=$9
>
>  # Do not expand names
>  set -f
> @@ -51,6 +52,10 @@ elif [ -n "$PREEMPT" ] ; then
>         CONFIG_FLAGS="$CONFIG_FLAGS PREEMPT"
>  fi
>
> +if [ -n "$UNACCEPTED_MEMORY" ] ; then
> +       CONFIG_FLAGS="$CONFIG_FLAGS UNACCEPTED_MEMORY"
> +fi
> +
>  # Truncate to maximum length
>  UTS_LEN=64
>  UTS_VERSION="$(echo $UTS_VERSION $CONFIG_FLAGS $TIMESTAMP | cut -b -$UTS_LEN)"
> --
>  Kirill A. Shutemov
Kirill A. Shutemov June 27, 2022, 12:22 p.m. UTC | #19
On Mon, Jun 27, 2022 at 01:54:45PM +0200, Ard Biesheuvel wrote:
> On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > UEFI Specification version 2.9 introduces the concept of memory
> > > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > > SEV-SNP, requiring memory to be accepted before it can be used by the
> > > > guest. Accepting happens via a protocol specific for the Virtual
> > > > Machine platform.
> > > >
> > > > Accepting memory is costly and it makes VMM allocate memory for the
> > > > accepted guest physical address range. It's better to postpone memory
> > > > acceptance until memory is needed. It lowers boot time and reduces
> > > > memory overhead.
> > > >
> > > > The kernel needs to know what memory has been accepted. Firmware
> > > > communicates this information via memory map: a new memory type --
> > > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > > >
> > > > Range-based tracking works fine for firmware, but it gets bulky for
> > > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > > to table fragmentation, but there's a limited number of entries in the
> > > > e820 table
> > > >
> > > > Another option is to mark such memory as usable in e820 and track if the
> > > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > > 2MiB in the address space: one 4k page is enough to track 64GiB or
> > > > physical address space.
> > > >
> > > > In the worst-case scenario -- a huge hole in the middle of the
> > > > address space -- It needs 256MiB to handle 4PiB of the address
> > > > space.
> > > >
> > > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > > >
> > > > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > > > faster for 4G TDX VM and ~4x faster for 64G.
> > > >
> > > > TDX-specific code isolated from the core of unaccepted memory support. It
> > > > supposed to help to plug-in different implementation of unaccepted memory
> > > > such as SEV-SNP.
> > > >
> > > > The tree can be found here:
> > > >
> > > > https://github.com/intel/tdx.git guest-unaccepted-memory
> > >
> > > Hi Kirill,
> > >
> > > I have a couple questions about this feature mainly about how cloud
> > > customers can use this, I assume since this is a confidential compute
> > > feature a large number of the users of these patches will be cloud
> > > customers using TDX and SNP. One issue I see with these patches is how
> > > do we as a cloud provider know whether a customer's linux image
> > > supports this feature, if the image doesn't have these patches UEFI
> > > needs to fully validate the memory, if the image does we can use this
> > > new protocol. In GCE we supply our VMs with a version of the EDK2 FW
> > > and the customer doesn't input into which UEFI we run, as far as I can
> > > tell from the Azure SNP VM documentation it seems very similar. We
> > > need to somehow tell our UEFI in the VM what to do based on the image.
> > > The current way I can see to solve this issue would be to have our
> > > customers give us metadata about their VM's image but this seems kinda
> > > burdensome on our customers (I assume we'll have more features which
> > > both UEFI and kernel need to both support inorder to be turned on like
> > > this one) and error-prone, if a customer incorrectly labels their
> > > image it may fail to boot.. Has there been any discussion about how to
> > > solve this? My naive thoughts were what if UEFI and Kernel had some
> > > sort of feature negotiation. Maybe that could happen via an extension
> > > to exit boot services or a UEFI runtime driver, I'm not sure what's
> > > best here just some ideas.
> >
> > Just as an idea, we can put info into UTS_VERSION which can be read from
> > the built bzImage. We have info on SMP and preeption there already.
> >
> 
> Instead of hacking this into the binary, couldn't we define a protocol
> that the kernel will call from the EFI stub (before EBS()) to identify
> itself as an image that understands unaccepted memory, and knows how
> to deal with it?
> 
> That way, the firmware can accept all the memory on behalf of the OS
> at ExitBootServices() time, unless the OS has indicated there is no
> need to do so.

I agree it would be better. But I think it would require change to EFI
spec, no?
Peter Gonda June 27, 2022, 4:17 p.m. UTC | #20
On Mon, Jun 27, 2022 at 6:22 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 01:54:45PM +0200, Ard Biesheuvel wrote:
> > On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > > > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > > > <kirill.shutemov@linux.intel.com> wrote:
> > > > >
> > > > > UEFI Specification version 2.9 introduces the concept of memory
> > > > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > > > SEV-SNP, requiring memory to be accepted before it can be used by the
> > > > > guest. Accepting happens via a protocol specific for the Virtual
> > > > > Machine platform.
> > > > >
> > > > > Accepting memory is costly and it makes VMM allocate memory for the
> > > > > accepted guest physical address range. It's better to postpone memory
> > > > > acceptance until memory is needed. It lowers boot time and reduces
> > > > > memory overhead.
> > > > >
> > > > > The kernel needs to know what memory has been accepted. Firmware
> > > > > communicates this information via memory map: a new memory type --
> > > > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > > > >
> > > > > Range-based tracking works fine for firmware, but it gets bulky for
> > > > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > > > to table fragmentation, but there's a limited number of entries in the
> > > > > e820 table
> > > > >
> > > > > Another option is to mark such memory as usable in e820 and track if the
> > > > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > > > 2MiB in the address space: one 4k page is enough to track 64GiB or
> > > > > physical address space.
> > > > >
> > > > > In the worst-case scenario -- a huge hole in the middle of the
> > > > > address space -- It needs 256MiB to handle 4PiB of the address
> > > > > space.
> > > > >
> > > > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > > > >
> > > > > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > > > > faster for 4G TDX VM and ~4x faster for 64G.
> > > > >
> > > > > TDX-specific code isolated from the core of unaccepted memory support. It
> > > > > supposed to help to plug-in different implementation of unaccepted memory
> > > > > such as SEV-SNP.
> > > > >
> > > > > The tree can be found here:
> > > > >
> > > > > https://github.com/intel/tdx.git guest-unaccepted-memory
> > > >
> > > > Hi Kirill,
> > > >
> > > > I have a couple questions about this feature mainly about how cloud
> > > > customers can use this, I assume since this is a confidential compute
> > > > feature a large number of the users of these patches will be cloud
> > > > customers using TDX and SNP. One issue I see with these patches is how
> > > > do we as a cloud provider know whether a customer's linux image
> > > > supports this feature, if the image doesn't have these patches UEFI
> > > > needs to fully validate the memory, if the image does we can use this
> > > > new protocol. In GCE we supply our VMs with a version of the EDK2 FW
> > > > and the customer doesn't input into which UEFI we run, as far as I can
> > > > tell from the Azure SNP VM documentation it seems very similar. We
> > > > need to somehow tell our UEFI in the VM what to do based on the image.
> > > > The current way I can see to solve this issue would be to have our
> > > > customers give us metadata about their VM's image but this seems kinda
> > > > burdensome on our customers (I assume we'll have more features which
> > > > both UEFI and kernel need to both support inorder to be turned on like
> > > > this one) and error-prone, if a customer incorrectly labels their
> > > > image it may fail to boot.. Has there been any discussion about how to
> > > > solve this? My naive thoughts were what if UEFI and Kernel had some
> > > > sort of feature negotiation. Maybe that could happen via an extension
> > > > to exit boot services or a UEFI runtime driver, I'm not sure what's
> > > > best here just some ideas.
> > >
> > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > the built bzImage. We have info on SMP and preeption there already.
> > >
> >
> > Instead of hacking this into the binary, couldn't we define a protocol
> > that the kernel will call from the EFI stub (before EBS()) to identify
> > itself as an image that understands unaccepted memory, and knows how
> > to deal with it?
> >
> > That way, the firmware can accept all the memory on behalf of the OS
> > at ExitBootServices() time, unless the OS has indicated there is no
> > need to do so.
>
> I agree it would be better. But I think it would require change to EFI
> spec, no?

Could this somehow be amended on to the UEFI Specification version 2.9
change which added all of the unaccepted memory features?

>
> --
>  Kirill A. Shutemov
Ard Biesheuvel June 27, 2022, 4:33 p.m. UTC | #21
On Mon, 27 Jun 2022 at 18:17, Peter Gonda <pgonda@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 6:22 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 01:54:45PM +0200, Ard Biesheuvel wrote:
> > > On Mon, 27 Jun 2022 at 13:30, Kirill A. Shutemov
> > > <kirill.shutemov@linux.intel.com> wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 10:37:10AM -0600, Peter Gonda wrote:
> > > > > On Tue, Jun 14, 2022 at 6:03 AM Kirill A. Shutemov
> > > > > <kirill.shutemov@linux.intel.com> wrote:
> > > > > >
> > > > > > UEFI Specification version 2.9 introduces the concept of memory
> > > > > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > > > > SEV-SNP, requiring memory to be accepted before it can be used by the
> > > > > > guest. Accepting happens via a protocol specific for the Virtual
> > > > > > Machine platform.
> > > > > >
> > > > > > Accepting memory is costly and it makes VMM allocate memory for the
> > > > > > accepted guest physical address range. It's better to postpone memory
> > > > > > acceptance until memory is needed. It lowers boot time and reduces
> > > > > > memory overhead.
> > > > > >
> > > > > > The kernel needs to know what memory has been accepted. Firmware
> > > > > > communicates this information via memory map: a new memory type --
> > > > > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > > > > >
> > > > > > Range-based tracking works fine for firmware, but it gets bulky for
> > > > > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > > > > to table fragmentation, but there's a limited number of entries in the
> > > > > > e820 table
> > > > > >
> > > > > > Another option is to mark such memory as usable in e820 and track if the
> > > > > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > > > > 2MiB in the address space: one 4k page is enough to track 64GiB or
> > > > > > physical address space.
> > > > > >
> > > > > > In the worst-case scenario -- a huge hole in the middle of the
> > > > > > address space -- It needs 256MiB to handle 4PiB of the address
> > > > > > space.
> > > > > >
> > > > > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > > > > >
> > > > > > The approach lowers boot time substantially. Boot to shell is ~2.5x
> > > > > > faster for 4G TDX VM and ~4x faster for 64G.
> > > > > >
> > > > > > TDX-specific code isolated from the core of unaccepted memory support. It
> > > > > > supposed to help to plug-in different implementation of unaccepted memory
> > > > > > such as SEV-SNP.
> > > > > >
> > > > > > The tree can be found here:
> > > > > >
> > > > > > https://github.com/intel/tdx.git guest-unaccepted-memory
> > > > >
> > > > > Hi Kirill,
> > > > >
> > > > > I have a couple questions about this feature mainly about how cloud
> > > > > customers can use this, I assume since this is a confidential compute
> > > > > feature a large number of the users of these patches will be cloud
> > > > > customers using TDX and SNP. One issue I see with these patches is how
> > > > > do we as a cloud provider know whether a customer's linux image
> > > > > supports this feature, if the image doesn't have these patches UEFI
> > > > > needs to fully validate the memory, if the image does we can use this
> > > > > new protocol. In GCE we supply our VMs with a version of the EDK2 FW
> > > > > and the customer doesn't input into which UEFI we run, as far as I can
> > > > > tell from the Azure SNP VM documentation it seems very similar. We
> > > > > need to somehow tell our UEFI in the VM what to do based on the image.
> > > > > The current way I can see to solve this issue would be to have our
> > > > > customers give us metadata about their VM's image but this seems kinda
> > > > > burdensome on our customers (I assume we'll have more features which
> > > > > both UEFI and kernel need to both support inorder to be turned on like
> > > > > this one) and error-prone, if a customer incorrectly labels their
> > > > > image it may fail to boot.. Has there been any discussion about how to
> > > > > solve this? My naive thoughts were what if UEFI and Kernel had some
> > > > > sort of feature negotiation. Maybe that could happen via an extension
> > > > > to exit boot services or a UEFI runtime driver, I'm not sure what's
> > > > > best here just some ideas.
> > > >
> > > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > > the built bzImage. We have info on SMP and preeption there already.
> > > >
> > >
> > > Instead of hacking this into the binary, couldn't we define a protocol
> > > that the kernel will call from the EFI stub (before EBS()) to identify
> > > itself as an image that understands unaccepted memory, and knows how
> > > to deal with it?
> > >
> > > That way, the firmware can accept all the memory on behalf of the OS
> > > at ExitBootServices() time, unless the OS has indicated there is no
> > > need to do so.
> >
> > I agree it would be better. But I think it would require change to EFI
> > spec, no?
>
> Could this somehow be amended on to the UEFI Specification version 2.9
> change which added all of the unaccepted memory features?
>

Why would this need a change in the EFI spec? Not every EFI protocol
needs to be in the spec.
Kirill A. Shutemov June 27, 2022, 10:38 p.m. UTC | #22
On Mon, Jun 27, 2022 at 06:33:51PM +0200, Ard Biesheuvel wrote:
> > > > >
> > > > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > > > the built bzImage. We have info on SMP and preeption there already.
> > > > >
> > > >
> > > > Instead of hacking this into the binary, couldn't we define a protocol
> > > > that the kernel will call from the EFI stub (before EBS()) to identify
> > > > itself as an image that understands unaccepted memory, and knows how
> > > > to deal with it?
> > > >
> > > > That way, the firmware can accept all the memory on behalf of the OS
> > > > at ExitBootServices() time, unless the OS has indicated there is no
> > > > need to do so.
> > >
> > > I agree it would be better. But I think it would require change to EFI
> > > spec, no?
> >
> > Could this somehow be amended on to the UEFI Specification version 2.9
> > change which added all of the unaccepted memory features?
> >
> 
> Why would this need a change in the EFI spec? Not every EFI protocol
> needs to be in the spec.

My EFI knowledge is shallow. Do we do this in other cases?
Ard Biesheuvel June 28, 2022, 5:17 p.m. UTC | #23
On Tue, 28 Jun 2022 at 00:38, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Mon, Jun 27, 2022 at 06:33:51PM +0200, Ard Biesheuvel wrote:
> > > > > >
> > > > > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > > > > the built bzImage. We have info on SMP and preeption there already.
> > > > > >
> > > > >
> > > > > Instead of hacking this into the binary, couldn't we define a protocol
> > > > > that the kernel will call from the EFI stub (before EBS()) to identify
> > > > > itself as an image that understands unaccepted memory, and knows how
> > > > > to deal with it?
> > > > >
> > > > > That way, the firmware can accept all the memory on behalf of the OS
> > > > > at ExitBootServices() time, unless the OS has indicated there is no
> > > > > need to do so.
> > > >
> > > > I agree it would be better. But I think it would require change to EFI
> > > > spec, no?
> > >
> > > Could this somehow be amended on to the UEFI Specification version 2.9
> > > change which added all of the unaccepted memory features?
> > >
> >
> > Why would this need a change in the EFI spec? Not every EFI protocol
> > needs to be in the spec.
>
> My EFI knowledge is shallow. Do we do this in other cases?
>

The E in EFI means 'extensible' and the whole design of a protocol
database using GUIDs as identifiers (which will not collide and
therefore need no a priori coordination when defining them) is
intended to allow extensions to be defined and implemented in a
distributed manner.

Of course, it would be fantastic if we can converge on a protocol that
all flavors of confidential compute can use, across different OSes, so
it is generally good if a protocol is defined in *some* shared
specification. But this doesn't have to be the EFI spec.
Borislav Petkov July 17, 2022, 5:16 p.m. UTC | #24
On Wed, Jun 15, 2022 at 06:05:34PM +0300, Kirill A. Shutemov wrote:
> It also sounds painfully similar to uapi/ project. I'm not sure we want to
> go this path.

It is the same path perf tool is taking - see first paragraph:

https://lore.kernel.org/r/YtQM40VmiLTkPND2@kernel.org

I don't want to deal with the regular breakages or hacks to
boot/compressed/ just because little duplication. We copy the header
once and that's it - it doesn't even have to get updated like perf tool
does from time to time.
Kirill A. Shutemov July 18, 2022, 5:21 p.m. UTC | #25
On Tue, Jun 28, 2022 at 07:17:00PM +0200, Ard Biesheuvel wrote:
> On Tue, 28 Jun 2022 at 00:38, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 06:33:51PM +0200, Ard Biesheuvel wrote:
> > > > > > >
> > > > > > > Just as an idea, we can put info into UTS_VERSION which can be read from
> > > > > > > the built bzImage. We have info on SMP and preeption there already.
> > > > > > >
> > > > > >
> > > > > > Instead of hacking this into the binary, couldn't we define a protocol
> > > > > > that the kernel will call from the EFI stub (before EBS()) to identify
> > > > > > itself as an image that understands unaccepted memory, and knows how
> > > > > > to deal with it?
> > > > > >
> > > > > > That way, the firmware can accept all the memory on behalf of the OS
> > > > > > at ExitBootServices() time, unless the OS has indicated there is no
> > > > > > need to do so.
> > > > >
> > > > > I agree it would be better. But I think it would require change to EFI
> > > > > spec, no?
> > > >
> > > > Could this somehow be amended on to the UEFI Specification version 2.9
> > > > change which added all of the unaccepted memory features?
> > > >
> > >
> > > Why would this need a change in the EFI spec? Not every EFI protocol
> > > needs to be in the spec.
> >
> > My EFI knowledge is shallow. Do we do this in other cases?
> >
> 
> The E in EFI means 'extensible' and the whole design of a protocol
> database using GUIDs as identifiers (which will not collide and
> therefore need no a priori coordination when defining them) is
> intended to allow extensions to be defined and implemented in a
> distributed manner.
> 
> Of course, it would be fantastic if we can converge on a protocol that
> all flavors of confidential compute can use, across different OSes, so
> it is generally good if a protocol is defined in *some* shared
> specification. But this doesn't have to be the EFI spec.

I've talked with our firmware expert today and I think we have a problem
with the approach when kernel declaries support of unaccepted memory.

This apporach doesn't work if we include bootloader into the picture: if
EBS() called by bootloader we still cannot know if target kernel supports
unaccepted memory and we return to the square 1.

I think we should make it obvious from a kernel image if it supports
unaccepted memory (with UTS_VERSION or other way).

Any comments?
Dionna Amalie Glaze July 18, 2022, 11:32 p.m. UTC | #26
> I've talked with our firmware expert today and I think we have a problem
> with the approach when kernel declaries support of unaccepted memory.
>

Is this Jiewen Yao? I've been trying to design the UEFI spec change
with him. The bootloader problem he commented with this morning was
something I wasn't fully considering.

> This apporach doesn't work if we include bootloader into the picture: if
> EBS() called by bootloader we still cannot know if target kernel supports
> unaccepted memory and we return to the square 1.
>
> I think we should make it obvious from a kernel image if it supports
> unaccepted memory (with UTS_VERSION or other way).
>
> Any comments?

Is this binary parsing trick already used in EDK2? If not, I wouldn't
want to introduce an ABI-solidifying requirement like that.

A bit more cumbersome, but more flexible way to enable the feature is
an idea I had in a meeting today:
Make unaccepted memory support a feature-enabling EFI driver installed
to the EFI system partition.

* The first time you boot (setup mode), you install an EFI driver that
just sets a feature Pcd to true (using a custom protocol as Ard had
suggested above).
* The second time you boot, if the feature Pcd is true, then the UEFI
is free to not accept memory and use the unaccepted memory type. The
bootloader will run after unaccepted memory has been allowed already,
so there is no accept-all event.

The default behavior will be to accept all memory when GetMemoryMap is
called unless the feature pcd is set to true.

We can then say this driver isn't needed once some new generation of
this technology comes along and we can require unaccepted memory
support as part of that technology's baseline, or we manage to update
the UEFI spec to have GetMemoryMapEx which has unaccepted memory
support baked in and the bootloaders all know to use it.

The cloud experience will be, "is boot slow? Install this EFI driver
from the cloud service provider" to tell the UEFI to enable unaccepted
memory.
Dionna Amalie Glaze July 19, 2022, 12:31 a.m. UTC | #27
> > I think we should make it obvious from a kernel image if it supports
> > unaccepted memory (with UTS_VERSION or other way).
> >

Something I didn't address in my previous email: how would the UEFI
know where the kernel is to parse this UTS_VERSION out when it's
booting a bootloader before Linux gets booted?
Yao, Jiewen July 19, 2022, 2:48 a.m. UTC | #28
Hey
I posted my comment on Bugzilla https://bugzilla.tianocore.org/show_bug.cgi?id=3987

Let's achieve EDKII/UEFI related discussion there.

Thank you
Yao, Jiewen

> -----Original Message-----
> From: Dionna Amalie Glaze <dionnaglaze@google.com>
> Sent: Tuesday, July 19, 2022 7:32 AM
> To: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>; Peter Gonda <pgonda@google.com>;
> Borislav Petkov <bp@alien8.de>; Lutomirski, Andy <luto@kernel.org>;
> Christopherson,, Sean <seanjc@google.com>; Andrew Morton <akpm@linux-
> foundation.org>; Rodel, Jorg <jroedel@suse.de>; Andi Kleen
> <ak@linux.intel.com>; Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com>; David Rientjes
> <rientjes@google.com>; Vlastimil Babka <vbabka@suse.cz>; Tom Lendacky
> <thomas.lendacky@amd.com>; Thomas Gleixner <tglx@linutronix.de>; Peter
> Zijlstra <peterz@infradead.org>; Paolo Bonzini <pbonzini@redhat.com>; Ingo
> Molnar <mingo@redhat.com>; Varad Gautam <varad.gautam@suse.com>;
> Dario Faggioli <dfaggioli@suse.com>; Hansen, Dave <dave.hansen@intel.com>;
> Mike Rapoport <rppt@kernel.org>; David Hildenbrand <david@redhat.com>;
> Marcelo Cerri <marcelo.cerri@canonical.com>; tim.gardner@canonical.com;
> Khalid ElMously <khalid.elmously@canonical.com>; Cox, Philip
> <philip.cox@canonical.com>; the arch/x86 maintainers <x86@kernel.org>;
> Linux Memory Management List <linux-mm@kvack.org>; linux-
> coco@lists.linux.dev; linux-efi <linux-efi@vger.kernel.org>; LKML <linux-
> kernel@vger.kernel.org>; Yao, Jiewen <jiewen.yao@intel.com>
> Subject: Re: [PATCHv7 00/14] mm, x86/cc: Implement support for unaccepted
> memory
> 
> > I've talked with our firmware expert today and I think we have a problem
> > with the approach when kernel declaries support of unaccepted memory.
> >
> 
> Is this Jiewen Yao? I've been trying to design the UEFI spec change
> with him. The bootloader problem he commented with this morning was
> something I wasn't fully considering.
> 
> > This apporach doesn't work if we include bootloader into the picture: if
> > EBS() called by bootloader we still cannot know if target kernel supports
> > unaccepted memory and we return to the square 1.
> >
> > I think we should make it obvious from a kernel image if it supports
> > unaccepted memory (with UTS_VERSION or other way).
> >
> > Any comments?
> 
> Is this binary parsing trick already used in EDK2? If not, I wouldn't
> want to introduce an ABI-solidifying requirement like that.
> 
> A bit more cumbersome, but more flexible way to enable the feature is
> an idea I had in a meeting today:
> Make unaccepted memory support a feature-enabling EFI driver installed
> to the EFI system partition.
> 
> * The first time you boot (setup mode), you install an EFI driver that
> just sets a feature Pcd to true (using a custom protocol as Ard had
> suggested above).
> * The second time you boot, if the feature Pcd is true, then the UEFI
> is free to not accept memory and use the unaccepted memory type. The
> bootloader will run after unaccepted memory has been allowed already,
> so there is no accept-all event.
> 
> The default behavior will be to accept all memory when GetMemoryMap is
> called unless the feature pcd is set to true.
> 
> We can then say this driver isn't needed once some new generation of
> this technology comes along and we can require unaccepted memory
> support as part of that technology's baseline, or we manage to update
> the UEFI spec to have GetMemoryMapEx which has unaccepted memory
> support baked in and the bootloaders all know to use it.
> 
> The cloud experience will be, "is boot slow? Install this EFI driver
> from the cloud service provider" to tell the UEFI to enable unaccepted
> memory.
> 
> --
> -Dionna Glaze, PhD (she/her)
Dionna Amalie Glaze July 19, 2022, 6:29 p.m. UTC | #29
> > > I think we should make it obvious from a kernel image if it supports
> > > unaccepted memory (with UTS_VERSION or other way).
> > >
>
> Something I didn't address in my previous email: how would the UEFI
> know where the kernel is to parse this UTS_VERSION out when it's
> booting a bootloader before Linux gets booted?
>

How about instead of the limited resource of UTS_VERSION, we add a
SETUP_BOOT_FEATURES enum for setup_data in the boot header? That would
be easier to parse out and more extensible in the future.
https://www.kernel.org/doc/html/latest/x86/boot.html?highlight=boot

This can contain a bitmap of a number of features that we currently
need manual tagging for, such as SEV guest support, SEV-SNP guest
support, TDX guest support, and (CONFIG_UNACCEPTED_MEMORY, TDX) or
(CONFIG_UNACCEPTED_MEMORY, SEV-SNP).
The VMM, UEFI, or boot loader can read these from the images/kernels
and have the appropriate behavior.
Borislav Petkov July 19, 2022, 7:13 p.m. UTC | #30
On Tue, Jul 19, 2022 at 11:29:32AM -0700, Dionna Amalie Glaze wrote:
> How about instead of the limited resource of UTS_VERSION, we add a
> SETUP_BOOT_FEATURES enum for setup_data in the boot header? That would
> be easier to parse out and more extensible in the future.
> https://www.kernel.org/doc/html/latest/x86/boot.html?highlight=boot
> 
> This can contain a bitmap of a number of features that we currently
> need manual tagging for, such as SEV guest support, SEV-SNP guest
> support, TDX guest support, and (CONFIG_UNACCEPTED_MEMORY, TDX) or
> (CONFIG_UNACCEPTED_MEMORY, SEV-SNP).
> The VMM, UEFI, or boot loader can read these from the images/kernels
> and have the appropriate behavior.

I think for stuff like that you want loadflags or xloadflags in the
setup header.
Ard Biesheuvel July 19, 2022, 8:45 p.m. UTC | #31
On Tue, 19 Jul 2022 at 21:14, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jul 19, 2022 at 11:29:32AM -0700, Dionna Amalie Glaze wrote:
> > How about instead of the limited resource of UTS_VERSION, we add a
> > SETUP_BOOT_FEATURES enum for setup_data in the boot header? That would
> > be easier to parse out and more extensible in the future.
> > https://www.kernel.org/doc/html/latest/x86/boot.html?highlight=boot
> >
> > This can contain a bitmap of a number of features that we currently
> > need manual tagging for, such as SEV guest support, SEV-SNP guest
> > support, TDX guest support, and (CONFIG_UNACCEPTED_MEMORY, TDX) or
> > (CONFIG_UNACCEPTED_MEMORY, SEV-SNP).
> > The VMM, UEFI, or boot loader can read these from the images/kernels
> > and have the appropriate behavior.
>
> I think for stuff like that you want loadflags or xloadflags in the
> setup header.
>

Please, no. Let's not invent Linux/x86 specific hacks to infer whether
or not the kernel is capable of accepting memory when it is perfectly
capable of telling us directly. We will surely need something
analogous on other architectures in the future as well, so the setup
header is definitely not the right place for this.

The 'bootloader that calls EBS()' case does not apply to Linux, and
given that we are talking specifically about confidential computing
VMs here, we can afford to be normative and define something generic
that works well for us.

So let's define a way for the EFI stub to signal to the firmware
(before EBS()) that it will take control of accepting memory. The
'bootloader that calls EBS()' case can invent something along the
lines of what has been proposed in this thread to infer the
capabilities of the kernel (and decide what to signal to the
firmware). But we have no need for this additional complexity on
Linux.
Borislav Petkov July 19, 2022, 9:23 p.m. UTC | #32
On Tue, Jul 19, 2022 at 10:45:06PM +0200, Ard Biesheuvel wrote:
> So let's define a way for the EFI stub to signal to the firmware
> (before EBS()) that it will take control of accepting memory. The
> 'bootloader that calls EBS()' case can invent something along the
> lines of what has been proposed in this thread to infer the
> capabilities of the kernel (and decide what to signal to the
> firmware). But we have no need for this additional complexity on
> Linux.

To tell you the truth, I've been perusing this thread from the sidelines
and am wondering why does this need this special dance at all?

If EFI takes control of accepting memory, then when the guest kernel
boots, it'll find all memory accepted and not do anything.

If EFI doesn't accept memory, then the guest kernel will boot and do the
accepting itself.

So either I'm missing something or we're overengineering this for no
good reason...
Dave Hansen July 19, 2022, 9:35 p.m. UTC | #33
On 7/19/22 14:23, Borislav Petkov wrote:
> On Tue, Jul 19, 2022 at 10:45:06PM +0200, Ard Biesheuvel wrote:
>> So let's define a way for the EFI stub to signal to the firmware
>> (before EBS()) that it will take control of accepting memory. The
>> 'bootloader that calls EBS()' case can invent something along the
>> lines of what has been proposed in this thread to infer the
>> capabilities of the kernel (and decide what to signal to the
>> firmware). But we have no need for this additional complexity on
>> Linux.
> To tell you the truth, I've been perusing this thread from the sidelines
> and am wondering why does this need this special dance at all?
> 
> If EFI takes control of accepting memory, then when the guest kernel
> boots, it'll find all memory accepted and not do anything.
> 
> If EFI doesn't accept memory, then the guest kernel will boot and do the
> accepting itself.
> 
> So either I'm missing something or we're overengineering this for no
> good reason...

They're trying to design something that can (forever) handle guests that
might not be able to accept memory.  It's based on the idea that
*something* needs to assume control and EFI doesn't have enough
information to assume control.

I wish we didn't need all this complexity, though.

There are three entities that can influence how much memory is accepted:

1. The host
2. The guest firmware
3. The guest kernel (or bootloader or something after the firmware)

This whole thread is about how #2 and #3 talk to each other and make
sure *someone* does it.

I kinda think we should just take the guest firmware out of the picture.
 There are only going to be a few versions of the kernel that can boot
under TDX (or SEV-SNP) and *can't* handle unaccepted memory.  It seems a
bit silly to design this whole interface for a few versions of the OS
that TDX folks tell me can't be used anyway.

I think we should just say if you want to run an OS that doesn't have
unaccepted memory support, you can either:

1. Deal with that at the host level configuration
2. Boot some intermediate thing like a bootloader that does acceptance
   before running the stupid^Wunenlightended OS
3. Live with the 4GB of pre-accepted memory you get with no OS work.

Yeah, this isn't convenient for some hosts.  But, really, this is
preferable to doing an EFI/OS dance until the end of time.
Borislav Petkov July 19, 2022, 9:50 p.m. UTC | #34
On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
> They're trying to design something that can (forever) handle guests that
> might not be able to accept memory. 

Wait, what?

If you can't modify those guests to teach them to accept memory, how do
you add TDX or SNP guest support to them?

I.e., you need to modify the guests and then you can add memory
acceptance. Basically, your point below...

> It's based on the idea that *something* needs to assume control and
> EFI doesn't have enough information to assume control.
>
> I wish we didn't need all this complexity, though.
> 
> There are three entities that can influence how much memory is accepted:
> 
> 1. The host
> 2. The guest firmware
> 3. The guest kernel (or bootloader or something after the firmware)
> 
> This whole thread is about how #2 and #3 talk to each other and make
> sure *someone* does it.
> 
> I kinda think we should just take the guest firmware out of the picture.
>  There are only going to be a few versions of the kernel that can boot
> under TDX (or SEV-SNP) and *can't* handle unaccepted memory.  It seems a
> bit silly to design this whole interface for a few versions of the OS
> that TDX folks tell me can't be used anyway.
> 
> I think we should just say if you want to run an OS that doesn't have
> unaccepted memory support, you can either:
> 
> 1. Deal with that at the host level configuration
> 2. Boot some intermediate thing like a bootloader that does acceptance
>    before running the stupid^Wunenlightended OS
> 3. Live with the 4GB of pre-accepted memory you get with no OS work.
> 
> Yeah, this isn't convenient for some hosts.  But, really, this is
> preferable to doing an EFI/OS dance until the end of time.

Ack. Definitely.

Thx.
Kirill A. Shutemov July 19, 2022, 10:01 p.m. UTC | #35
On Tue, Jul 19, 2022 at 11:50:57PM +0200, Borislav Petkov wrote:
> On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
> > They're trying to design something that can (forever) handle guests that
> > might not be able to accept memory. 
> 
> Wait, what?
> 
> If you can't modify those guests to teach them to accept memory, how do
> you add TDX or SNP guest support to them?
> 
> I.e., you need to modify the guests and then you can add memory
> acceptance. Basically, your point below...
> 
> > It's based on the idea that *something* needs to assume control and
> > EFI doesn't have enough information to assume control.
> >
> > I wish we didn't need all this complexity, though.
> > 
> > There are three entities that can influence how much memory is accepted:
> > 
> > 1. The host
> > 2. The guest firmware
> > 3. The guest kernel (or bootloader or something after the firmware)
> > 
> > This whole thread is about how #2 and #3 talk to each other and make
> > sure *someone* does it.
> > 
> > I kinda think we should just take the guest firmware out of the picture.
> >  There are only going to be a few versions of the kernel that can boot
> > under TDX (or SEV-SNP) and *can't* handle unaccepted memory.  It seems a
> > bit silly to design this whole interface for a few versions of the OS
> > that TDX folks tell me can't be used anyway.
> > 
> > I think we should just say if you want to run an OS that doesn't have
> > unaccepted memory support, you can either:
> > 
> > 1. Deal with that at the host level configuration
> > 2. Boot some intermediate thing like a bootloader that does acceptance
> >    before running the stupid^Wunenlightended OS
> > 3. Live with the 4GB of pre-accepted memory you get with no OS work.
> > 
> > Yeah, this isn't convenient for some hosts.  But, really, this is
> > preferable to doing an EFI/OS dance until the end of time.
> 
> Ack. Definitely.

I like it too as it is no-code solution :P

Peter, I'm pretty sure unaccepted memory support hits upstream well before
TDX get adopted widely in production. I think it is pretty reasonable to
deal with it on host side in meanwhile.

Any objections?
Dave Hansen July 19, 2022, 10:02 p.m. UTC | #36
On 7/19/22 14:50, Borislav Petkov wrote:
> On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
>> They're trying to design something that can (forever) handle guests that
>> might not be able to accept memory. 
> Wait, what?
> 
> If you can't modify those guests to teach them to accept memory, how do
> you add TDX or SNP guest support to them?

Mainline today, for instance, doesn't have unaccepted memory support for
TDX or SEV-SNP guests.  But, they both still boot fine because folks
either configure it on the host side not to *have* any unaccepted
memory.  Or, they just live with the small (4GB??) amount of
pre-accepted memory, which is fine for testing things.
Tom Lendacky July 19, 2022, 10:08 p.m. UTC | #37
On 7/19/22 17:02, Dave Hansen wrote:
> On 7/19/22 14:50, Borislav Petkov wrote:
>> On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
>>> They're trying to design something that can (forever) handle guests that
>>> might not be able to accept memory.
>> Wait, what?
>>
>> If you can't modify those guests to teach them to accept memory, how do
>> you add TDX or SNP guest support to them?
> 
> Mainline today, for instance, doesn't have unaccepted memory support for
> TDX or SEV-SNP guests.  But, they both still boot fine because folks
> either configure it on the host side not to *have* any unaccepted
> memory.  Or, they just live with the small (4GB??) amount of
> pre-accepted memory, which is fine for testing things.

Today, for SEV-SNP, OVMF accepts all of the memory in advance of booting 
the kernel.

Thanks,
Tom

>
Marc Orr July 20, 2022, 12:26 a.m. UTC | #38
On Tue, Jul 19, 2022 at 3:02 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 7/19/22 14:50, Borislav Petkov wrote:
> > On Tue, Jul 19, 2022 at 02:35:45PM -0700, Dave Hansen wrote:
> >> They're trying to design something that can (forever) handle guests that
> >> might not be able to accept memory.
> > Wait, what?
> >
> > If you can't modify those guests to teach them to accept memory, how do
> > you add TDX or SNP guest support to them?
>
> Mainline today, for instance, doesn't have unaccepted memory support for
> TDX or SEV-SNP guests.  But, they both still boot fine because folks
> either configure it on the host side not to *have* any unaccepted
> memory.  Or, they just live with the small (4GB??) amount of
> pre-accepted memory, which is fine for testing things.

For us (Google cloud), "1. Deal with that at the host level
configuration" looks like:
https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images#guest-os-features

In other words, we have to tag images with "feature tags" to
distinguish which images have kernels that support which features.

Part of the reason we need to do it this way is that we use a single
guest firmware (i.e., guest UEFI) that lives outside of the image.

These feature tags are a mess to keep track of.

All that being said, I can totally see the upstream perspective being
"not our problem". It's hard to argue with that :-).

A few more thoughts:

- If the guest-side patches weren't upstream before this patch set to
handle unaccepted memory, you're all definitely right, that this isn't
a real issue. (Maybe it still isn't...)
- Do we anticipate (many) more features for confidential compute in
the future that require code in both the guest FW and guest kernel? If
yes, then designing a FW-kernel feature negotiation could be useful
beyond this situation.
- Dave's suggestion to "2. Boot some intermediate thing like a
bootloader that does acceptance ..." is pretty clever! So if upstream
thinks this FW-kernel negotiation is not a good direction, maybe we
(Google) can pursue this idea to avoid introducing yet another tag on
our images.

Thank you all for this discussion.

Thanks,
Marc
Borislav Petkov July 20, 2022, 5:44 a.m. UTC | #39
On Tue, Jul 19, 2022 at 05:26:21PM -0700, Marc Orr wrote:
> These feature tags are a mess to keep track of.

Well, looking at those tags, it doesn't look like you'll stop using them
anytime soon.

And once all the required SNP/TDX features are part of the guest image,
- including unaccepted memory - if anything, you'll have less tags.

:-)

> - Do we anticipate (many) more features for confidential compute in
> the future that require code in both the guest FW and guest kernel? If
> yes, then designing a FW-kernel feature negotiation could be useful
> beyond this situation.

Good question.

> - Dave's suggestion to "2. Boot some intermediate thing like a
> bootloader that does acceptance ..." is pretty clever! So if upstream
> thinks this FW-kernel negotiation is not a good direction, maybe we
> (Google) can pursue this idea to avoid introducing yet another tag on
> our images.

Are those tags really that nasty so that you guys are looking at
upstream changes just to avoid them?

Thx.
Marc Orr July 20, 2022, 5:03 p.m. UTC | #40
On Tue, Jul 19, 2022 at 10:44 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Jul 19, 2022 at 05:26:21PM -0700, Marc Orr wrote:
> > These feature tags are a mess to keep track of.
>
> Well, looking at those tags, it doesn't look like you'll stop using them
> anytime soon.
>
> And once all the required SNP/TDX features are part of the guest image,
> - including unaccepted memory - if anything, you'll have less tags.
>
> :-)

Yeah, once all of the features are a part of the guest image AND any
older images with SNP/TDX minus the features are deprecated. I agree.

> > - Do we anticipate (many) more features for confidential compute in
> > the future that require code in both the guest FW and guest kernel? If
> > yes, then designing a FW-kernel feature negotiation could be useful
> > beyond this situation.
>
> Good question.
>
> > - Dave's suggestion to "2. Boot some intermediate thing like a
> > bootloader that does acceptance ..." is pretty clever! So if upstream
> > thinks this FW-kernel negotiation is not a good direction, maybe we
> > (Google) can pursue this idea to avoid introducing yet another tag on
> > our images.
>
> Are those tags really that nasty so that you guys are looking at
> upstream changes just to avoid them?

Generally, no. But the problem with tags is that distros tag their
images wrong sometimes. And that leads to problems. For example, I
just got a bug assigned to me yesterday about some ARM image tagged as
SEV_CAPABLE. Oops. Lol :-). (Though, I'm pretty sure we won't try to
boot an ARM image on a non-ARM host anyway; but it's still wrong...)

That being said, this lazy accept problem is sort of a special case,
since it requires deploying code to the guest FW and the guest kernel.
I'm still relatively new at all of this, but other than the
SNP/TDX-enlightenment patches themselves,  I haven't really seen any
other examples of this. So that goes back to my previous question. Is
this going to happen a lot more? If not, I can definitely see value in
the argument to skip the complexity of the FW/kernel feature
negotiation.

Another thing I thought of since my last reply, that's mostly an
internal solution to this problem on our side: Going back to Dave's
10k-foot view of the different angles of how to solve this. For "1.
Deal with that at the host level configuration", I'm thinking we could
tag the images with their internal guest kernel version. For example,
if an image has a 5.15 kernel, then we could have a `KERNEL_5_15` tag.
This would then allow us to have logic in the guest FW like:

if (guest_kernel_is_at_least(/*major=*/5, /*minor=*/15)
     enable_lazy_accept = true;

One detail I actually missed in all of this, is how the guest image
tag gets propagated into the guest FW in this approach. (Apologies for
this, as that's a pretty big oversight on my part.) Dionna: Have you
thought about this? Presumably this requires some sort of paravirt for
the guest to ask the host. And for any paravirt interface, now we need
to think about if it degrades the security of the confidential VMs.
Though, using it to get the kernel version to decide whether or not to
accept the memory within the guest UEFI or mark it as unaccepted seems
fine from a security angle to me.

Also, tagging images with their underlying kernel versions still seems
susceptible to mis-labeling. But this seems like it can be mostly
"fixed" via automation (e.g., write a tool to boot the guest and ask
it what it's kernel version is and use the result to attach the tag).
Also, tagging the images with their kernel version seems like a much
more general solution to these sorts of issues.

Thoughts?
Dave Hansen July 21, 2022, 5:12 p.m. UTC | #41
On 7/19/22 17:26, Marc Orr wrote:
> - Dave's suggestion to "2. Boot some intermediate thing like a
> bootloader that does acceptance ..." is pretty clever! So if upstream
> thinks this FW-kernel negotiation is not a good direction, maybe we
> (Google) can pursue this idea to avoid introducing yet another tag on
> our images.

I'm obviously speaking only for myself here and not for "upstream" as a
whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
permanent pain in our necks to solve a very temporary problem.
Borislav Petkov July 22, 2022, 3:07 p.m. UTC | #42
On Wed, Jul 20, 2022 at 10:03:40AM -0700, Marc Orr wrote:
> Generally, no. But the problem with tags is that distros tag their
> images wrong sometimes. And that leads to problems. For example, I
> just got a bug assigned to me yesterday about some ARM image tagged as
> SEV_CAPABLE. Oops. Lol :-). (Though, I'm pretty sure we won't try to
> boot an ARM image on a non-ARM host anyway; but it's still wrong...)

Yeah, even if, let it crash'n'burn - people will notice pretty quickly.

> That being said, this lazy accept problem is sort of a special case,
> since it requires deploying code to the guest FW and the guest kernel.
> I'm still relatively new at all of this, but other than the
> SNP/TDX-enlightenment patches themselves,  I haven't really seen any
> other examples of this. So that goes back to my previous question. Is
> this going to happen a lot more?

Good question.

Unfortunately, not even the architects of coco could give you an answer
because, as you see yourself, those additional features like memory
acceptance, live migration, etc keep changing - the whole coco thing is
pretty much a moving target.

For example, if someone comes along and says, err, see, I have this live
migration helper and that thing runs as an EFI executable and it is so
much better...

Not saying it'll happen but it could. I hope you're catching my drift.

> If not, I can definitely see value in the argument to skip the
> complexity of the FW/kernel feature negotiation.
>
> Another thing I thought of since my last reply, that's mostly an
> internal solution to this problem on our side: Going back to Dave's
> 10k-foot view of the different angles of how to solve this. For "1.
> Deal with that at the host level configuration", I'm thinking we could
> tag the images with their internal guest kernel version. For example,
> if an image has a 5.15 kernel, then we could have a `KERNEL_5_15` tag.
> This would then allow us to have logic in the guest FW like:
> 
> if (guest_kernel_is_at_least(/*major=*/5, /*minor=*/15)
>      enable_lazy_accept = true;

Well, I don't want to spoil your idea but imagine distros like SLE or
others backport features into old kernels. All of a sudden 5.14 or older
can do memory acceptance too. And then that version-based scheme falls
apart.

So I'm guessing it would probably be better to explicitly tag distro
images. Thing is, once all needed support gets in, you can drop the tags
and simply say, you don't support those old images anymore and assume
all required support is there and implicit...

> Also, tagging images with their underlying kernel versions still seems
> susceptible to mis-labeling. But this seems like it can be mostly
> "fixed" via automation (e.g., write a tool to boot the guest and ask
> it what it's kernel version is and use the result to attach the tag).

I'll do you one better: boot the image and check for all required
features and produce tags. Or do not accept the image as a possible coco
image. And so on.

Thx.
Ard Biesheuvel July 23, 2022, 11:14 a.m. UTC | #43
On Thu, 21 Jul 2022 at 19:13, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 7/19/22 17:26, Marc Orr wrote:
> > - Dave's suggestion to "2. Boot some intermediate thing like a
> > bootloader that does acceptance ..." is pretty clever! So if upstream
> > thinks this FW-kernel negotiation is not a good direction, maybe we
> > (Google) can pursue this idea to avoid introducing yet another tag on
> > our images.
>
> I'm obviously speaking only for myself here and not for "upstream" as a
> whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
> permanent pain in our necks to solve a very temporary problem.

EFI is basically our existing embodiment of this fw/kernel negotiation
thing, and iff we need it, I have no objection to using it for this
purpose, i.e., to allow the firmware to infer whether or not it should
accept all available memory on behalf of the OS before exiting boot
services. But if we don't need this, even better.

What I strongly object to is inventing a new bespoke way for the
firmware to make inferences about the capabilities of the image by
inspecting fields in the file representation of the image (which is
not guaranteed by EFI to be identical to its in-memory representation,
as, e.g., the PE/COFF header could be omitted by a loader without
violating the spec)

As for the intermediate thing: yes, that would be a valuable thing to
have in OVMF (and I will gladly take EDK2 patches that implement
this). However, I'm not sure how you decide whether or not this thing
should be active or not, doesn't that just move the problem around?
Dionna Amalie Glaze July 28, 2022, 10:01 p.m. UTC | #44
>
> What I strongly object to is inventing a new bespoke way for the
> firmware to make inferences about the capabilities of the image by
> inspecting fields in the file representation of the image (which is
> not guaranteed by EFI to be identical to its in-memory representation,
> as, e.g., the PE/COFF header could be omitted by a loader without
> violating the spec)
>
> As for the intermediate thing: yes, that would be a valuable thing to
> have in OVMF (and I will gladly take EDK2 patches that implement
> this). However, I'm not sure how you decide whether or not this thing
> should be active or not, doesn't that just move the problem around?

This does just move the problem around, but it makes correct behavior
the default instead of silently ignoring most of the VM's memory and
booting regularly. I have the driver mostly written to change the
behavior to accept all by default unless a driver has been installed
to set a particular boolean to make it not. Still that's yet another
thing as you say.

I agree with everyone that this situation just stinks. "Can't you just
boot it?" was asked before, and yes we can, but at the scale of a CSP
managing anybody's image uploads, that not-insignificant cost has to
be paid by someone. It's a hard problem to route the image to the
right kind of machine that's expected to be able to run it... it's a
big ol' mess.

One thing is for sure: these patches shouldn't be blocked by the "how
do we detect it" question. I'm glad to see so much engagement with
this problem, but I fear I might have delayed its progress towards a
merge. I know AMD has a follow-up to add SEV-SNP accept_memory support
to finish this all up.

I'll try to get the ear of all the distributions that are tracking
towards providing SEV-SNP-supported images for CSPs to get them on the
release that includes these patches. I'll also see about upstreaming
that EFI driver and EDK2 changes in case there's a slip in the kernel
release and we need this workaround.
--
-Dionna Glaze, PhD (she/her)
Kirill A. Shutemov Aug. 9, 2022, 11:14 a.m. UTC | #45
On Sat, Jul 23, 2022 at 01:14:07PM +0200, Ard Biesheuvel wrote:
> On Thu, 21 Jul 2022 at 19:13, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 7/19/22 17:26, Marc Orr wrote:
> > > - Dave's suggestion to "2. Boot some intermediate thing like a
> > > bootloader that does acceptance ..." is pretty clever! So if upstream
> > > thinks this FW-kernel negotiation is not a good direction, maybe we
> > > (Google) can pursue this idea to avoid introducing yet another tag on
> > > our images.
> >
> > I'm obviously speaking only for myself here and not for "upstream" as a
> > whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
> > permanent pain in our necks to solve a very temporary problem.
> 
> EFI is basically our existing embodiment of this fw/kernel negotiation
> thing, and iff we need it, I have no objection to using it for this
> purpose, i.e., to allow the firmware to infer whether or not it should
> accept all available memory on behalf of the OS before exiting boot
> services. But if we don't need this, even better.

FW/kernel negotiation does not work if there's a boot loader in the middle
that does ExitBootServices(). By the time kernel can announce if it
supports unaccepted memory there's nobody to announce to.
Ard Biesheuvel Aug. 9, 2022, 11:36 a.m. UTC | #46
On Tue, 9 Aug 2022 at 13:11, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> On Sat, Jul 23, 2022 at 01:14:07PM +0200, Ard Biesheuvel wrote:
> > On Thu, 21 Jul 2022 at 19:13, Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 7/19/22 17:26, Marc Orr wrote:
> > > > - Dave's suggestion to "2. Boot some intermediate thing like a
> > > > bootloader that does acceptance ..." is pretty clever! So if upstream
> > > > thinks this FW-kernel negotiation is not a good direction, maybe we
> > > > (Google) can pursue this idea to avoid introducing yet another tag on
> > > > our images.
> > >
> > > I'm obviously speaking only for myself here and not for "upstream" as a
> > > whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
> > > permanent pain in our necks to solve a very temporary problem.
> >
> > EFI is basically our existing embodiment of this fw/kernel negotiation
> > thing, and iff we need it, I have no objection to using it for this
> > purpose, i.e., to allow the firmware to infer whether or not it should
> > accept all available memory on behalf of the OS before exiting boot
> > services. But if we don't need this, even better.
>
> FW/kernel negotiation does not work if there's a boot loader in the middle
> that does ExitBootServices(). By the time kernel can announce if it
> supports unaccepted memory there's nobody to announce to.
>

Why would you want to support such bootloaders for TDX anyway? TDX
heavily relies on measured boot abstractions and other things that are
heavily tied to firmware.
Kirill A. Shutemov Aug. 9, 2022, 11:54 a.m. UTC | #47
On Tue, Aug 09, 2022 at 01:36:00PM +0200, Ard Biesheuvel wrote:
> On Tue, 9 Aug 2022 at 13:11, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > On Sat, Jul 23, 2022 at 01:14:07PM +0200, Ard Biesheuvel wrote:
> > > On Thu, 21 Jul 2022 at 19:13, Dave Hansen <dave.hansen@intel.com> wrote:
> > > >
> > > > On 7/19/22 17:26, Marc Orr wrote:
> > > > > - Dave's suggestion to "2. Boot some intermediate thing like a
> > > > > bootloader that does acceptance ..." is pretty clever! So if upstream
> > > > > thinks this FW-kernel negotiation is not a good direction, maybe we
> > > > > (Google) can pursue this idea to avoid introducing yet another tag on
> > > > > our images.
> > > >
> > > > I'm obviously speaking only for myself here and not for "upstream" as a
> > > > whole, but I clearly don't like the FW/kernel negotiation thing.  It's a
> > > > permanent pain in our necks to solve a very temporary problem.
> > >
> > > EFI is basically our existing embodiment of this fw/kernel negotiation
> > > thing, and iff we need it, I have no objection to using it for this
> > > purpose, i.e., to allow the firmware to infer whether or not it should
> > > accept all available memory on behalf of the OS before exiting boot
> > > services. But if we don't need this, even better.
> >
> > FW/kernel negotiation does not work if there's a boot loader in the middle
> > that does ExitBootServices(). By the time kernel can announce if it
> > supports unaccepted memory there's nobody to announce to.
> >
> 
> Why would you want to support such bootloaders for TDX anyway? TDX
> heavily relies on measured boot abstractions and other things that are
> heavily tied to firmware.

I don't understand it either. And, yet, there's demand for it.
Dionna Amalie Glaze Aug. 9, 2022, 9:09 p.m. UTC | #48
> > > > EFI is basically our existing embodiment of this fw/kernel negotiation
> > > > thing, and iff we need it, I have no objection to using it for this
> > > > purpose, i.e., to allow the firmware to infer whether or not it should
> > > > accept all available memory on behalf of the OS before exiting boot
> > > > services. But if we don't need this, even better.
> > >
> > > FW/kernel negotiation does not work if there's a boot loader in the middle
> > > that does ExitBootServices(). By the time kernel can announce if it
> > > supports unaccepted memory there's nobody to announce to.
> > >
> >
> > Why would you want to support such bootloaders for TDX anyway? TDX
> > heavily relies on measured boot abstractions and other things that are
> > heavily tied to firmware.
>
> I don't understand it either. And, yet, there's demand for it.
>

I think there's no good solution for this bad upgrade path that the
UEFI spec stuck us with, so I think I'm going to stick to what many
folks have suggested: just have the host require external information.
What this means is that at VM creation time, the user has to specify
an extra flag that all memory has to be accepted in firmware before
booting the guest OS. Failure to provide the flag leads to the
unfortunate outcome that the VM only has access to the lower 4GB of
RAM. We can only hope that the VM OOMs shortly after they start up the
machine and the user reads an FAQ that they should add this flag.

I'll do a round of appeals to distributions to include this patch set
and AMD's follow-up that defines accept_memory for SEV-SNP to reduce
the time that people need to know about this flag.