| Message ID | 20220405234343.74045-2-kirill.shutemov@linux.intel.com |
|---|---|
| State | New |
| Series | mm, x86/cc: Implement support for unaccepted memory |
On 4/5/22 16:43, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, requiring memory to be accepted before it can be used by the
           ^ require
> guest. Accepting happens via a protocol specific for the Virtual Machine
> platform.
                                            ^ s/for/to
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
>
> Support of such memory requires a few changes in core-mm code:
>
>  - memblock has to accept memory on allocation;
>
>  - page allocator has to accept memory on the first allocation of the
>    page;
>
> Memblock change is trivial.
>
> The page allocator is modified to accept pages on the first allocation.
> PageUnaccepted() is used to indicate that the page requires acceptance.

Does this consume an actual page flag or is it aliased?

> Kernel only needs to accept memory once after boot, so during the boot
> and warm up phase there will be a lot of memory acceptance. After things
> are settled down the only price of the feature if couple of checks for
> PageUnaccepted() in allocate and free paths. The check refers a hot
                                                         ^ to
...
> + /*
> +  * PageUnaccepted() indicates that the page has to be "accepted" before it can
> +  * be used. Page allocator has to call accept_page() before returning the page
> +  * to the caller.
> +  */

Let's talk about "used" with a bit more detail.  Maybe:

	/*
	 * PageUnaccepted() indicates that the page has to be "accepted" before
	 * it can be read or written. The page allocator must to call
	 * accept_page() before touching the page or returning it to the caller.
	 */

...
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2db95780e003..53f4aa1c92a7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -121,6 +121,12 @@ typedef int __bitwise fpi_t;
>   */
>  #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
>
> +/*
> + * Check if the page needs to be marked as PageUnaccepted().
> + * Used for the new pages added to the buddy allocator for the first time.
> + */
> +#define FPI_UNACCEPTED		((__force fpi_t)BIT(3))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -1023,6 +1029,26 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>  	return page_is_buddy(higher_page, higher_buddy, order + 1);
>  }
>
> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	int i;
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		if (PageUnaccepted(page + i))
> +			__ClearPageUnaccepted(page + i);
> +	}
> +}

It's probably worth a comment somewhere that this can be really slow.

> +static bool page_is_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +
> +	return memory_is_unaccepted(start, start + (PAGE_SIZE << order));
> +}
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -1058,6 +1084,7 @@ static inline void __free_one_page(struct page *page,
>  	unsigned long combined_pfn;
>  	struct page *buddy;
>  	bool to_tail;
> +	bool unaccepted = PageUnaccepted(page);
>
>  	VM_BUG_ON(!zone_is_initialized(zone));
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1089,6 +1116,11 @@ static inline void __free_one_page(struct page *page,
>  			clear_page_guard(zone, buddy, order, migratetype);
>  		else
>  			del_page_from_free_list(buddy, zone, order);
> +
> +		/* Mark page unaccepted if any of merged pages were unaccepted */
> +		if (PageUnaccepted(buddy))
> +			unaccepted = true;

Naming nit: following the logic with a double-negative like !unaccepted
is a bit hard.  Would this be more readable if it were:

	bool page_needs_acceptance = PageUnaccepted(page);

and then the code below...

>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
> @@ -1124,6 +1156,17 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>  	set_buddy_order(page, order);
>
> +	/*
> +	 * Check if the page needs to be marked as PageUnaccepted().
> +	 * Used for the new pages added to the buddy allocator for the first
> +	 * time.
> +	 */
> +	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
> +		unaccepted = page_is_unaccepted(page, order);

	if (page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED))
		page_needs_acceptance = page_is_unaccepted(page, order);

> +	if (unaccepted)
> +		__SetPageUnaccepted(page);

This is getting hard for me to follow.

There are:
1. Pages that come in here with PageUnaccepted()==1
2. Pages that come in here with PageUnaccepted()==0, but a buddy that
   was PageUnaccepted()==1

In either of those cases, the bitmap will be consulted to see if the
page is *truly* unaccepted or not.  But, I'm struggling to figure out
how a page could end up in one of those scenarios and *not* be
page_is_unaccepted().

There are three pieces of information that come in:
1. PageUnaccepted(page)
2. PageUnaccepted(buddies[])
3. the bitmap

and one piece of information going out:

	PageUnaccepted(page);

I think I need a more coherent description of how those four things fit
together.

>  	if (fpi_flags & FPI_TO_TAIL)
>  		to_tail = true;
>  	else if (is_shuffle_order(order))
> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
>  static inline bool page_expected_state(struct page *page,
>  					unsigned long check_flags)
>  {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> +	    !PageUnaccepted(page))
>  		return false;

That probably deserves a comment, and maybe its own if() statement.

>  	if (unlikely((unsigned long)page->mapping |
> @@ -1654,7 +1698,8 @@ void __free_pages_core(struct page *page, unsigned int order)
>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>  	 * relevant for memory onlining.
>  	 */
> -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +	__free_pages_ok(page, order,
> +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | FPI_UNACCEPTED);
>  }
>
>  #ifdef CONFIG_NUMA
> @@ -1807,6 +1852,7 @@ static void __init deferred_free_range(unsigned long pfn,
>  		return;
>  	}
>
> +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);

Comment, please.  I assume doing the slow accept up front is OK here
because this is in the deferred path.  But, it would be nice to know for
sure.
> @@ -2266,6 +2312,10 @@ static inline void expand(struct zone *zone, struct page *page,
>  		if (set_page_guard(zone, &page[size], high, migratetype))
>  			continue;
>
> +		/* Transfer PageUnaccepted() to the newly split pages */
> +		if (PageUnaccepted(page))
> +			__SetPageUnaccepted(&page[size]);

We don't want to just accept the page here, right?  Because we're
holding the zone lock?  Maybe we should mention that:

		/*
		 * Transfer PageUnaccepted() to the newly split pages so
		 * they can be accepted after dropping the zone lock.
		 */

>  		add_to_free_list(&page[size], zone, high, migratetype);
>  		set_buddy_order(&page[size], high);
>  	}
> @@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	 */
>  	kernel_unpoison_pages(page, 1 << order);
>
> +	if (PageUnaccepted(page))
> +		accept_page(page, order);
> +
>  	/*
>  	 * As memory initialization might be integrated into KASAN,
>  	 * KASAN unpoisoning and memory initializion code must be

Is accepted memory guaranteed to be zeroed?  Do we want to skip the
__GFP_ZERO behavior later in this function?  Or is that just a silly
over-optimization?
On 4/5/22 16:43, Kirill A. Shutemov wrote:
> Kernel only needs to accept memory once after boot, so during the boot
> and warm up phase there will be a lot of memory acceptance. After things
> are settled down the only price of the feature if couple of checks for
> PageUnaccepted() in allocate and free paths. The check refers a hot
> variable (that also encodes PageBuddy()), so it is cheap and not visible
> on profiles.

Let's also not sugar-coat this.  Page acceptance is hideously slow.
It's agonizingly slow.  To boot, it's done holding a global spinlock
with interrupts disabled (see patch 6/8).  At the very, very least, each
acceptance operation involves a couple of what are effectively ring
transitions, a 2MB memset(), and a bunch of cache flushing.

The system is going to be downright unusable during this time, right?

Sure, it's *temporary* and only happens once at boot.  But, it's going
to suck.

Am I over-stating this in any way?

The ACCEPT_MEMORY vmstat is good to have around.  Thanks for adding it.
But, I think we should also write down some guidance like:

	If your TDX system seems as slow as a snail after boot, look at
	the "accept_memory" counter in /proc/vmstat.  If it is
	incrementing, then TDX memory acceptance is likely to blame.

Do we need anything more discrete to tell users when acceptance is over?
For instance, maybe they run something and it goes really slow, they
watch "accept_memory" until it stops.  They rejoice at their good
fortune!  Then, memory allocation starts falling over to a new node and
the agony begins anew.

I can think of dealing with this in two ways:

	cat /sys/.../unaccepted_pages_left

which just walks the bitmap and counts the number of pages remaining. or
something like:

	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop

Which will, well, make the pain stop on node0.
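For concreteness, a minimal sketch of the read-only sysfs counter proposed above. The attribute name follows the example in the mail, but the backing bitmap symbols (`unaccepted_bitmap`, `unaccepted_bitmap_bits`, one set bit per still-unaccepted 2MB chunk) and the attribute placement are hypothetical stand-ins, not part of the series:

```c
/*
 * Sketch only: report how many base pages remain unaccepted.
 * unaccepted_bitmap/unaccepted_bitmap_bits are hypothetical symbols;
 * the series defines an EFI-provided bitmap, not these names.
 */
#include <linux/device.h>
#include <linux/bitmap.h>
#include <linux/mm.h>

extern unsigned long *unaccepted_bitmap;	/* hypothetical: 1 bit per 2MB chunk */
extern unsigned long unaccepted_bitmap_bits;	/* hypothetical: bitmap size in bits */

static ssize_t unaccepted_pages_left_show(struct device *dev,
					  struct device_attribute *attr,
					  char *buf)
{
	unsigned long chunks;

	/* Count still-unaccepted 2MB chunks, then scale to 4k pages. */
	chunks = bitmap_weight(unaccepted_bitmap, unaccepted_bitmap_bits);

	return sysfs_emit(buf, "%lu\n", chunks * (PMD_SIZE / PAGE_SIZE));
}
static DEVICE_ATTR_RO(unaccepted_pages_left);
```

A real patch would hang this off the per-node devices so "falling over to a new node" is visible per node.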
On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, requiring memory to be accepted before it can be used by the
>
>            ^ require

Heh. That's wording from the spec.

> > guest. Accepting happens via a protocol specific for the Virtual Machine
> > platform.
>
>                                             ^ s/for/to
>
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> >
> > Support of such memory requires a few changes in core-mm code:
> >
> >  - memblock has to accept memory on allocation;
> >
> >  - page allocator has to accept memory on the first allocation of the
> >    page;
> >
> > Memblock change is trivial.
> >
> > The page allocator is modified to accept pages on the first allocation.
> > PageUnaccepted() is used to indicate that the page requires acceptance.
>
> Does this consume an actual page flag or is it aliased?

It is encoded as a page type in mapcount of unallocated memory. It is not
aliased with PageOffline() as I did before.

I will mention that it is a new page type.

> > Kernel only needs to accept memory once after boot, so during the boot
> > and warm up phase there will be a lot of memory acceptance. After things
> > are settled down the only price of the feature if couple of checks for
> > PageUnaccepted() in allocate and free paths. The check refers a hot
>
>                                                          ^ to
>
> ...
> > + /*
> > +  * PageUnaccepted() indicates that the page has to be "accepted" before it can
> > +  * be used. Page allocator has to call accept_page() before returning the page
> > +  * to the caller.
> > +  */
>
> Let's talk about "used" with a bit more detail.  Maybe:
>
> 	/*
> 	 * PageUnaccepted() indicates that the page has to be "accepted" before
> 	 * it can be read or written. The page allocator must to call
> 	 * accept_page() before touching the page or returning it to the caller.
> 	 */

I guess s/must to call/must call/, right?

> ...
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2db95780e003..53f4aa1c92a7 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -121,6 +121,12 @@ typedef int __bitwise fpi_t;
> >   */
> >  #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
> >
> > +/*
> > + * Check if the page needs to be marked as PageUnaccepted().
> > + * Used for the new pages added to the buddy allocator for the first time.
> > + */
> > +#define FPI_UNACCEPTED	((__force fpi_t)BIT(3))
> > +
> >  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> >  static DEFINE_MUTEX(pcp_batch_high_lock);
> >  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> > @@ -1023,6 +1029,26 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> >  	return page_is_buddy(higher_page, higher_buddy, order + 1);
> >  }
> >
> > +static void accept_page(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +	int i;
> > +
> > +	accept_memory(start, start + (PAGE_SIZE << order));
> > +
> > +	for (i = 0; i < (1 << order); i++) {
> > +		if (PageUnaccepted(page + i))
> > +			__ClearPageUnaccepted(page + i);
> > +	}
> > +}
>
> It's probably worth a comment somewhere that this can be really slow.
>
> > +static bool page_is_unaccepted(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +
> > +	return memory_is_unaccepted(start, start + (PAGE_SIZE << order));
> > +}
> > +
> >  /*
> >   * Freeing function for a buddy system allocator.
> >   *
> > @@ -1058,6 +1084,7 @@ static inline void __free_one_page(struct page *page,
> >  	unsigned long combined_pfn;
> >  	struct page *buddy;
> >  	bool to_tail;
> > +	bool unaccepted = PageUnaccepted(page);
> >
> >  	VM_BUG_ON(!zone_is_initialized(zone));
> >  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> > @@ -1089,6 +1116,11 @@ static inline void __free_one_page(struct page *page,
> >  			clear_page_guard(zone, buddy, order, migratetype);
> >  		else
> >  			del_page_from_free_list(buddy, zone, order);
> > +
> > +		/* Mark page unaccepted if any of merged pages were unaccepted */
> > +		if (PageUnaccepted(buddy))
> > +			unaccepted = true;
>
> Naming nit: following the logic with a double-negative like !unaccepted
> is a bit hard.  Would this be more readable if it were:
>
> 	bool page_needs_acceptance = PageUnaccepted(page);
>
> and then the code below...
>
> >  		combined_pfn = buddy_pfn & pfn;
> >  		page = page + (combined_pfn - pfn);
> >  		pfn = combined_pfn;
> > @@ -1124,6 +1156,17 @@ static inline void __free_one_page(struct page *page,
> >  done_merging:
> >  	set_buddy_order(page, order);
> >
> > +	/*
> > +	 * Check if the page needs to be marked as PageUnaccepted().
> > +	 * Used for the new pages added to the buddy allocator for the first
> > +	 * time.
> > +	 */
> > +	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
> > +		unaccepted = page_is_unaccepted(page, order);
>
> 	if (page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED))
> 		page_needs_acceptance = page_is_unaccepted(page, order);
>
> > +	if (unaccepted)
> > +		__SetPageUnaccepted(page);
>
> This is getting hard for me to follow.
>
> There are:
> 1. Pages that come in here with PageUnaccepted()==1
> 2. Pages that come in here with PageUnaccepted()==0, but a buddy that
>    was PageUnaccepted()==1
>
> In either of those cases, the bitmap will be consulted to see if the
> page is *truly* unaccepted or not.  But, I'm struggling to figure out
> how a page could end up in one of those scenarios and *not* be
> page_is_unaccepted().
>
> There are three pieces of information that come in:
> 1. PageUnaccepted(page)
> 2. PageUnaccepted(buddies[])
> 3. the bitmap

1 and 2 are the same conceptually: merged-in pieces of the resulting page.

>
> and one piece of information going out:
>
> 	PageUnaccepted(page);
>
> I think I need a more coherent description of how those four things fit
> together.

The page gets marked as PageUnaccepted() if any of the merged-in pages is
PageUnaccepted().

For new pages, just being added to the buddy allocator, consult
page_is_unaccepted(). FPI_UNACCEPTED indicates that the page is new and
the page_is_unaccepted() check is required.

Avoid calling page_is_unaccepted() if it is known that the page needs
acceptance anyway. It can be costly.

Is it a good enough explanation?

FPI_UNACCEPTED is not a good name. Any help with a better one?
FPI_CHECK_UNACCEPTED?
> > 	if (fpi_flags & FPI_TO_TAIL)
> > 		to_tail = true;
> > 	else if (is_shuffle_order(order))
> > @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> >  static inline bool page_expected_state(struct page *page,
> >  					unsigned long check_flags)
> >  {
> > -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > +	    !PageUnaccepted(page))
> >  		return false;
>
> That probably deserves a comment, and maybe its own if() statement.

Own if does not work. PageUnaccepted() is encoded in _mapcount.

What about this:

	/*
	 * page->_mapcount is expected to be -1.
	 *
	 * There is an exception for PageUnaccepted(). The page type can be set
	 * for pages on free list. Page types are encoded in _mapcount.
	 *
	 * PageUnaccepted() will get cleared in post_alloc_hook().
	 */
	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
		return false;

?

> >  	if (unlikely((unsigned long)page->mapping |
> > @@ -1654,7 +1698,8 @@ void __free_pages_core(struct page *page, unsigned int order)
> >  	 * Bypass PCP and place fresh pages right to the tail, primarily
> >  	 * relevant for memory onlining.
> >  	 */
> > -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> > +	__free_pages_ok(page, order,
> > +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | FPI_UNACCEPTED);
> >  }
> >
> >  #ifdef CONFIG_NUMA
> > @@ -1807,6 +1852,7 @@ static void __init deferred_free_range(unsigned long pfn,
> >  		return;
> >  	}
> >
> > +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
> >  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
> >  		if ((pfn & (pageblock_nr_pages - 1)) == 0)
> >  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>
> Comment, please.  I assume doing the slow accept up front is OK here
> because this is in the deferred path.  But, it would be nice to know for
> sure.

It is upfront acceptance of less than a pageblock. I will add a comment.

>
> > @@ -2266,6 +2312,10 @@ static inline void expand(struct zone *zone, struct page *page,
> >  		if (set_page_guard(zone, &page[size], high, migratetype))
> >  			continue;
> >
> > +		/* Transfer PageUnaccepted() to the newly split pages */
> > +		if (PageUnaccepted(page))
> > +			__SetPageUnaccepted(&page[size]);
>
> We don't want to just accept the page here, right?  Because we're
> holding the zone lock?  Maybe we should mention that:
>
> 		/*
> 		 * Transfer PageUnaccepted() to the newly split pages so
> 		 * they can be accepted after dropping the zone lock.
> 		 */

Okay.

> >  		add_to_free_list(&page[size], zone, high, migratetype);
> >  		set_buddy_order(&page[size], high);
> >  	}
> > @@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  	 */
> >  	kernel_unpoison_pages(page, 1 << order);
> >
> > +	if (PageUnaccepted(page))
> > +		accept_page(page, order);
> > +
> >  	/*
> >  	 * As memory initialization might be integrated into KASAN,
> >  	 * KASAN unpoisoning and memory initializion code must be
>
> Is accepted memory guaranteed to be zeroed?  Do we want to skip the
> __GFP_ZERO behavior later in this function?  Or is that just a silly
> over-optimization?

For TDX, it is true that the memory gets cleared on acceptance, but I
don't think we can say the same for any possible implementation.

I would rather leave __GFP_ZERO for peace of mind. Clearing the cache-hot
page for the second time shouldn't be a big deal compared to the
acceptance cost.
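As background for the page-type discussion above, here is a small userspace demonstration of the inverted encoding Kirill describes. The constants mirror include/linux/page-flags.h plus the PG_unaccepted value added by this patch; the harness itself is purely illustrative:

```c
#include <stdio.h>

/*
 * Page types live in the word that also holds _mapcount and are encoded
 * inverted: a freshly freed page has _mapcount == -1 (all bits set), and
 * "setting" a type *clears* its bit.
 */
#define PAGE_TYPE_BASE	0xf0000000u
#define PG_unaccepted	0x00000800u	/* added by this patch */

int main(void)
{
	unsigned int page_type = -1;	/* no type set, no mappings */

	/* __SetPageUnaccepted(): clear the type bit */
	page_type &= ~PG_unaccepted;

	/* PageType() test, as PageUnaccepted() expands to */
	printf("PageUnaccepted: %d\n",
	       (page_type & (PAGE_TYPE_BASE | PG_unaccepted)) == PAGE_TYPE_BASE);

	/* __ClearPageUnaccepted(): set the bit back */
	page_type |= PG_unaccepted;

	printf("PageUnaccepted after clear: %d\n",
	       (page_type & (PAGE_TYPE_BASE | PG_unaccepted)) == PAGE_TYPE_BASE);
	return 0;
}
```

This prints 1 and then 0: an unaccepted page deviates from -1 by exactly one cleared bit, which is what the `(_mapcount | PG_unaccepted) != -1` check later in the thread exploits.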
On Fri, Apr 08, 2022 at 12:11:58PM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > Kernel only needs to accept memory once after boot, so during the boot
> > and warm up phase there will be a lot of memory acceptance. After things
> > are settled down the only price of the feature if couple of checks for
> > PageUnaccepted() in allocate and free paths. The check refers a hot
> > variable (that also encodes PageBuddy()), so it is cheap and not visible
> > on profiles.
>
> Let's also not sugar-coat this.  Page acceptance is hideously slow.
> It's agonizingly slow.  To boot, it's done holding a global spinlock
> with interrupts disabled (see patch 6/8).  At the very, very least, each
> acceptance operation involves a couple of what are effectively ring
> transitions, a 2MB memset(), and a bunch of cache flushing.
>
> The system is going to be downright unusable during this time, right?

Well, yes. The CPU that is doing the accepting is completely blocked by
it. But other CPUs may proceed until each, in its turn, steps onto memory
accepting.

> Sure, it's *temporary* and only happens once at boot.  But, it's going
> to suck.
>
> Am I over-stating this in any way?
>
> The ACCEPT_MEMORY vmstat is good to have around.  Thanks for adding it.
> But, I think we should also write down some guidance like:
>
> 	If your TDX system seems as slow as a snail after boot, look at
> 	the "accept_memory" counter in /proc/vmstat.  If it is
> 	incrementing, then TDX memory acceptance is likely to blame.

Sure. Will add to commit message.

> Do we need anything more discrete to tell users when acceptance is over?

I can imagine setups where acceptance is never over. A VM running a
workload with a fixed dataset can have plenty of memory unaccepted.

I don't think "make it over" should be the goal.

> For instance, maybe they run something and it goes really slow, they
> watch "accept_memory" until it stops.  They rejoice at their good
> fortune!  Then, memory allocation starts falling over to a new node and
> the agony begins anew.
>
> I can think of dealing with this in two ways:
>
> 	cat /sys/.../unaccepted_pages_left
>
> which just walks the bitmap and counts the number of pages remaining. or
> something like:
>
> 	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop
>
> Which will, well, make the pain stop on node0.

Sure we can add handles. But API is hard. Maybe we should wait and see
what is actually needed. (Yes, I'm lazy.:)
On 4/9/22 08:54, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
>>> The page allocator is modified to accept pages on the first allocation.
>>> PageUnaccepted() is used to indicate that the page requires acceptance.
>>
>> Does this consume an actual page flag or is it aliased?
>
> It is encoded as a page type in mapcount of unallocated memory. It is not
> aliased with PageOffline() as I did before.
>
> I will mention that it is a new page type.

Guess I should have looked at the code. :)

Are we just increasingly using the StudlyNames() for anything to do with
pages?

>>> + /*
>>> +  * PageUnaccepted() indicates that the page has to be "accepted" before it can
>>> +  * be used. Page allocator has to call accept_page() before returning the page
>>> +  * to the caller.
>>> +  */
>>
>> Let's talk about "used" with a bit more detail.  Maybe:
>>
>> 	/*
>> 	 * PageUnaccepted() indicates that the page has to be "accepted" before
>> 	 * it can be read or written. The page allocator must to call
>> 	 * accept_page() before touching the page or returning it to the caller.
>> 	 */
>
> I guess s/must to call/must call/, right?

Yep.

...
>>> +	/*
>>> +	 * Check if the page needs to be marked as PageUnaccepted().
>>> +	 * Used for the new pages added to the buddy allocator for the first
>>> +	 * time.
>>> +	 */
>>> +	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
>>> +		unaccepted = page_is_unaccepted(page, order);
>>
>> 	if (page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED))
>> 		page_needs_acceptance = page_is_unaccepted(page, order);
>>
>>> +	if (unaccepted)
>>> +		__SetPageUnaccepted(page);
>>
>> This is getting hard for me to follow.
>>
>> There are:
>> 1. Pages that come in here with PageUnaccepted()==1
>> 2. Pages that come in here with PageUnaccepted()==0, but a buddy that
>>    was PageUnaccepted()==1
>>
>> In either of those cases, the bitmap will be consulted to see if the
>> page is *truly* unaccepted or not.  But, I'm struggling to figure out
>> how a page could end up in one of those scenarios and *not* be
>> page_is_unaccepted().
>>
>> There are three pieces of information that come in:
>> 1. PageUnaccepted(page)
>> 2. PageUnaccepted(buddies[])
>> 3. the bitmap
>
> 1 and 2 are the same conceptually: merged-in pieces of the resulting page.
>
>>
>> and one piece of information going out:
>>
>> 	PageUnaccepted(page);
>>
>> I think I need a more coherent description of how those four things fit
>> together.
>
> The page gets marked as PageUnaccepted() if any of the merged-in pages is
> PageUnaccepted().
>
> For new pages, just being added to the buddy allocator, consult
> page_is_unaccepted(). FPI_UNACCEPTED indicates that the page is new and
> the page_is_unaccepted() check is required.
>
> Avoid calling page_is_unaccepted() if it is known that the page needs
> acceptance anyway. It can be costly.
>
> Is it a good enough explanation?

Yeah, elaborating on the slow and fast paths makes a lot of sense.

> FPI_UNACCEPTED is not a good name. Any help with a better one?
> FPI_CHECK_UNACCEPTED?

Maybe even something like FPI_UNACCEPTED_SLOWPATH.  Then you can say
that when this is passed in the pages might not have PageUnaccepted()
set and the slow bitmap needs to be consulted.
>>> 	if (fpi_flags & FPI_TO_TAIL)
>>> 		to_tail = true;
>>> 	else if (is_shuffle_order(order))
>>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
>>>  static inline bool page_expected_state(struct page *page,
>>>  					unsigned long check_flags)
>>>  {
>>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
>>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
>>> +	    !PageUnaccepted(page))
>>>  		return false;
>>
>> That probably deserves a comment, and maybe its own if() statement.
>
> Own if does not work. PageUnaccepted() is encoded in _mapcount.
>
> What about this:
>
> 	/*
> 	 * page->_mapcount is expected to be -1.
> 	 *
> 	 * There is an exception for PageUnaccepted(). The page type can be set
> 	 * for pages on free list. Page types are encoded in _mapcount.
> 	 *
> 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> 	 */
> 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> 		return false;
>
> ?

That's better.  But, aren't the PG_* names usually reserved for real
page->flags bits?  That naming might be part of my confusion.

>>>  		add_to_free_list(&page[size], zone, high, migratetype);
>>>  		set_buddy_order(&page[size], high);
>>>  	}
>>> @@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>>>  	 */
>>>  	kernel_unpoison_pages(page, 1 << order);
>>>
>>> +	if (PageUnaccepted(page))
>>> +		accept_page(page, order);
>>> +
>>>  	/*
>>>  	 * As memory initialization might be integrated into KASAN,
>>>  	 * KASAN unpoisoning and memory initializion code must be
>>
>> Is accepted memory guaranteed to be zeroed?  Do we want to skip the
>> __GFP_ZERO behavior later in this function?  Or is that just a silly
>> over-optimization?
>
> For TDX, it is true that the memory gets cleared on acceptance, but I
> don't think we can say the same for any possible implementation.
>
> I would rather leave __GFP_ZERO for peace of mind. Clearing the cache-hot
> page for the second time shouldn't be a big deal compared to the
> acceptance cost.

Sure, fair enough.
On 4/9/22 10:52, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 12:11:58PM -0700, Dave Hansen wrote:
>> On 4/5/22 16:43, Kirill A. Shutemov wrote:
>>> Kernel only needs to accept memory once after boot, so during the boot
>>> and warm up phase there will be a lot of memory acceptance. After things
>>> are settled down the only price of the feature if couple of checks for
>>> PageUnaccepted() in allocate and free paths. The check refers a hot
>>> variable (that also encodes PageBuddy()), so it is cheap and not visible
>>> on profiles.
>>
>> Let's also not sugar-coat this.  Page acceptance is hideously slow.
>> It's agonizingly slow.  To boot, it's done holding a global spinlock
>> with interrupts disabled (see patch 6/8).  At the very, very least, each
>> acceptance operation involves a couple of what are effectively ring
>> transitions, a 2MB memset(), and a bunch of cache flushing.
>>
>> The system is going to be downright unusable during this time, right?

...

>> Do we need anything more discrete to tell users when acceptance is over?
>
> I can imagine setups where acceptance is never over. A VM running a
> workload with a fixed dataset can have plenty of memory unaccepted.
>
> I don't think "make it over" should be the goal.

I agree, there will be users that don't care when acceptance is over.
But, I'm also sure that there are users that will care deeply.

>> For instance, maybe they run something and it goes really slow, they
>> watch "accept_memory" until it stops.  They rejoice at their good
>> fortune!  Then, memory allocation starts falling over to a new node and
>> the agony begins anew.
>>
>> I can think of dealing with this in two ways:
>>
>> 	cat /sys/.../unaccepted_pages_left
>>
>> which just walks the bitmap and counts the number of pages remaining. or
>> something like:
>>
>> 	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop
>>
>> Which will, well, make the pain stop on node0.
>
> Sure we can add handles. But API is hard. Maybe we should wait and see
> what is actually needed. (Yes, I'm lazy.:)

Let's just call out the possible (probable?) need for new ABI here.
Maybe it will cue folks who care to speak up.
>>> 	if (fpi_flags & FPI_TO_TAIL)
>>> 		to_tail = true;
>>> 	else if (is_shuffle_order(order))
>>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
>>>  static inline bool page_expected_state(struct page *page,
>>>  					unsigned long check_flags)
>>>  {
>>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
>>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
>>> +	    !PageUnaccepted(page))
>>>  		return false;
>>
>> That probably deserves a comment, and maybe its own if() statement.
>
> Own if does not work. PageUnaccepted() is encoded in _mapcount.
>
> What about this:
>
> 	/*
> 	 * page->_mapcount is expected to be -1.
> 	 *
> 	 * There is an exception for PageUnaccepted(). The page type can be set
> 	 * for pages on free list. Page types are encoded in _mapcount.
> 	 *
> 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> 	 */
> 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> 		return false;
>
> ?

Please don't. Keep the usage of PG_* details inside page-flags.h
On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
>
> >>> 	if (fpi_flags & FPI_TO_TAIL)
> >>> 		to_tail = true;
> >>> 	else if (is_shuffle_order(order))
> >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> >>>  static inline bool page_expected_state(struct page *page,
> >>>  					unsigned long check_flags)
> >>>  {
> >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> >>> +	    !PageUnaccepted(page))
> >>>  		return false;
> >>
> >> That probably deserves a comment, and maybe its own if() statement.
> >
> > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> >
> > What about this:
> >
> > 	/*
> > 	 * page->_mapcount is expected to be -1.
> > 	 *
> > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > 	 * for pages on free list. Page types are encoded in _mapcount.
> > 	 *
> > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > 	 */
> > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))

Maybe I'm missing something, but isn't this true for any PageType?

> > 		return false;
> >
> > ?
>
> That's better.  But, aren't the PG_* names usually reserved for real
> page->flags bits?  That naming might be part of my confusion.

We use them for PageType as well, like PG_buddy, PG_offline, PG_table.
On Sun, Apr 10, 2022 at 11:41:57PM -0700, Dave Hansen wrote:
> Let's just call out the possible (probable?) need for new ABI here.
> Maybe it will cue folks who care to speak up.

Err, why would you teach the user to go poke at some arbitrary sysfs
nodes when the accepting code can simply issue a printk from time to
time

  "Guest unaccepted memory progress: XX%. This slows down operations at the moment."
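A rough sketch of what such a periodic progress printk could look like. Both `unaccepted_chunks_left()` and `total_chunks` are hypothetical names, and the 30-second ratelimit is an arbitrary choice, not anything from the series:

```c
#include <linux/ratelimit.h>
#include <linux/printk.h>

static void report_acceptance_progress(void)
{
	/* At most one message every 30 seconds. */
	static DEFINE_RATELIMIT_STATE(rs, 30 * HZ, 1);
	unsigned long left = unaccepted_chunks_left();	/* hypothetical helper */

	if (__ratelimit(&rs))
		pr_info("Guest unaccepted memory progress: %lu%%. This slows down operations at the moment.\n",
			100 - (100 * left) / total_chunks);	/* total_chunks: hypothetical */
}
```

The accepting path would call this from accept_memory(), so progress is reported exactly while the slowdown is happening.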
On 4/11/22 08:55, Borislav Petkov wrote:
> On Sun, Apr 10, 2022 at 11:41:57PM -0700, Dave Hansen wrote:
>> Let's just call out the possible (probable?) need for new ABI here.
>> Maybe it will cue folks who care to speak up.
> Err, why would you teach the user to go poke at some arbitrary sysfs
> nodes when the accepting code can simply issue a printk from time to
> time
>
> "Guest unaccepted memory progress: XX%. This slows down operations at the moment."

I guess that's not a horrible place to start.  It's also not *horribly*
different from how guests work today.  If hosts lazily allocate RAM,
they'll see largely the same kind of behavior.

What ends up determining how much memory is pre-accepted versus being
done from the guest?  Is that just a normal part of setting up a TDX
guest, like from the qemu cmdline?  Or, is there some convention with
the virtual firmware?
On 4/11/22 11:27, Dave Hansen wrote:
> On 4/11/22 08:55, Borislav Petkov wrote:
>> On Sun, Apr 10, 2022 at 11:41:57PM -0700, Dave Hansen wrote:
>>> Let's just call out the possible (probable?) need for new ABI here.
>>> Maybe it will cue folks who care to speak up.
>> Err, why would you teach the user to go poke at some arbitrary sysfs
>> nodes when the accepting code can simply issue a printk from time to
>> time
>>
>> "Guest unaccepted memory progress: XX%. This slows down operations at the moment."
>
> I guess that's not a horrible place to start.  It's also not *horribly*
> different from how guests work today.  If hosts lazily allocate RAM,
> they'll see largely the same kind of behavior.
>
> What ends up determining how much memory is pre-accepted versus being
> done from the guest?  Is that just a normal part of setting up a TDX
> guest, like from the qemu cmdline?  Or, is there some convention with
> the virtual firmware?

With SNP, some memory will be accepted as part of the LAUNCH_UPDATE
sequences that the hypervisor performs, but that is not all of the guest
memory. Once the guest is started, the (initial implementation of) OVMF
SNP support will accept (PVALIDATE) all of the remaining guest memory.
When the kernel boots, there isn't any unaccepted memory.

Once support is available in the kernel for unaccepted memory, then OVMF
could be updated to only accept a limited amount of memory and pass the
information about the unaccepted memory to the kernel through the EFI
memory map. The approaches would have to be measured to see which ends
up being the best one.

The GHCB specification allows for lots of memory to be accepted in a
single VMGEXIT (world switch) vs performing a VMGEXIT for each 2MB of
memory being accepted.

Thanks,
Tom
On 08.04.22 21:11, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
>> Kernel only needs to accept memory once after boot, so during the boot
>> and warm up phase there will be a lot of memory acceptance. After things
>> are settled down the only price of the feature if couple of checks for
>> PageUnaccepted() in allocate and free paths. The check refers a hot
>> variable (that also encodes PageBuddy()), so it is cheap and not visible
>> on profiles.
>
> Let's also not sugar-coat this.  Page acceptance is hideously slow.
> It's agonizingly slow.  To boot, it's done holding a global spinlock
> with interrupts disabled (see patch 6/8).  At the very, very least, each
> acceptance operation involves a couple of what are effectively ring
> transitions, a 2MB memset(), and a bunch of cache flushing.
>
> The system is going to be downright unusable during this time, right?
>
> Sure, it's *temporary* and only happens once at boot.  But, it's going
> to suck.
>
> Am I over-stating this in any way?
>
> The ACCEPT_MEMORY vmstat is good to have around.  Thanks for adding it.
> But, I think we should also write down some guidance like:
>
> 	If your TDX system seems as slow as a snail after boot, look at
> 	the "accept_memory" counter in /proc/vmstat.  If it is
> 	incrementing, then TDX memory acceptance is likely to blame.
>
> Do we need anything more discrete to tell users when acceptance is over?
> For instance, maybe they run something and it goes really slow, they
> watch "accept_memory" until it stops.  They rejoice at their good
> fortune!  Then, memory allocation starts falling over to a new node and
> the agony begins anew.
>
> I can think of dealing with this in two ways:
>
> 	cat /sys/.../unaccepted_pages_left
>
> which just walks the bitmap and counts the number of pages remaining. or
> something like:
>
> 	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop
>
> Which will, well, make the pain stop on node0.

Either I'm missing something important or the random pain might just
take a really long time to stop?

I mean, we tend to reallocate the memory first that we freed last
(putting it to the head of the freelist when freeing and picking from
the head when allocating). So unless your kernel goes crazy and
allocates each and every page right after boot, essentially accepting
all memory, you might have random unaccepted pages lurking at the tail
of the freelists.

So if the VM is running for 355 days without significant memory
pressure, you can still run into unaccepted pages at day 356 that
results in a random delay due to acceptance of memory.

I think we most certainly want some way to make the random pain stop,
or to make the random pain go away after boot quickly. The
"unaccepted_pages_left" indicator would just be a "hey, there might be
random delays, but you cannot do anything about it".

Magic toggles like "make_the_pain_stop" are not so nice. Can we simply
automate this using a kthread or smth like that, which just traverses
the free page lists and accepts pages (similar, but different to free
page reporting)?
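A rough sketch of the shape such a background accepter could take; this is an assumption about the design being discussed, not an implementation. `take_unaccepted_page()` is hypothetical and would have to cooperate with the zone lock and the allocator, much as free page reporting does:

```c
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/gfp.h>

/* Sketch only: accept memory in the background, off the allocation path. */
static int accept_memory_thread(void *arg)
{
	struct zone *zone = arg;
	unsigned int order;
	struct page *page;

	while (!kthread_should_stop()) {
		/* hypothetical: detach one PageUnaccepted() page from the free lists */
		page = take_unaccepted_page(zone, &order);
		if (!page)
			break;	/* zone fully accepted */

		accept_page(page, order);	/* the slow part, done here instead */
		__free_pages(page, order);	/* give it back, now accepted */

		cond_resched();	/* the actual workload has priority */
	}
	return 0;
}
```

The trade-off Dave raises below is exactly about whether running something like this is worth the premature memory consumption.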
On 4/12/22 01:15, David Hildenbrand wrote:
> Can we simply automate this using a kthread or smth like that, which
> just traverses the free page lists and accepts pages (similar, but
> different to free page reporting)?

That's definitely doable.

The downside is that this will force premature consumption of physical
memory resources that the guest may never use.  That's a particular
problem on TDX systems since there is no way for a VMM to reclaim guest
memory short of killing the guest.

In other words, I can see a good argument either way:
1. The kernel should accept everything to avoid the perf nastiness
2. The kernel should accept only what it needs in order to reduce memory
   use

I'm kinda partial to #1 though, if I had to pick only one.

The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
Have the rule that everything that gets a 'struct page' must be
accepted.  If you want to do delayed acceptance, you do it via
DEFERRED_STRUCT_PAGE_INIT.
On 12.04.22 18:08, Dave Hansen wrote:
> On 4/12/22 01:15, David Hildenbrand wrote:
>> Can we simply automate this using a kthread or smth like that, which
>> just traverses the free page lists and accepts pages (similar, but
>> different to free page reporting)?
>
> That's definitely doable.
>
> The downside is that this will force premature consumption of physical
> memory resources that the guest may never use.  That's a particular
> problem on TDX systems since there is no way for a VMM to reclaim guest
> memory short of killing the guest.

IIRC, the hypervisor will usually effectively populate all guest RAM
either way right now. So yes, for hypervisors that might optimize for
that, that statement would be true. But I lost track how helpful it
would be in the near future e.g., with the fd-based private guest memory
-- maybe they already optimize for delayed acceptance of memory, turning
it into delayed population.

>
> In other words, I can see a good argument either way:
> 1. The kernel should accept everything to avoid the perf nastiness
> 2. The kernel should accept only what it needs in order to reduce memory
>    use
>
> I'm kinda partial to #1 though, if I had to pick only one.
>
> The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
> Have the rule that everything that gets a 'struct page' must be
> accepted.  If you want to do delayed acceptance, you do it via
> DEFERRED_STRUCT_PAGE_INIT.

That could also be an option, yes. At least being able to choose would be
good. But IIRC, DEFERRED_STRUCT_PAGE_INIT will still make the system get
stuck during boot and wait until everything was accepted.

I see the following variants:

1) Slow boot; after boot, all memory is already accepted.
2) Fast boot; after boot, all memory will slowly but steadily get
accepted in the background. After a while, all memory is accepted and
can be signaled to user space.
3) Fast boot; after boot, memory gets accepted on demand. This is what
we have in this series.

I somehow don't quite like 3), but with deferred population in the
hypervisor, it might just make sense.
On Wed, Apr 13, 2022 at 12:36:11PM +0200, David Hildenbrand wrote:
> On 12.04.22 18:08, Dave Hansen wrote:
> > On 4/12/22 01:15, David Hildenbrand wrote:
> >> Can we simply automate this using a kthread or smth like that, which
> >> just traverses the free page lists and accepts pages (similar, but
> >> different to free page reporting)?
> >
> > That's definitely doable.
> >
> > The downside is that this will force premature consumption of physical
> > memory resources that the guest may never use.  That's a particular
> > problem on TDX systems since there is no way for a VMM to reclaim guest
> > memory short of killing the guest.
>
> IIRC, the hypervisor will usually effectively populate all guest RAM
> either way right now.

No, it is not usual. By default QEMU/KVM uses anonymous mapping and
fault-in memory on demand.

Yes, there's an option to pre-populate guest memory, but it is not the
default.

> So yes, for hypervisors that might optimize for
> that, that statement would be true. But I lost track how helpful it
> would be in the near future e.g., with the fd-based private guest memory
> -- maybe they already optimize for delayed acceptance of memory, turning
> it into delayed population.
>
> >
> > In other words, I can see a good argument either way:
> > 1. The kernel should accept everything to avoid the perf nastiness
> > 2. The kernel should accept only what it needs in order to reduce memory
> >    use
> >
> > I'm kinda partial to #1 though, if I had to pick only one.
> >
> > The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
> > Have the rule that everything that gets a 'struct page' must be
> > accepted.  If you want to do delayed acceptance, you do it via
> > DEFERRED_STRUCT_PAGE_INIT.
>
> That could also be an option, yes. At least being able to choose would be
> good. But IIRC, DEFERRED_STRUCT_PAGE_INIT will still make the system get
> stuck during boot and wait until everything was accepted.

Right. The deferred page init has to be done before init.

> I see the following variants:
>
> 1) Slow boot; after boot, all memory is already accepted.
> 2) Fast boot; after boot, all memory will slowly but steadily get
> accepted in the background. After a while, all memory is accepted and
> can be signaled to user space.
> 3) Fast boot; after boot, memory gets accepted on demand. This is what
> we have in this series.
>
> I somehow don't quite like 3), but with deferred population in the
> hypervisor, it might just make sense.

Conceptually, 3 is not different from what happens now. The first time a
normal VM touches the page (like on handling __GFP_ZERO) the page gets
allocated on the host. It can take a very long time if it kicks in direct
reclaim on the host.

The only difference is that it is *usually* slower.

I guess we can make a case for making 1 an option to match the
pre-populated use case for normal VMs.

Frankly, I think option 2 is the worst one. You still take CPU cycles
from the workload after boot to do the job that may or may not be needed.
It is a half-measure that helps nobody.
On 13.04.22 13:30, Kirill A. Shutemov wrote:
> On Wed, Apr 13, 2022 at 12:36:11PM +0200, David Hildenbrand wrote:
>> On 12.04.22 18:08, Dave Hansen wrote:
>>> On 4/12/22 01:15, David Hildenbrand wrote:
>>>> Can we simply automate this using a kthread or smth like that, which
>>>> just traverses the free page lists and accepts pages (similar, but
>>>> different to free page reporting)?
>>>
>>> That's definitely doable.
>>>
>>> The downside is that this will force premature consumption of physical
>>> memory resources that the guest may never use.  That's a particular
>>> problem on TDX systems since there is no way for a VMM to reclaim guest
>>> memory short of killing the guest.
>>
>> IIRC, the hypervisor will usually effectively populate all guest RAM
>> either way right now.
>
> No, it is not usual. By default QEMU/KVM uses anonymous mapping and
> fault-in memory on demand.
>
> Yes, there's an option to pre-populate guest memory, but it is not the
> default.

Let me be clearer: I'm talking about the TDX/SEV world, not ordinary
unencrypted VMs. For ordinary unencrypted VMs we do have populate on
demand frequently.

For SEV we currently pin all guest memory and consequently don't have
populate on demand. For TDX, again, I did not follow how fd-based
private guest memory will behave. I thought I remembered that we will
similarly not have populate-on-demand.

Preallocation is usually used with huge pages, but I guess that's out of
scope right now for encrypted VMs.
On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> >
> > >>> 	if (fpi_flags & FPI_TO_TAIL)
> > >>> 		to_tail = true;
> > >>> 	else if (is_shuffle_order(order))
> > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > >>>  static inline bool page_expected_state(struct page *page,
> > >>>  					unsigned long check_flags)
> > >>>  {
> > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > >>> +	    !PageUnaccepted(page))
> > >>>  		return false;
> > >>
> > >> That probably deserves a comment, and maybe its own if() statement.
> > >
> > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > >
> > > What about this:
> > >
> > > 	/*
> > > 	 * page->_mapcount is expected to be -1.
> > > 	 *
> > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > 	 *
> > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > 	 */
> > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
>
> Maybe I'm missing something, but isn't this true for any PageType?
>
> > > 		return false;
> > >
> > > ?
> >
> > That's better.  But, aren't the PG_* names usually reserved for real
> > page->flags bits?  That naming might be part of my confusion.
>
> We use them for PageType as well, like PG_buddy, PG_offline, PG_table.

PG_buddy gets cleared on removal from the free list, before the check.

PG_offline and PG_table pages are never on free lists.
On Wed, Apr 13, 2022 at 12:36:11PM +0200, David Hildenbrand wrote:
> On 12.04.22 18:08, Dave Hansen wrote:
> > On 4/12/22 01:15, David Hildenbrand wrote:
> >
> > The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
> > Have the rule that everything that gets a 'struct page' must be
> > accepted.  If you want to do delayed acceptance, you do it via
> > DEFERRED_STRUCT_PAGE_INIT.
>
> That could also be an option, yes. At least being able to choose would be
> good. But IIRC, DEFERRED_STRUCT_PAGE_INIT will still make the system get
> stuck during boot and wait until everything was accepted.

The deferred page init runs multithreaded, so a guest with SMP will be
stuck for less time.

> I see the following variants:
>
> 1) Slow boot; after boot, all memory is already accepted.
> 2) Fast boot; after boot, all memory will slowly but steadily get
> accepted in the background. After a while, all memory is accepted and
> can be signaled to user space.
> 3) Fast boot; after boot, memory gets accepted on demand. This is what
> we have in this series.
>
> I somehow don't quite like 3), but with deferred population in the
> hypervisor, it might just make sense.

IMHO, deferred population in the hypervisor will be way more complex than
this series with similar "visible" performance.
On Wed, Apr 13, 2022 at 02:40:01PM +0300, Kirill A. Shutemov wrote:
> On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> > On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > >
> > > >>> 	if (fpi_flags & FPI_TO_TAIL)
> > > >>> 		to_tail = true;
> > > >>> 	else if (is_shuffle_order(order))
> > > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > > >>>  static inline bool page_expected_state(struct page *page,
> > > >>>  					unsigned long check_flags)
> > > >>>  {
> > > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > > >>> +	    !PageUnaccepted(page))
> > > >>>  		return false;
> > > >>
> > > >> That probably deserves a comment, and maybe its own if() statement.
> > > >
> > > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > >
> > > > What about this:
> > > >
> > > > 	/*
> > > > 	 * page->_mapcount is expected to be -1.
> > > > 	 *
> > > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > > 	 *
> > > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > > 	 */
> > > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> >
> > Maybe I'm missing something, but isn't this true for any PageType?
>
> PG_buddy gets cleared on removal from the free list, before the check.
>
> PG_offline and PG_table pages are never on free lists.

Right, this will work 'cause PageType is inverted. I still think this
condition is hard to parse and I liked the old variant with
!PageUnaccepted() better.

Maybe if we wrap the whole construct in a helper it will be less eye
hurting.

> --
> Kirill A. Shutemov
On Wed, Apr 13, 2022 at 05:48:09PM +0300, Mike Rapoport wrote:
> On Wed, Apr 13, 2022 at 02:40:01PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> > > On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > > > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > > >
> > > > >>> 	if (fpi_flags & FPI_TO_TAIL)
> > > > >>> 		to_tail = true;
> > > > >>> 	else if (is_shuffle_order(order))
> > > > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > > > >>>  static inline bool page_expected_state(struct page *page,
> > > > >>>  					unsigned long check_flags)
> > > > >>>  {
> > > > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > > > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > > > >>> +	    !PageUnaccepted(page))
> > > > >>>  		return false;
> > > > >>
> > > > >> That probably deserves a comment, and maybe its own if() statement.
> > > > >
> > > > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > > >
> > > > > What about this:
> > > > >
> > > > > 	/*
> > > > > 	 * page->_mapcount is expected to be -1.
> > > > > 	 *
> > > > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > > > 	 *
> > > > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > > > 	 */
> > > > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> > >
> > > Maybe I'm missing something, but isn't this true for any PageType?
> >
> > PG_buddy gets cleared on removal from the free list, before the check.
> >
> > PG_offline and PG_table pages are never on free lists.
>
> Right, this will work 'cause PageType is inverted. I still think this
> condition is hard to parse and I liked the old variant with
> !PageUnaccepted() better.

Well, the old way to deal with PageUnaccepted() had a flaw: if the page is
PageUnaccepted() it will allow any other page type to pass here. Like
PG_unaccepted + PG_buddy will slide here.

> Maybe if we wrap the whole construct in a helper it will be less eye
> hurting.

Hm. Any suggestion for how such a helper could look? Cannot think of
anything sane.
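The flaw is easy to demonstrate with the real bit values (PG_buddy and PAGE_TYPE_BASE from page-flags.h, PG_unaccepted from this series); the userspace harness below is illustrative only:

```c
#include <stdio.h>

#define PAGE_TYPE_BASE	0xf0000000
#define PG_buddy	0x00000080
#define PG_unaccepted	0x00000800

/* PageUnaccepted(), as the PAGE_TYPE_OPS() macro would expand it */
static int page_unaccepted(int mapcount)
{
	return ((unsigned int)mapcount & (PAGE_TYPE_BASE | PG_unaccepted)) ==
		PAGE_TYPE_BASE;
}

int main(void)
{
	/* a broken page that is both PG_buddy and PG_unaccepted */
	int mapcount = -1 & ~(PG_buddy | PG_unaccepted);

	/* old check: only flags the page if it is not PageUnaccepted() */
	int old_flags_it = mapcount != -1 && !page_unaccepted(mapcount);

	/* new check: flags any deviation from -1 other than PG_unaccepted */
	int new_flags_it = (mapcount | PG_unaccepted) != -1;

	printf("old check catches it: %d\n", old_flags_it);	/* 0: slides through */
	printf("new check catches it: %d\n", new_flags_it);	/* 1: caught */
	return 0;
}
```

ORing PG_unaccepted back in forgives exactly that one cleared bit, so any other cleared page-type bit (like PG_buddy here) still makes the value differ from -1 and the bad state is caught.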
On 4/13/22 04:30, Kirill A. Shutemov wrote:
>> 2) Fast boot; after boot, all memory will slowly but steadily get
>> accepted in the background. After a while, all memory is accepted and
>> can be signaled to user space.
...
> Frankly, I think option 2 is the worst one. You still take CPU cycles
> from the workload after boot to do the job that may or may not be needed.
> It is a half-measure that helps nobody.

Let's not be too hyperbolic here.  "Worst" is entirely subjective and it
totally depends on your perspective and what you care about.

There are basically four options:

 * Accept everything in early boot
 * Accept with deferred page free
 * Accept with kthread after boot
 * Accept on demand

and four things that matter:

 * Code complexity
 * Time to a shell prompt
 * CPU/Memory waste
 * Deterministic overhead

Did I miss any?

News flash: none of the options wins on all the things that matter.
We're going to have to pick one (or maybe two).  I'm also not horribly
convinced that there's a problem here worth solving, especially one that
requires surgery in the core of the buddy allocator.

This is essentially making a performance argument: it takes too long to
boot if we go with a simpler solution.  Yet, I haven't seen any data.  I
think we need to go with the simplest approach(es) until there's some
actual data to guide us here.

Here's another way to look at it:

> https://docs.google.com/spreadsheets/d/1Fpv0Yp0CTF5_JXHR2pywvNtImTwUVGTxDMlJ5t8qiis/edit?usp=sharing
On 13.04.22 17:36, Dave Hansen wrote:
> On 4/13/22 04:30, Kirill A. Shutemov wrote:
>>> 2) Fast boot; after boot, all memory will slowly but steadily get
>>> accepted in the background. After a while, all memory is accepted and
>>> can be signaled to user space.
> ...
>> Frankly, I think option 2 is the worst one. You still take CPU cycles
>> from the workload after boot to do the job that may or may not be needed.
>> It is a half-measure that helps nobody.
>
> Let's not be too hyperbolic here. "Worst" is entirely subjective and it
> totally depends on your perspective and what you care about.

Right. Some people might want to start their workload as soon as the
pain is really over. Some might want to have a functional system before
that, others might not care.

>
> There are basically four options:
>
> * Accept everything in early boot
> * Accept with deferred page free
> * Accept with kthread after boot
> * Accept on demand
>
> and four things that matter:
>
> * Code complexity
> * Time to a shell prompt
> * CPU/Memory waste
> * Deterministic overhead
>
> Did I miss any?

Nothing that comes to mind.

>
> News flash: none of the options wins on all the things that matter.
> We're going to have to pick one (or maybe two). I'm also not horribly
> convinced that there's a problem here worth solving, especially one that
> requires surgery in the core of the buddy allocator.
>
> This is essentially making a performance argument: it takes too long to
> boot if we go with a simpler solution. Yet, I haven't seen any data. I
> think we need to go with the simplest approach(es) until there's some
> actual data to guide us here.

Simplest meaning: accept everything during early boot and don't touch
core-mm/buddy code, correct?

>
> Here's another way to look at it:
>
>> https://docs.google.com/spreadsheets/d/1Fpv0Yp0CTF5_JXHR2pywvNtImTwUVGTxDMlJ5t8qiis/edit?usp=sharing
>
On 4/13/22 09:07, David Hildenbrand wrote:
> Simplest meaning: accept everything during early boot and don't touch
> core-mm/buddy code, correct?

Yes, exactly.
On Wed, Apr 13, 2022 at 08:36:52AM -0700, Dave Hansen wrote:
> On 4/13/22 04:30, Kirill A. Shutemov wrote:
> >> 2) Fast boot; after boot, all memory will slowly but steadily get
> >> accepted in the background. After a while, all memory is accepted and
> >> can be signaled to user space.
> ...
> > Frankly, I think option 2 is the worst one. You still take CPU cycles
> > from the workload after boot to do the job that may or may not be needed.
> > It is a half-measure that helps nobody.
>
> Let's not be too hyperbolic here.  "Worst" is entirely subjective and it
> totally depends on your perspective and what you care about.
>
> There are basically four options:
>
>  * Accept everything in early boot
>  * Accept with deferred page free
>  * Accept with kthread after boot
>  * Accept on demand
>
> and four things that matter:
>
>  * Code complexity
>  * Time to a shell prompt
>  * CPU/Memory waste
>  * Deterministic overhead
>
> Did I miss any?

"Time to shell" is not equal to "time to do the job". Real workloads do
stuff beyond memory allocations. But, yes, it is harder to quantify.

> News flash: none of the options wins on all the things that matter.
> We're going to have to pick one (or maybe two).  I'm also not horribly
> convinced that there's a problem here worth solving, especially one that
> requires surgery in the core of the buddy allocator.
>
> This is essentially making a performance argument: it takes too long to
> boot if we go with a simpler solution.  Yet, I haven't seen any data.  I
> think we need to go with the simplest approach(es) until there's some
> actual data to guide us here.
>
> Here's another way to look at it:
>
> > https://docs.google.com/spreadsheets/d/1Fpv0Yp0CTF5_JXHR2pywvNtImTwUVGTxDMlJ5t8qiis/edit?usp=sharing

The link is view-only.

AFAICS, complexity of the kthread approach is on par or greater compared
to on-demand. You need coordination between the allocator and the thread.
It can be hard to hit the right balance for the kthread between being a
CPU hog and not providing enough accepted memory.
On Wed, Apr 13, 2022 at 06:15:17PM +0300, Kirill A. Shutemov wrote:
> On Wed, Apr 13, 2022 at 05:48:09PM +0300, Mike Rapoport wrote:
> > On Wed, Apr 13, 2022 at 02:40:01PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> > > > On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > > > > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > > > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > > > > >
> > > > > >>>  	if (fpi_flags & FPI_TO_TAIL)
> > > > > >>>  		to_tail = true;
> > > > > >>>  	else if (is_shuffle_order(order))
> > > > > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > > > > >>>  static inline bool page_expected_state(struct page *page,
> > > > > >>>  					unsigned long check_flags)
> > > > > >>>  {
> > > > > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > > > > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > > > > >>> +	    !PageUnaccepted(page))
> > > > > >>>  		return false;
> > > > > >>
> > > > > >> That probably deserves a comment, and maybe its own if() statement.
> > > > > >
> > > > > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > > > >
> > > > > > What about this:
> > > > > >
> > > > > > /*
> > > > > >  * page->_mapcount is expected to be -1.
> > > > > >  *
> > > > > >  * There is an exception for PageUnaccepted(). The page type can be set
> > > > > >  * for pages on free list. Page types are encoded in _mapcount.
> > > > > >  *
> > > > > >  * PageUnaccepted() will get cleared in post_alloc_hook().
> > > > > >  */
> > > > > > if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> > > >
> > > > Maybe I'm missing something, but isn't this true for any PageType?
> > >
> > > PG_buddy gets cleared on removal from the free list, before the check.
> > >
> > > PG_offline and PG_table pages are never on free lists.
> >
> > Right, this will work 'cause PageType is inverted. I still think this
> > condition is hard to parse and I liked the old variant with
> > !PageUnaccepted() better.
>
> Well, the old way to deal with PageUnaccepted() had a flaw: if the page
> is PageUnaccepted() it will allow any other page type to pass here. Like
> PG_unaccepted + PG_buddy will slip through here.

It seems to me that there was an implicit assumption that page types are
exclusive and PG_unaccepted would break it.

> > Maybe if we wrap the whole construct in a helper it will be less
> > eye-hurting.
>
> Hm. Any suggestion how such a helper could look? Cannot think of
> anything sane.

Me neither :(

How about updating the comment to be

	/*
	 * The page must not be mapped to userspace and must not have
	 * a PageType other than PageUnaccepted.
	 * This means that page->_mapcount must be -1 or have only the
	 * PG_unaccepted bit cleared.
	 */
	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))

> --
> Kirill A. Shutemov
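
The inverted PageType encoding is what makes the single check work: a
page type is "set" by clearing its bit in page->_mapcount, which
otherwise reads as -1. The following userspace toy (not kernel code;
only the PG_ values are copied from the patch) shows exactly which
states pass the proposed check, including why PG_unaccepted + PG_buddy
no longer slips through:

	#include <stdio.h>

	#define PG_buddy	0x00000080
	#define PG_unaccepted	0x00000800

	/* passes only when no type is set (-1) or only PG_unaccepted is */
	static int page_expected_state_ok(int mapcount)
	{
		return (mapcount | PG_unaccepted) == -1;
	}

	int main(void)
	{
		int no_type = -1;
		int only_unaccepted = -1 & ~PG_unaccepted;
		int buddy = -1 & ~PG_buddy;
		int both = -1 & ~(PG_unaccepted | PG_buddy);

		printf("no type:                %d\n", page_expected_state_ok(no_type));	/* 1 */
		printf("only PG_unaccepted:     %d\n", page_expected_state_ok(only_unaccepted));	/* 1 */
		printf("PG_buddy:               %d\n", page_expected_state_ok(buddy));	/* 0 */
		printf("PG_unaccepted+PG_buddy: %d\n", page_expected_state_ok(both));	/* 0 */
		return 0;
	}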
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9d8eeaa67d05..aaaedc111092 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -928,6 +928,7 @@ static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_unaccepted	0x00000800
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -953,6 +954,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
 	page->page_type |= PG_##lname;					\
 }
 
+#define PAGE_TYPE_OPS_FALSE(uname)					\
+static __always_inline int Page##uname(struct page *page)		\
+{									\
+	return false;							\
+}									\
+static __always_inline void __SetPage##uname(struct page *page)	\
+{									\
+}									\
+static __always_inline void __ClearPage##uname(struct page *page)	\
+{									\
+}
+
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
@@ -983,6 +996,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+/*
+ * PageUnaccepted() indicates that the page has to be "accepted" before it can
+ * be used. Page allocator has to call accept_page() before returning the page
+ * to the caller.
+ */
+#ifdef CONFIG_UNACCEPTED_MEMORY
+PAGE_TYPE_OPS(Unaccepted, unaccepted)
+#else
+PAGE_TYPE_OPS_FALSE(Unaccepted)
+#endif
+
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
 extern void page_offline_begin(void);
diff --git a/mm/internal.h b/mm/internal.h
index cf16280ce132..10302fe857c4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -758,4 +758,15 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 
 DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end)
+{
+	return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index e4f03a6e8e56..a1f7f8b304d5 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 	 */
 	kmemleak_alloc_phys(found, size, 0, 0);
 
+	/*
+	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+	 * require memory to be accepted before it can be used by the
+	 * guest.
+	 *
+	 * Accept the memory of the allocated buffer.
+	 */
+	accept_memory(found, found + size);
+
 	return found;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2db95780e003..53f4aa1c92a7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -121,6 +121,12 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
 
+/*
+ * Check if the page needs to be marked as PageUnaccepted().
+ * Used for the new pages added to the buddy allocator for the first time.
+ */
+#define FPI_UNACCEPTED		((__force fpi_t)BIT(3))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1023,6 +1029,26 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
 	return page_is_buddy(higher_page, higher_buddy, order + 1);
 }
 
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	int i;
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+
+	for (i = 0; i < (1 << order); i++) {
+		if (PageUnaccepted(page + i))
+			__ClearPageUnaccepted(page + i);
+	}
+}
+
+static bool page_is_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+
+	return memory_is_unaccepted(start, start + (PAGE_SIZE << order));
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -1058,6 +1084,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_pfn;
 	struct page *buddy;
 	bool to_tail;
+	bool unaccepted = PageUnaccepted(page);
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1089,6 +1116,11 @@ static inline void __free_one_page(struct page *page,
 			clear_page_guard(zone, buddy, order, migratetype);
 		else
 			del_page_from_free_list(buddy, zone, order);
+
+		/* Mark page unaccepted if any of merged pages were unaccepted */
+		if (PageUnaccepted(buddy))
+			unaccepted = true;
+
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -1124,6 +1156,17 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_buddy_order(page, order);
 
+	/*
+	 * Check if the page needs to be marked as PageUnaccepted().
+	 * Used for the new pages added to the buddy allocator for the first
+	 * time.
+	 */
+	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
+		unaccepted = page_is_unaccepted(page, order);
+
+	if (unaccepted)
+		__SetPageUnaccepted(page);
+
 	if (fpi_flags & FPI_TO_TAIL)
 		to_tail = true;
 	else if (is_shuffle_order(order))
@@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
+	    !PageUnaccepted(page))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
@@ -1654,7 +1698,8 @@ void __free_pages_core(struct page *page, unsigned int order)
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
 	 */
-	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
+	__free_pages_ok(page, order,
+			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | FPI_UNACCEPTED);
 }
 
 #ifdef CONFIG_NUMA
@@ -1807,6 +1852,7 @@ static void __init deferred_free_range(unsigned long pfn,
 		return;
 	}
 
+	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -2266,6 +2312,10 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
+		/* Transfer PageUnaccepted() to the newly split pages */
+		if (PageUnaccepted(page))
+			__SetPageUnaccepted(&page[size]);
+
 		add_to_free_list(&page[size], zone, high, migratetype);
 		set_buddy_order(&page[size], high);
 	}
@@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	 */
 	kernel_unpoison_pages(page, 1 << order);
 
+	if (PageUnaccepted(page))
+		accept_page(page, order);
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * KASAN unpoisoning and memory initializion code must be
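
The mm/internal.h stubs in the patch above cover only the
!CONFIG_UNACCEPTED_MEMORY case. For orientation, a hedged sketch of what
a bitmap-backed implementation of the two helpers might look like when
the feature is enabled. The 2M granularity, the unaccepted_bitmap and
arch_accept_memory() names, and the locking are all assumptions for
illustration, not this series' actual code:

	#include <linux/bitmap.h>
	#include <linux/kernel.h>
	#include <linux/spinlock.h>

	#define UNACCEPTED_UNIT		(2UL << 20)	/* assumed: 2M chunks */

	static unsigned long *unaccepted_bitmap;	/* assumed: handed over by the EFI stub */
	static unsigned long unaccepted_bits;		/* number of tracked chunks */
	static DEFINE_SPINLOCK(unaccepted_lock);

	/* hypothetical hook for the TDX/SEV-SNP acceptance protocol */
	void arch_accept_memory(phys_addr_t start, phys_addr_t end);

	bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end)
	{
		unsigned long first = start / UNACCEPTED_UNIT;
		unsigned long last = min(unaccepted_bits,
				(unsigned long)DIV_ROUND_UP(end, UNACCEPTED_UNIT));

		/* unaccepted if any chunk overlapping [start, end) is still set */
		return find_next_bit(unaccepted_bitmap, last, first) < last;
	}

	void accept_memory(phys_addr_t start, phys_addr_t end)
	{
		unsigned long i;

		spin_lock(&unaccepted_lock);
		for (i = start / UNACCEPTED_UNIT;
		     i < unaccepted_bits && i * UNACCEPTED_UNIT < end; i++) {
			if (test_bit(i, unaccepted_bitmap)) {
				/* slow: traps to the VMM/firmware per chunk */
				arch_accept_memory(i * UNACCEPTED_UNIT,
						   (i + 1) * UNACCEPTED_UNIT);
				clear_bit(i, unaccepted_bitmap);
			}
		}
		spin_unlock(&unaccepted_lock);
	}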