[PATCHv2,0/7] Implement support for unaccepted memory

Message ID 20220111113314.27173-1-kirill.shutemov@linux.intel.com

Message

Kirill A. Shutemov Jan. 11, 2022, 11:33 a.m. UTC
UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: this lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 would have to be modified on every page acceptance. That
leads to table fragmentation, and there is only a limited number of
entries in the e820 table.

Another option is to mark such memory as usable in e820 and track whether
the range has been accepted in a bitmap. One bit in the bitmap represents
2MiB in the address space: one 4k page is enough to track 64GiB of
physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of address
space.

Any unaccepted memory that is not aligned to 2M gets accepted upfront.
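
For a rough feel of the numbers, here is a sketch (not code from the
series) of the bitmap-size arithmetic, with one bit covering PMD_SIZE
(2MiB) of physical address space:

	/* Sketch only: bytes of bitmap needed to cover max_phys_addr. */
	static inline u64 unaccepted_bitmap_bytes(u64 max_phys_addr)
	{
		u64 bits = DIV_ROUND_UP(max_phys_addr, PMD_SIZE);

		return DIV_ROUND_UP(bits, BITS_PER_BYTE);
	}

	/* 64GiB -> 4096 bytes (one 4k page); 4PiB -> 256MiB */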

The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for a 4G TDX VM and ~4x faster for a 64G one.

Patches 1-6/7 are generic and don't have any dependencies on TDX. They
should serve AMD SEV needs as well. TDX-specific code is isolated in the
last patch, which requires the core TDX patchset that is currently
under review.

Kirill A. Shutemov (7):
  mm: Add support for unaccepted memory
  efi/x86: Get full memory map in allocate_e820()
  efi/x86: Implement support for unaccepted memory
  x86/boot/compressed: Handle unaccepted memory
  x86/mm: Reserve unaccepted memory bitmap
  x86/mm: Provide helpers for unaccepted memory
  x86/tdx: Unaccepted memory support

 Documentation/x86/zero-page.rst              |  1 +
 arch/x86/Kconfig                             |  1 +
 arch/x86/boot/compressed/Makefile            |  1 +
 arch/x86/boot/compressed/bitmap.c            | 86 ++++++++++++++++++
 arch/x86/boot/compressed/kaslr.c             | 14 ++-
 arch/x86/boot/compressed/misc.c              |  9 ++
 arch/x86/boot/compressed/tdx.c               | 67 ++++++++++++++
 arch/x86/boot/compressed/unaccepted_memory.c | 64 +++++++++++++
 arch/x86/include/asm/page.h                  |  5 ++
 arch/x86/include/asm/tdx.h                   |  2 +
 arch/x86/include/asm/unaccepted_memory.h     | 17 ++++
 arch/x86/include/uapi/asm/bootparam.h        |  3 +-
 arch/x86/kernel/e820.c                       | 10 +++
 arch/x86/kernel/tdx.c                        |  7 ++
 arch/x86/mm/Makefile                         |  2 +
 arch/x86/mm/unaccepted_memory.c              | 94 ++++++++++++++++++++
 drivers/firmware/efi/Kconfig                 | 14 +++
 drivers/firmware/efi/efi.c                   |  1 +
 drivers/firmware/efi/libstub/x86-stub.c      | 86 ++++++++++++++----
 include/linux/efi.h                          |  3 +-
 include/linux/page-flags.h                   |  4 +
 mm/internal.h                                | 15 ++++
 mm/memblock.c                                |  1 +
 mm/page_alloc.c                              | 21 ++++-
 24 files changed, 508 insertions(+), 20 deletions(-)
 create mode 100644 arch/x86/boot/compressed/bitmap.c
 create mode 100644 arch/x86/boot/compressed/unaccepted_memory.c
 create mode 100644 arch/x86/include/asm/unaccepted_memory.h
 create mode 100644 arch/x86/mm/unaccepted_memory.c

Comments

Dave Hansen Jan. 11, 2022, 7:46 p.m. UTC | #1
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 1018e50566f3..6dfa594192de 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>   		 */
>   		kmemleak_alloc_phys(found, size, 0, 0);
>   
> +	accept_memory(found, found + size);
>   	return found;
>   }

This could use a comment.

Looking at this, I also have to wonder if accept_memory() is a bit too 
generic.  Should it perhaps be: cc_accept_memory() or 
cc_guest_accept_memory()?

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5952749ad40..5707b4b5f774 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page,
>   	unsigned int max_order;
>   	struct page *buddy;
>   	bool to_tail;
> +	bool offline = PageOffline(page);
>   
>   	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
>   
> @@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page,
>   			clear_page_guard(zone, buddy, order, migratetype);
>   		else
>   			del_page_from_free_list(buddy, zone, order);
> +
> +		if (PageOffline(buddy))
> +			offline = true;
> +
>   		combined_pfn = buddy_pfn & pfn;
>   		page = page + (combined_pfn - pfn);
>   		pfn = combined_pfn;
> @@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page,
>   done_merging:
>   	set_buddy_order(page, order);
>   
> +	if (offline)
> +		__SetPageOffline(page);
> +
>   	if (fpi_flags & FPI_TO_TAIL)
>   		to_tail = true;
>   	else if (is_shuffle_order(order))

This is touching some pretty hot code paths.  You mention both that 
accepting memory is slow and expensive, yet you're doing it in the core 
allocator.

That needs at least some discussion in the changelog.

> @@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page,
>   static inline bool page_expected_state(struct page *page,
>   					unsigned long check_flags)
>   {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> +	    !PageOffline(page))
>   		return false;

Looking at stuff like this, I can't help but think that a:

	#define PageOffline PageUnaccepted

and some other renaming would be a fine idea.  I get that the Offline 
bit can be reused, but I'm not sure that the "Offline" *naming* should 
be reused.  What you're doing here is logically distinct from existing 
offlining.
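
As a sketch of that idea (not what the series currently does), the same
page type bit could be reused under a distinct name, e.g. next to the
existing PAGE_TYPE_OPS() definitions in include/linux/page-flags.h:

	/* Sketch only: same bit as PageOffline(), distinct name. */
	#define PageUnaccepted(page)		PageOffline(page)
	#define __SetPageUnaccepted(page)	__SetPageOffline(page)
	#define __ClearPageUnaccepted(page)	__ClearPageOffline(page)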

>   	if (unlikely((unsigned long)page->mapping |
> @@ -1734,6 +1743,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
>   {
>   	if (early_page_uninitialised(pfn))
>   		return;
> +
> +	maybe_set_page_offline(page, order);
>   	__free_pages_core(page, order);
>   }
>   
> @@ -1823,10 +1834,12 @@ static void __init deferred_free_range(unsigned long pfn,
>   	if (nr_pages == pageblock_nr_pages &&
>   	    (pfn & (pageblock_nr_pages - 1)) == 0) {
>   		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +		maybe_set_page_offline(page, pageblock_order);
>   		__free_pages_core(page, pageblock_order);
>   		return;
>   	}
>   
> +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
>   	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>   		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>   			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -2297,6 +2310,9 @@ static inline void expand(struct zone *zone, struct page *page,
>   		if (set_page_guard(zone, &page[size], high, migratetype))
>   			continue;
>   
> +		if (PageOffline(page))
> +			__SetPageOffline(&page[size]);

Yeah, this is really begging for comments.  Please add some.

>   		add_to_free_list(&page[size], zone, high, migratetype);
>   		set_buddy_order(&page[size], high);
>   	}
> @@ -2393,6 +2409,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>   	 */
>   	kernel_unpoison_pages(page, 1 << order);
>   
> +	if (PageOffline(page))
> +		accept_and_clear_page_offline(page, order);
> +
>   	/*
>   	 * As memory initialization might be integrated into KASAN,
>   	 * kasan_alloc_pages and kernel_init_free_pages must be

I guess once there are no more PageOffline() pages in the allocator, the 
only impact from these patches will be a bunch of conditional branches 
from the "if (PageOffline(page))" that always have the same result.  The 
branch predictors should do a good job with that.

*BUT*, that overhead is going to be universally inflicted on all users 
on x86, even those without TDX.  I guess the compiler will save non-x86 
users because they'll have an empty stub for 
accept_and_clear_page_offline() which the compiler will optimize away.

It sure would be nice to have some changelog material about why this is 
OK, though.  This is especially true since there's a global spinlock 
hidden in accept_and_clear_page_offline() wrapping a slow and "costly" 
operation.
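
The stub arrangement being described would presumably look something like
this (a sketch; the exact header layout in the series may differ):

	#ifdef CONFIG_UNACCEPTED_MEMORY
	void accept_memory(phys_addr_t start, phys_addr_t end);
	void accept_and_clear_page_offline(struct page *page, unsigned int order);
	#else
	/* Empty stubs that the compiler can optimize away entirely. */
	static inline void accept_memory(phys_addr_t start, phys_addr_t end) { }
	static inline void accept_and_clear_page_offline(struct page *page,
							 unsigned int order) { }
	#endif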
David Hildenbrand Jan. 12, 2022, 11:31 a.m. UTC | #2
> 
> Looking at stuff like this, I can't help but think that a:
> 
> 	#define PageOffline PageUnaccepted
> 
> and some other renaming would be a fine idea.  I get that the Offline 
> bit can be reused, but I'm not sure that the "Offline" *naming* should 
> be reused.  What you're doing here is logically distinct from existing 
> offlining.

Yes, or using a new pagetype bit to make the distinction clearer.
Especially the function names like maybe_set_page_offline() et al. are
confusing IMHO. They are all about accepting unaccepted memory ... and
should express that.

I assume PageOffline() will be set only on the first sub-page of a
high-order PageBuddy() page, correct?

Then we'll have to monitor all PageOffline() users such that they can
actually deal with PageBuddy() pages spanning *multiple* base pages for
a PageBuddy() page. For now it's clear that if a page is PageOffline(),
it cannot be PageBuddy() and cannot span more than one base page.

E.g., fs/proc/kcore.c:read_kcore() assumes that PageOffline() is set on
individual base pages.
Kirill A. Shutemov Jan. 12, 2022, 6:30 p.m. UTC | #3
On Tue, Jan 11, 2022 at 11:46:37AM -0800, Dave Hansen wrote:
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 1018e50566f3..6dfa594192de 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> >   		 */
> >   		kmemleak_alloc_phys(found, size, 0, 0);
> > +	accept_memory(found, found + size);
> >   	return found;
> >   }
> 
> This could use a comment.

How about this:

	/*
	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
	 * requiring memory to be accepted before it can be used by the
	 * guest.
	 *
	 * Accept the memory of the allocated buffer.
	 */
> 
> Looking at this, I also have to wonder if accept_memory() is a bit too
> generic.  Should it perhaps be: cc_accept_memory() or
> cc_guest_accept_memory()?

I'll rename accept_memory() to cc_accept_memory() and
accept_and_clear_page_offline() to cc_accept_and_clear_page_offline().

> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c5952749ad40..5707b4b5f774 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page,
> >   	unsigned int max_order;
> >   	struct page *buddy;
> >   	bool to_tail;
> > +	bool offline = PageOffline(page);
> >   	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
> > @@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page,
> >   			clear_page_guard(zone, buddy, order, migratetype);
> >   		else
> >   			del_page_from_free_list(buddy, zone, order);
> > +
> > +		if (PageOffline(buddy))
> > +			offline = true;
> > +
> >   		combined_pfn = buddy_pfn & pfn;
> >   		page = page + (combined_pfn - pfn);
> >   		pfn = combined_pfn;
> > @@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page,
> >   done_merging:
> >   	set_buddy_order(page, order);
> > +	if (offline)
> > +		__SetPageOffline(page);
> > +

I'll add

	/* Mark page PageOffline() if any merged page was PageOffline() */

above the 'if'.

> >   	if (fpi_flags & FPI_TO_TAIL)
> >   		to_tail = true;
> >   	else if (is_shuffle_order(order))
> 
> This is touching some pretty hot code paths.  You mention both that
> accepting memory is slow and expensive, yet you're doing it in the core
> allocator.
> 
> That needs at least some discussion in the changelog.

That is a page type transfer on page merging. What is expensive here?
The cachelines with both struct pages are hot already.

> > @@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page,
> >   static inline bool page_expected_state(struct page *page,
> >   					unsigned long check_flags)
> >   {
> > -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > +	    !PageOffline(page))
> >   		return false;
> 
> Looking at stuff like this, I can't help but think that a:
> 
> 	#define PageOffline PageUnaccepted
> 
> and some other renaming would be a fine idea.  I get that the Offline bit
> can be reused, but I'm not sure that the "Offline" *naming* should be
> reused.  What you're doing here is logically distinct from existing
> offlining.

I find the Offline name fitting. In both cases page is not accessible
without additional preparation.

Why do you want to multiply entities?

> >   	if (unlikely((unsigned long)page->mapping |
> > @@ -1734,6 +1743,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
> >   {
> >   	if (early_page_uninitialised(pfn))
> >   		return;
> > +
> > +	maybe_set_page_offline(page, order);
> >   	__free_pages_core(page, order);
> >   }
> > @@ -1823,10 +1834,12 @@ static void __init deferred_free_range(unsigned long pfn,
> >   	if (nr_pages == pageblock_nr_pages &&
> >   	    (pfn & (pageblock_nr_pages - 1)) == 0) {
> >   		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > +		maybe_set_page_offline(page, pageblock_order);
> >   		__free_pages_core(page, pageblock_order);
> >   		return;
> >   	}
> > +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
> >   	for (i = 0; i < nr_pages; i++, page++, pfn++) {
> >   		if ((pfn & (pageblock_nr_pages - 1)) == 0)
> >   			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > @@ -2297,6 +2310,9 @@ static inline void expand(struct zone *zone, struct page *page,
> >   		if (set_page_guard(zone, &page[size], high, migratetype))
> >   			continue;
> > +		if (PageOffline(page))
> > +			__SetPageOffline(&page[size]);
> 
> Yeah, this is really begging for comments.  Please add some.

I'll add
		/* Transfer PageOffline() to newly split pages */
> 
> >   		add_to_free_list(&page[size], zone, high, migratetype);
> >   		set_buddy_order(&page[size], high);
> >   	}
> > @@ -2393,6 +2409,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >   	 */
> >   	kernel_unpoison_pages(page, 1 << order);
> > +	if (PageOffline(page))
> > +		accept_and_clear_page_offline(page, order);
> > +
> >   	/*
> >   	 * As memory initialization might be integrated into KASAN,
> >   	 * kasan_alloc_pages and kernel_init_free_pages must be
> 
> I guess once there are no more PageOffline() pages in the allocator, the
> only impact from these patches will be a bunch of conditional branches from
> the "if (PageOffline(page))" that always have the same result.  The branch
> predictors should do a good job with that.
> 
> *BUT*, that overhead is going to be universally inflicted on all users on
> x86, even those without TDX.  I guess the compiler will save non-x86 users
> because they'll have an empty stub for accept_and_clear_page_offline() which
> the compiler will optimize away.
> 
> It sure would be nice to have some changelog material about why this is OK,
> though.  This is especially true since there's a global spinlock hidden in
> accept_and_clear_page_offline() wrapping a slow and "costly" operation.

Okay, I will come up with an explanation in commit message.
Dave Hansen Jan. 12, 2022, 6:40 p.m. UTC | #4
On 1/12/22 10:30, Kirill A. Shutemov wrote:
> On Tue, Jan 11, 2022 at 11:46:37AM -0800, Dave Hansen wrote:
>>> diff --git a/mm/memblock.c b/mm/memblock.c
>>> index 1018e50566f3..6dfa594192de 100644
>>> --- a/mm/memblock.c
>>> +++ b/mm/memblock.c
>>> @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>>>    		 */
>>>    		kmemleak_alloc_phys(found, size, 0, 0);
>>> +	accept_memory(found, found + size);
>>>    	return found;
>>>    }
>>
>> This could use a comment.
> 
> How about this:
> 
> 	/*
> 	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> 	 * requiring memory to be accepted before it can be used by the
> 	 * guest.
> 	 *
> 	 * Accept the memory of the allocated buffer.
> 	 */

I think a one-liner that might cue the reader to go look at 
accept_memory() itself would be fine.  Maybe:

	/* Make the memblock usable when running in picky VM guests: */

That implies that the memory isn't usable without doing this and also 
points out that it's related to running in a guest.

>> Looking at this, I also have to wonder if accept_memory() is a bit too
>> generic.  Should it perhaps be: cc_accept_memory() or
>> cc_guest_accept_memory()?
> 
> I'll rename accept_memory() to cc_accept_memory() and
> accept_and_clear_page_offline() to cc_accept_and_clear_page_offline().
> 
>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index c5952749ad40..5707b4b5f774 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page,
>>>    	unsigned int max_order;
>>>    	struct page *buddy;
>>>    	bool to_tail;
>>> +	bool offline = PageOffline(page);
>>>    	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
>>> @@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page,
>>>    			clear_page_guard(zone, buddy, order, migratetype);
>>>    		else
>>>    			del_page_from_free_list(buddy, zone, order);
>>> +
>>> +		if (PageOffline(buddy))
>>> +			offline = true;
>>> +
>>>    		combined_pfn = buddy_pfn & pfn;
>>>    		page = page + (combined_pfn - pfn);
>>>    		pfn = combined_pfn;
>>> @@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page,
>>>    done_merging:
>>>    	set_buddy_order(page, order);
>>> +	if (offline)
>>> +		__SetPageOffline(page);
>>> +
> 
> I'll add
> 
> 	/* Mark page PageOffline() if any merged page was PageOffline() */
> 
> above the 'if'.
> 
>>>    	if (fpi_flags & FPI_TO_TAIL)
>>>    		to_tail = true;
>>>    	else if (is_shuffle_order(order))
>>
>> This is touching some pretty hot code paths.  You mention both that
>> accepting memory is slow and expensive, yet you're doing it in the core
>> allocator.
>>
>> That needs at least some discussion in the changelog.
> 
> That is a page type transfer on page merging. What is expensive here?
> The cachelines with both struct pages are hot already.

I meant that comment generically rather than at this specific hunk.

Just in general, I think this series needs to acknowledge that it is 
touching very core parts of the allocator and might make page allocation 
*MASSIVELY* slower, albeit temporarily.

>>> @@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page,
>>>    static inline bool page_expected_state(struct page *page,
>>>    					unsigned long check_flags)
>>>    {
>>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
>>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
>>> +	    !PageOffline(page))
>>>    		return false;
>>
>> Looking at stuff like this, I can't help but think that a:
>>
>> 	#define PageOffline PageUnaccepted
>>
>> and some other renaming would be a fine idea.  I get that the Offline bit
>> can be reused, but I'm not sure that the "Offline" *naming* should be
>> reused.  What you're doing here is logically distinct from existing
>> offlining.
> 
> I find the Offline name fitting. In both cases page is not accessible
> without additional preparation.
> 
> Why do you want to multiply entities?

The name wouldn't be bad *if* there was no other use of "Offline".  But, 
logically, your use of "Offline" and the existing use of "Offline" are 
different things.  They are totally orthogonal areas of the code.  They 
should have different names.

Again, I'm fine with using the same _bit_ in page->flags.  But, the two 
logical uses need two different names.
Kirill A. Shutemov Jan. 12, 2022, 7:15 p.m. UTC | #5
On Wed, Jan 12, 2022 at 12:31:10PM +0100, David Hildenbrand wrote:
> 
> > 
> > Looking at stuff like this, I can't help but think that a:
> > 
> > 	#define PageOffline PageUnaccepted
> > 
> > and some other renaming would be a fine idea.  I get that the Offline 
> > bit can be reused, but I'm not sure that the "Offline" *naming* should 
> > be reused.  What you're doing here is logically distinct from existing 
> > offlining.
> 
> Yes, or using a new pagetype bit to make the distinction clearer.
> Especially the function names like maybe_set_page_offline() et al. are
> confusing IMHO. They are all about accepting unaccepted memory ... and
> should express that.

"Unaccepted" is UEFI treminology and I'm not sure we want to expose
core-mm to it. Power/S390/ARM may have a different name for the same
concept. Offline/online is neutral terminology, familiar to MM developers.

What if I change accept->online in function names and document the meaning
properly?

> I assume PageOffline() will be set only on the first sub-page of a
> high-order PageBuddy() page, correct?
> 
> Then we'll have to monitor all PageOffline() users such that they can
> actually deal with PageBuddy() pages spanning *multiple* base pages for
> a PageBuddy() page. For now it's clear that if a page is PageOffline(),
> it cannot be PageBuddy() and cannot span more than one base page.

> E.g., fs/proc/kcore.c:read_kcore() assumes that PageOffline() is set on
> individual base pages.

Right, pages that are offline from the hotplug POV are never on the page allocator's
free lists, so it cannot ever step on them.
Kirill A. Shutemov Jan. 12, 2022, 7:29 p.m. UTC | #6
On Tue, Jan 11, 2022 at 09:17:19AM -0800, Dave Hansen wrote:
> On 1/11/22 03:33, Kirill A. Shutemov wrote:
> ...
> > +void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
> > +{
> > +	/*
> > +	 * The accepted memory bitmap only works at PMD_SIZE granularity.
> > +	 * If a request comes in to mark memory as unaccepted which is not
> > +	 * PMD_SIZE-aligned, simply accept the memory now since it can not be
> > +	 * *marked* as unaccepted.
> > +	 */
> > +
> > +	/* Immediately accept whole range if it is within a PMD_SIZE block: */
> > +	if ((start & PMD_MASK) == (end & PMD_MASK)) {
> > +		npages = (end - start) / PAGE_SIZE;
> > +		__accept_memory(start, start + npages * PAGE_SIZE);
> > +		return;
> > +	}
> 
> I still don't quite like how this turned out.  It's still a bit unclear to
> the reader that this has covered all the corner cases.  I think this needs a
> better comment:
> 
> 	/*
> 	 * Handle <PMD_SIZE blocks that do not end at a PMD boundary.
> 	 *
> 	 * Immediately accept the whole block.  This handles the case
> 	 * where the below round_{up,down}() would "lose" a small,
> 	 * <PMD_SIZE block.
> 	 */
> 	if ((start & PMD_MASK) == (end & PMD_MASK)) {
> 		...
> 		return;
> 	}
> 
> 	/*
> 	 * There is at least one more block to accept.  Both 'start'
> 	 * and 'end' may not be PMD-aligned.
> 	 */

Okay, looks better. Thanks.

> > +	/* Immediately accept a <PMD_SIZE piece at the start: */
> > +	if (start & ~PMD_MASK) {
> > +		__accept_memory(start, round_up(start, PMD_SIZE));
> > +		start = round_up(start, PMD_SIZE);
> > +	}
> > +
> > +	/* Immediately accept a <PMD_SIZE piece at the end: */
> > +	if (end & ~PMD_MASK) {
> > +		__accept_memory(round_down(end, PMD_SIZE), end);
> > +		end = round_down(end, PMD_SIZE);
> > +	}
> 
> 	/*
> 	 * 'start' and 'end' are now both PMD-aligned.
> 	 * Record the range as being unaccepted:
> 	 */

Okay.

> > +	if (start == end)
> > +		return;
> 
> Does bitmap_set() not accept zero-sized 'len' arguments?

Looks like it does. Will drop this.

> > +	bitmap_set((unsigned long *)params->unaccepted_memory,
> > +		   start / PMD_SIZE, (end - start) / PMD_SIZE);
> > +}
> 
> The code you have there is _precise_.  It will never eagerly accept any area
> that _can_ be represented in the bitmap.  But, that's kinda hard to
> describe.  Maybe we should be a bit more sloppy about accepting things up
> front to make it easier to describe:
> 
> 	/*
> 	 * Accept small regions that might not be
> 	 * able to be represented in the bitmap:
> 	 */
> 	if (end - start < PMD_SIZE*2) {
> 		npages = (end - start) / PAGE_SIZE;
> 		__accept_memory(start, start + npages * PAGE_SIZE);
> 		return;
> 	}
> 
> 	/*
> 	 * No matter how the start and end are aligned, at
> 	 * least one unaccepted PMD_SIZE area will remain.
> 	 */
> 
> 	... now do the start/end rounding
> 
> That has the downside of accepting a few things that it doesn't *HAVE* to
> accept.  But, its behavior is very easy to describe.

Hm. Okay. I will give it a try. I like how it is now, but maybe it will be
better.
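
For reference, a sketch of how mark_unaccepted() could look with that
change (reusing __accept_memory() and the rounding from the current
version; the final patch may differ):

	void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
	{
		/*
		 * Accept small regions up front: anything shorter than
		 * 2 * PMD_SIZE might not contain a whole PMD_SIZE block
		 * that can be represented in the bitmap.
		 */
		if (end - start < 2 * PMD_SIZE) {
			__accept_memory(start, end);
			return;
		}

		/*
		 * At least one unaccepted PMD_SIZE block remains. Accept
		 * the unaligned edges now and record the PMD-aligned
		 * middle in the bitmap.
		 */
		if (start & ~PMD_MASK) {
			__accept_memory(start, round_up(start, PMD_SIZE));
			start = round_up(start, PMD_SIZE);
		}

		if (end & ~PMD_MASK) {
			__accept_memory(round_down(end, PMD_SIZE), end);
			end = round_down(end, PMD_SIZE);
		}

		bitmap_set((unsigned long *)params->unaccepted_memory,
			   start / PMD_SIZE, (end - start) / PMD_SIZE);
	}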

> 
> > diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> > new file mode 100644
> > index 000000000000..cbc24040b853
> > --- /dev/null
> > +++ b/arch/x86/include/asm/unaccepted_memory.h
> > @@ -0,0 +1,12 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/* Copyright (C) 2020 Intel Corporation */
> > +#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
> > +#define _ASM_X86_UNACCEPTED_MEMORY_H
> > +
> > +#include <linux/types.h>
> > +
> > +struct boot_params;
> > +
> > +void mark_unaccepted(struct boot_params *params, u64 start, u64 num);
> > +
> > +#endif
> > diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
> > index b25d3f82c2f3..16bc686a198d 100644
> > --- a/arch/x86/include/uapi/asm/bootparam.h
> > +++ b/arch/x86/include/uapi/asm/bootparam.h
> > @@ -217,7 +217,8 @@ struct boot_params {
> >   	struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE]; /* 0x2d0 */
> >   	__u8  _pad8[48];				/* 0xcd0 */
> >   	struct edd_info eddbuf[EDDMAXNR];		/* 0xd00 */
> > -	__u8  _pad9[276];				/* 0xeec */
> > +	__u64 unaccepted_memory;			/* 0xeec */
> > +	__u8  _pad9[268];				/* 0xef4 */
> >   } __attribute__((packed));
> >   /**
> > diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> > index 2c3dac5ecb36..36c1bf33f112 100644
> > --- a/drivers/firmware/efi/Kconfig
> > +++ b/drivers/firmware/efi/Kconfig
> > @@ -243,6 +243,20 @@ config EFI_DISABLE_PCI_DMA
> >   	  options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma"
> >   	  may be used to override this option.
> > +config UNACCEPTED_MEMORY
> > +	bool
> > +	depends on EFI_STUB
> > +	help
> > +	   Some Virtual Machine platforms, such as Intel TDX, introduce
> > +	   the concept of memory acceptance, requiring memory to be accepted
> > +	   before it can be used by the guest. This protects against a class of
> > +	   attacks by the virtual machine platform.
> 
> 	Some Virtual Machine platforms, such as Intel TDX, require
> 	some memory to be "accepted" by the guest before it can be used.
> 	This requirement protects against a class of attacks by the
> 	virtual machine platform.
> 
> Can we make this "class of attacks" a bit more concrete?  Maybe:
> 
> 	This mechanism helps prevent malicious hosts from making changes
> 	to guest memory.
> 
> ??

Okay.

> > +	   UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> > +
> > +	   This option adds support for unaccepted memory and makes such memory
> > +	   usable by kernel.
> > +
> >   endmenu
> >   config EFI_EMBEDDED_FIRMWARE
> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index ae79c3300129..abe862c381b6 100644
> > --- a/drivers/firmware/efi/efi.c
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -740,6 +740,7 @@ static __initdata char memory_type_name[][13] = {
> >   	"MMIO Port",
> >   	"PAL Code",
> >   	"Persistent",
> > +	"Unaccepted",
> >   };
> >   char * __init efi_md_typeattr_format(char *buf, size_t size,
> > diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> > index a0b946182b5e..346b12d6f1b2 100644
> > --- a/drivers/firmware/efi/libstub/x86-stub.c
> > +++ b/drivers/firmware/efi/libstub/x86-stub.c
> > @@ -9,12 +9,14 @@
> >   #include <linux/efi.h>
> >   #include <linux/pci.h>
> >   #include <linux/stddef.h>
> > +#include <linux/bitmap.h>
> >   #include <asm/efi.h>
> >   #include <asm/e820/types.h>
> >   #include <asm/setup.h>
> >   #include <asm/desc.h>
> >   #include <asm/boot.h>
> > +#include <asm/unaccepted_memory.h>
> >   #include "efistub.h"
> > @@ -504,6 +506,13 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
> >   			e820_type = E820_TYPE_PMEM;
> >   			break;
> > +		case EFI_UNACCEPTED_MEMORY:
> > +			if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> > +				continue;
> > +			e820_type = E820_TYPE_RAM;
> > +			mark_unaccepted(params, d->phys_addr,
> > +					d->phys_addr + PAGE_SIZE * d->num_pages);
> > +			break;
> >   		default:
> >   			continue;
> >   		}
> > @@ -575,6 +584,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
> >   {
> >   	efi_status_t status;
> >   	__u32 nr_desc;
> > +	bool unaccepted_memory_present = false;
> > +	u64 max_addr = 0;
> > +	int i;
> >   	status = efi_get_memory_map(map);
> >   	if (status != EFI_SUCCESS)
> > @@ -589,9 +601,55 @@ static efi_status_t allocate_e820(struct boot_params *params,
> >   		if (status != EFI_SUCCESS)
> >   			goto out;
> >   	}
> > +
> > +	if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
> > +		goto out;
> > +
> > +	/* Check if there's any unaccepted memory and find the max address */
> > +	for (i = 0; i < nr_desc; i++) {
> > +		efi_memory_desc_t *d;
> > +
> > +		d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i);
> > +		if (d->type == EFI_UNACCEPTED_MEMORY)
> > +			unaccepted_memory_present = true;
> > +		if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> > +			max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> > +	}
> > +
> > +	/*
> > +	 * If unaccepted memory present allocate a bitmap to track what memory
> 
> 			       ^ is
> 
> > +	 * has to be accepted before access.
> > +	 *
> > +	 * One bit in the bitmap represents 2MiB in the address space: one 4k
> > +	 * page is enough to track 64GiB or physical address space.
> 
> That's a bit awkward and needs a "or->of".  Perhaps:
> 
> 	* One bit in the bitmap represents 2MiB in the address space:
> 	* A 4k bitmap can track 64GiB of physical address space.

Okay.

> 
> > +	 * In the worst case scenario -- a huge hole in the middle of the
> > +	 * address space -- It needs 256MiB to handle 4PiB of the address
> > +	 * space.
> > +	 *
> > +	 * TODO: handle situation if params->unaccepted_memory has already set.
> > +	 * It's required to deal with kexec.
> 
> What happens today with kexec() since it's not dealt with?

I didn't give it a try, but I assume it will hang.

There are more things to do to make kexec working and safe. We will get
there, but it is not top priority.
Dave Hansen Jan. 12, 2022, 7:35 p.m. UTC | #7
On 1/12/22 11:29 AM, Kirill A. Shutemov wrote:
>>> +	 * In the worst case scenario -- a huge hole in the middle of the
>>> +	 * address space -- It needs 256MiB to handle 4PiB of the address
>>> +	 * space.
>>> +	 *
>>> +	 * TODO: handle situation if params->unaccepted_memory has already set.
>>> +	 * It's required to deal with kexec.
>> What happens today with kexec() since it's not dealt with?
> I didn't give it a try, but I assume it will hang.
> 
> There are more things to do to make kexec working and safe. We will get
> there, but it is not top priority.

Well, if we know it's broken, shouldn't we at least turn kexec off?

It would be dirt simple to do in Kconfig.  As would setting:

	kexec_load_disabled = true;

which would probably also do the trick.  That's from three seconds of
looking.  I'm sure you can come up with something better.
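
A sketch of that second variant (hypothetical placement, somewhere in
early guest setup; assumes kexec_load_disabled from kernel/kexec_core.c
is reachable there):

	/*
	 * Sketch only: until the bitmap can be handed over across kexec,
	 * refuse to load a new kernel rather than hang in it.
	 */
	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && boot_params.unaccepted_memory)
		kexec_load_disabled = true;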
Kirill A. Shutemov Jan. 12, 2022, 7:43 p.m. UTC | #8
On Tue, Jan 11, 2022 at 11:10:40AM -0800, Dave Hansen wrote:
> On 1/11/22 03:33, Kirill A. Shutemov wrote:
> > Unaccepted memory bitmap is allocated during decompression stage and
> > handed over to main kernel image via boot_params. The bitmap is used to
> > track if memory has been accepted.
> > 
> > Reserve unaccepted memory bitmap has to prevent reallocating memory for
> > other means.
> 
> I'm having a hard time parsing that changelog, especially the second
> paragraph.  Could you give it another shot?

What about this:

	Unaccepted memory bitmap is allocated during decompression stage and
	handed over to main kernel image via boot_params.

	Kernel tracks what memory has been accepted in the bitmap.

	Reserve memory where the bitmap is placed to prevent memblock from
	re-allocating the memory for other needs.

?

> > +	/* Mark unaccepted memory bitmap reserved */
> > +	if (boot_params.unaccepted_memory) {
> > +		unsigned long size;
> > +
> > +		/* One bit per 2MB */
> > +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > +				    PMD_SIZE * BITS_PER_BYTE);
> > +		memblock_reserve(boot_params.unaccepted_memory, size);
> > +	}
> 
> Is it OK that the size of the bitmap is inferred from
> e820__end_of_ram_pfn()?  Is this OK in the presence of mem= and other things
> that muck with the e820?

Good question. I think we are fine. If kernel is not able to allocate
memory from a part of physical address space we don't need the bitmap for
it either.
Kirill A. Shutemov Jan. 12, 2022, 7:43 p.m. UTC | #9
On Tue, Jan 11, 2022 at 12:01:56PM -0800, Dave Hansen wrote:
> On 1/11/22 03:33, Kirill A. Shutemov wrote:
> > Core-mm requires few helpers to support unaccepted memory:
> > 
> >   - accept_memory() checks the range of addresses against the bitmap and
> >     accept memory if needed;
> > 
> >   - maybe_set_page_offline() checks the bitmap and marks a page with
> >     PageOffline() if memory acceptance required on the first
> >     allocation of the page.
> > 
> >   - accept_and_clear_page_offline() accepts memory for the page and clears
> >     PageOffline().
> > 
> ...
> > +void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +	unsigned long flags;
> > +	if (!boot_params.unaccepted_memory)
> > +		return;
> > +
> > +	spin_lock_irqsave(&unaccepted_memory_lock, flags);
> > +	__accept_memory(start, end);
> > +	spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> > +}
> 
> Not a big deal, but please cc me on all the patches in the series.  This is
> called from the core mm patches which I wasn't cc'd on.
> 
> This also isn't obvious, but this introduces a new, global lock into the
> fast path of the page allocator and holds it for extended periods of time.
> It won't be taken any more once all memory is accepted, but you can sure bet
> that it will be noticeable until that happens.
> 
> *PLEASE* document this.  It needs changelog and probably code comments.

Okay, will do.
Dave Hansen Jan. 12, 2022, 7:53 p.m. UTC | #10
On 1/12/22 11:43 AM, Kirill A. Shutemov wrote:
> On Tue, Jan 11, 2022 at 11:10:40AM -0800, Dave Hansen wrote:
>> On 1/11/22 03:33, Kirill A. Shutemov wrote:
>>> Unaccepted memory bitmap is allocated during decompression stage and
>>> handed over to main kernel image via boot_params. The bitmap is used to
>>> track if memory has been accepted.
>>>
>>> Reserve unaccepted memory bitmap has to prevent reallocating memory for
>>> other means.
>>
>> I'm having a hard time parsing that changelog, especially the second
>> paragraph.  Could you give it another shot?
> 
> What about this:
> 
> 	Unaccepted memory bitmap is allocated during decompression stage and
> 	handed over to main kernel image via boot_params.
> 
> 	Kernel tracks what memory has been accepted in the bitmap.
> 
> 	Reserve memory where the bitmap is placed to prevent memblock from
> 	re-allocating the memory for other needs.
> 
> ?

Ahh, I get what you're trying to say now.  But, it still really lacks a
coherent problem statement.  How about this?

	== Problem ==

	A given page of memory can only be accepted once.  The kernel
	has a need to accept memory both in the early decompression
	stage and during normal runtime.

	== Solution ==

	Use a bitmap to communicate the acceptance state of each page
	between the decompression stage and normal runtime.  This
	eliminates the possibility of attempting to double-accept a
	page.

	== Details ==

	Allocate the bitmap during decompression stage and hand it over
	to the main kernel image via boot_params.

	In the runtime kernel, reserve the bitmap's memory to ensure
	nothing overwrites it.

>>> +	/* Mark unaccepted memory bitmap reserved */
>>> +	if (boot_params.unaccepted_memory) {
>>> +		unsigned long size;
>>> +
>>> +		/* One bit per 2MB */
>>> +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
>>> +				    PMD_SIZE * BITS_PER_BYTE);
>>> +		memblock_reserve(boot_params.unaccepted_memory, size);
>>> +	}
>>
>> Is it OK that the size of the bitmap is inferred from
>> e820__end_of_ram_pfn()?  Is this OK in the presence of mem= and other things
>> that muck with the e820?
> 
> Good question. I think we are fine. If kernel is not able to allocate
> memory from a part of physical address space we don't need the bitmap for
> it either.

That's a good point.  If the e820 range does a one-way shrink it's
probably fine.  The only problem would be if the bitmap had space for
stuff past e820__end_of_ram_pfn() *and* it later needed to be accepted.

Would it be worth recording the size of the reservation and then
double-checking against it in the bitmap operations?
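
Something like the below, presumably (a sketch; 'bitmap_size' is a
hypothetical variable, not in the series, mirroring the accept_memory()
helper quoted earlier):

	/* Sketch: remember how many bytes of bitmap were reserved... */
	static unsigned long bitmap_size;

	void accept_memory(phys_addr_t start, phys_addr_t end)
	{
		unsigned long flags;

		if (!boot_params.unaccepted_memory)
			return;

		/* ...and refuse to touch bits beyond the reserved area. */
		if (WARN_ON_ONCE(end > (u64)bitmap_size * BITS_PER_BYTE * PMD_SIZE))
			return;

		spin_lock_irqsave(&unaccepted_memory_lock, flags);
		__accept_memory(start, end);
		spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
	}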
Mike Rapoport Jan. 13, 2022, 7:42 a.m. UTC | #11
On Wed, Jan 12, 2022 at 10:40:53AM -0800, Dave Hansen wrote:
> On 1/12/22 10:30, Kirill A. Shutemov wrote:
> > On Tue, Jan 11, 2022 at 11:46:37AM -0800, Dave Hansen wrote:
> > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > index 1018e50566f3..6dfa594192de 100644
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> > > >    		 */
> > > >    		kmemleak_alloc_phys(found, size, 0, 0);
> > > > +	accept_memory(found, found + size);
> > > >    	return found;
> > > >    }
> > > 
> > > This could use a comment.
> > 
> > How about this:
> > 
> > 	/*
> > 	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > 	 * requiring memory to be accepted before it can be used by the
> > 	 * guest.
> > 	 *
> > 	 * Accept the memory of the allocated buffer.
> > 	 */
> 
> I think a one-liner that might cue the reader to go look at accept_memory()
> itself would be fine.  Maybe:
> 
> 	/* Make the memblock usable when running in picky VM guests: */

I'd s/memblock/found range/ or something like that, memblock is too vague
IMO
 
> That implies that the memory isn't usable without doing this and also points
> out that it's related to running in a guest.
David Hildenbrand Jan. 14, 2022, 1:22 p.m. UTC | #12
On 12.01.22 20:15, Kirill A. Shutemov wrote:
> On Wed, Jan 12, 2022 at 12:31:10PM +0100, David Hildenbrand wrote:
>>
>>>
>>> Looking at stuff like this, I can't help but think that a:
>>>
>>> 	#define PageOffline PageUnaccepted
>>>
>>> and some other renaming would be a fine idea.  I get that the Offline 
>>> bit can be reused, but I'm not sure that the "Offline" *naming* should 
>>> be reused.  What you're doing here is logically distinct from existing 
>>> offlining.
>>
>> Yes, or using a new pagetype bit to make the distinction clearer.
>> Especially the function names like maybe_set_page_offline() et al. are
>> confusing IMHO. They are all about accepting unaccepted memory ... and
>> should express that.
> 
> "Unaccepted" is UEFI treminology and I'm not sure we want to expose
> core-mm to it. Power/S390/ARM may have a different name for the same
> concept. Offline/online is neutral terminology, familiar to MM developers.

Personally, I'd much rather prefer clear UEFI terminology for now than
making the code more confusing to get. We can always generalize later
iff there are similar needs by other archs (and if they are able to come
up with a better name). But maybe we can find a different name immediately.

The issue with online vs. offline I have is that we already have enough
confusion:

offline page: memory section is offline. These pages are not managed by
the buddy. The memmap is stale unless we're dealing with special
ZONE_DEVICE memory.

logically offline pages: memory section is online and pages are
PageOffline(). These pages were removed from the buddy e.g., to free
them up in the hypervisor.

soft offline pages:  memory section is online and pages are
PageHWPoison(). These pages are removed from the buddy such that we
cannot allocate them to not trigger MCEs.


offline pages are exposed to the buddy by onlining them
(generic_online_page()), which is init+freeing. PageOffline() and
PageHWPoison() are onlined by removing the flag and freeing them to the
buddy.


Your case is different such that the pages are managed by the buddy and
they don't really have online/offline semantics compared to what we
already have. All the buddy has to do is prepare them for initial use.


I'm fine with reusing PageOffline(), but for the purpose of reading the
code, I think we really want some different terminology in page_alloc.c

So using any such terminology would make it clearer to me:
* PageBuddyUnprepared()
* PageBuddyUninitialized()
* PageBuddyUnprocessed()
* PageBuddyUnready()


> 
> What if I change accept->online in function names and document the meaning
> properly?
> 
>> I assume PageOffline() will be set only on the first sub-page of a
>> high-order PageBuddy() page, correct?
>>
>> Then we'll have to monitor all PageOffline() users such that they can
>> actually deal with PageBuddy() pages spanning *multiple* base pages for
>> a PageBuddy() page. For now it's clear that if a page is PageOffline(),
>> it cannot be PageBuddy() and cannot span more than one base page.
> 
>> E.g., fs/proc/kcore.c:read_kcore() assumes that PageOffline() is set on
>> individual base pages.
> 
> Right, pages that are offline from the hotplug POV are never on the page allocator's
> free lists, so it cannot ever step on them.
>
Mike Rapoport Jan. 15, 2022, 6:46 p.m. UTC | #13
On Wed, Jan 12, 2022 at 11:53:42AM -0800, Dave Hansen wrote:
> On 1/12/22 11:43 AM, Kirill A. Shutemov wrote:
> > On Tue, Jan 11, 2022 at 11:10:40AM -0800, Dave Hansen wrote:
> >> On 1/11/22 03:33, Kirill A. Shutemov wrote:
> >>
> >>> +	/* Mark unaccepted memory bitmap reserved */
> >>> +	if (boot_params.unaccepted_memory) {
> >>> +		unsigned long size;
> >>> +
> >>> +		/* One bit per 2MB */
> >>> +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> >>> +				    PMD_SIZE * BITS_PER_BYTE);
> >>> +		memblock_reserve(boot_params.unaccepted_memory, size);
> >>> +	}
> >>
> >> Is it OK that the size of the bitmap is inferred from
> >> e820__end_of_ram_pfn()?  Is this OK in the presence of mem= and other things
> >> that muck with the e820?
> > 
> > Good question. I think we are fine. If kernel is not able to allocate
> > memory from a part of physical address space we don't need the bitmap for
> > it either.
> 
> That's a good point.  If the e820 range does a one-way shrink it's
> probably fine.  The only problem would be if the bitmap had space for
> stuff past e820__end_of_ram_pfn() *and* it later needed to be accepted.

It's unlikely, but e820 can grow because of EFI and because of memmap=.
To be completely on the safe side, the unaccepted bitmap should be reserved
after parse_early_param() and efi_memblock_x86_reserve_range().

Since we anyway do not have memblock allocations before
e820__memblock_setup(), the simplest thing would be to put the reservation
first thing in e820__memblock_setup().
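
In other words, something along these lines (a sketch of the suggestion,
reusing the size calculation from the patch):

	void __init e820__memblock_setup(void)
	{
		/*
		 * Reserve the unaccepted-memory bitmap before any other
		 * memblock work, once parse_early_param() and
		 * efi_memblock_x86_reserve_range() have run.
		 */
		if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
		    boot_params.unaccepted_memory) {
			unsigned long size;

			/* One bit per 2MB */
			size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
					    PMD_SIZE * BITS_PER_BYTE);
			memblock_reserve(boot_params.unaccepted_memory, size);
		}

		/* ... existing body of e820__memblock_setup() ... */
	}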
Brijesh Singh Jan. 18, 2022, 9:05 p.m. UTC | #14
Hi Kirill,

...

> 
> The approach lowers boot time substantially. Boot to shell is ~2.5x
> faster for 4G TDX VM and ~4x faster for 64G.
> 
> Patches 1-6/7 are generic and don't have any dependencies on TDX. They
> should serve AMD SEV needs as well. TDX-specific code isolated in the
> last patch. This patch requires the core TDX patchset which is currently
> under review.
> 

I can confirm that this series works for the SEV-SNP guest. I was able
to hook the SEV-SNP page validation vmgexit (similar to the TDX patch #7)
and have verified that the guest kernel successfully accepted all the
memory regions marked unaccepted by the EFI boot loader.

Not a big deal, but can I ask you to include me in Cc on the future
series? I should be able to do more testing on SNP hardware and provide
my Tested-by tag.

~ Brijesh