| Message ID | 20220405234343.74045-2-kirill.shutemov@linux.intel.com |
|---|---|
| State | New |
| Series | mm, x86/cc: Implement support for unaccepted memory |
On 4/5/22 16:43, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, requiring memory to be accepted before it can be used by the
           ^ require
> guest. Accepting happens via a protocol specific for the Virtual Machine
> platform.
                                            ^ s/for/to
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
>
> Support of such memory requires a few changes in core-mm code:
>
>  - memblock has to accept memory on allocation;
>
>  - page allocator has to accept memory on the first allocation of the
>    page;
>
> Memblock change is trivial.
>
> The page allocator is modified to accept pages on the first allocation.
> PageUnaccepted() is used to indicate that the page requires acceptance.

Does this consume an actual page flag or is it aliased?

> Kernel only needs to accept memory once after boot, so during the boot
> and warm up phase there will be a lot of memory acceptance. After things
> are settled down the only price of the feature if couple of checks for
> PageUnaccepted() in allocate and free paths. The check refers a hot
                                                         ^ to
...
> + /*
> +  * PageUnaccepted() indicates that the page has to be "accepted" before it can
> +  * be used. Page allocator has to call accept_page() before returning the page
> +  * to the caller.
> +  */

Let's talk about "used" with a bit more detail.  Maybe:

	/*
	 * PageUnaccepted() indicates that the page has to be "accepted" before
	 * it can be read or written. The page allocator must to call
	 * accept_page() before touching the page or returning it to the caller.
	 */

...
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2db95780e003..53f4aa1c92a7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -121,6 +121,12 @@ typedef int __bitwise fpi_t;
>   */
>  #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
>
> +/*
> + * Check if the page needs to be marked as PageUnaccepted().
> + * Used for the new pages added to the buddy allocator for the first time.
> + */
> +#define FPI_UNACCEPTED		((__force fpi_t)BIT(3))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -1023,6 +1029,26 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>  	return page_is_buddy(higher_page, higher_buddy, order + 1);
>  }
>
> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	int i;
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		if (PageUnaccepted(page + i))
> +			__ClearPageUnaccepted(page + i);
> +	}
> +}

It's probably worth a comment somewhere that this can be really slow.

> +static bool page_is_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +
> +	return memory_is_unaccepted(start, start + (PAGE_SIZE << order));
> +}
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -1058,6 +1084,7 @@ static inline void __free_one_page(struct page *page,
>  	unsigned long combined_pfn;
>  	struct page *buddy;
>  	bool to_tail;
> +	bool unaccepted = PageUnaccepted(page);
>
>  	VM_BUG_ON(!zone_is_initialized(zone));
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1089,6 +1116,11 @@ static inline void __free_one_page(struct page *page,
>  			clear_page_guard(zone, buddy, order, migratetype);
>  		else
>  			del_page_from_free_list(buddy, zone, order);
> +
> +		/* Mark page unaccepted if any of merged pages were unaccepted */
> +		if (PageUnaccepted(buddy))
> +			unaccepted = true;

Naming nit: following the logic with a double-negative like !unaccepted
is a bit hard.  Would this be more readable if it were:

	bool page_needs_acceptance = PageUnaccepted(page);

and then the code below...

>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
> @@ -1124,6 +1156,17 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>  	set_buddy_order(page, order);
>
> +	/*
> +	 * Check if the page needs to be marked as PageUnaccepted().
> +	 * Used for the new pages added to the buddy allocator for the first
> +	 * time.
> +	 */
> +	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
> +		unaccepted = page_is_unaccepted(page, order);

	if (page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED))
		page_needs_acceptance = page_is_unaccepted(page, order);

> +	if (unaccepted)
> +		__SetPageUnaccepted(page);

This is getting hard for me to follow.

There are:
1. Pages that come in here with PageUnaccepted()==1
2. Pages that come in here with PageUnaccepted()==0, but a buddy that
   was PageUnaccepted()==1

In either of those cases, the bitmap will be consulted to see if the
page is *truly* unaccepted or not.  But, I'm struggling to figure out
how a page could end up in one of those scenarios and *not* be
page_is_unaccepted().

There are three pieces of information that come in:
1. PageUnaccepted(page)
2. PageUnaccepted(buddies[])
3. the bitmap

and one piece of information going out:

	PageUnaccepted(page);

I think I need a more coherent description of how those four things fit
together.

>  	if (fpi_flags & FPI_TO_TAIL)
>  		to_tail = true;
>  	else if (is_shuffle_order(order))
> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
>  static inline bool page_expected_state(struct page *page,
>  					unsigned long check_flags)
>  {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> +	    !PageUnaccepted(page))
>  		return false;

That probably deserves a comment, and maybe its own if() statement.

>  	if (unlikely((unsigned long)page->mapping |
> @@ -1654,7 +1698,8 @@ void __free_pages_core(struct page *page, unsigned int order)
>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>  	 * relevant for memory onlining.
>  	 */
> -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +	__free_pages_ok(page, order,
> +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | FPI_UNACCEPTED);
>  }
>
>  #ifdef CONFIG_NUMA
> @@ -1807,6 +1852,7 @@ static void __init deferred_free_range(unsigned long pfn,
>  		return;
>  	}
>
> +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);

Comment, please.  I assume doing the slow accept up front is OK here
because this is in the deferred path.  But, it would be nice to know for
sure.
> @@ -2266,6 +2312,10 @@ static inline void expand(struct zone *zone, struct page *page,
>  		if (set_page_guard(zone, &page[size], high, migratetype))
>  			continue;
>
> +		/* Transfer PageUnaccepted() to the newly split pages */
> +		if (PageUnaccepted(page))
> +			__SetPageUnaccepted(&page[size]);

We don't want to just accept the page here, right?  Because we're
holding the zone lock?  Maybe we should mention that:

		/*
		 * Transfer PageUnaccepted() to the newly split pages so
		 * they can be accepted after dropping the zone lock.
		 */

>  		add_to_free_list(&page[size], zone, high, migratetype);
>  		set_buddy_order(&page[size], high);
>  	}
> @@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	 */
>  	kernel_unpoison_pages(page, 1 << order);
>
> +	if (PageUnaccepted(page))
> +		accept_page(page, order);
> +
>  	/*
>  	 * As memory initialization might be integrated into KASAN,
>  	 * KASAN unpoisoning and memory initializion code must be

Is accepted memory guaranteed to be zeroed?  Do we want to skip the
__GFP_ZERO behavior later in this function?  Or is that just a silly
over-optimization?
On 4/5/22 16:43, Kirill A. Shutemov wrote:
> Kernel only needs to accept memory once after boot, so during the boot
> and warm up phase there will be a lot of memory acceptance. After things
> are settled down the only price of the feature if couple of checks for
> PageUnaccepted() in allocate and free paths. The check refers a hot
> variable (that also encodes PageBuddy()), so it is cheap and not visible
> on profiles.

Let's also not sugar-coat this.  Page acceptance is hideously slow.
It's agonizingly slow.  To boot, it's done holding a global spinlock
with interrupts disabled (see patch 6/8).  At the very, very least, each
acceptance operation involves a couple of what are effectively ring
transitions, a 2MB memset(), and a bunch of cache flushing.

The system is going to be downright unusable during this time, right?

Sure, it's *temporary* and only happens once at boot.  But, it's going
to suck.

Am I over-stating this in any way?

The ACCEPT_MEMORY vmstat is good to have around.  Thanks for adding it.
But, I think we should also write down some guidance like:

	If your TDX system seems as slow as a snail after boot, look at
	the "accept_memory" counter in /proc/vmstat.  If it is
	incrementing, then TDX memory acceptance is likely to blame.

Do we need anything more discrete to tell users when acceptance is over?
For instance, maybe they run something and it goes really slow, they
watch "accept_memory" until it stops.  They rejoice at their good
fortune!  Then, memory allocation starts falling over to a new node and
the agony begins anew.

I can think of dealing with this in two ways:

	cat /sys/.../unaccepted_pages_left

which just walks the bitmap and counts the number of pages remaining. or
something like:

	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop

Which will, well, make the pain stop on node0.
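For concreteness, a minimal sketch of the read-only sysfs counter proposed above. The attribute name follows the example in the mail, but the backing bitmap symbols (`unaccepted_bitmap`, `unaccepted_bitmap_bits`, one set bit per still-unaccepted 2MB chunk) and the attribute placement are hypothetical stand-ins, not part of the series:

```c
/*
 * Sketch only: report how many base pages remain unaccepted.
 * unaccepted_bitmap/unaccepted_bitmap_bits are hypothetical symbols;
 * the series defines an EFI-provided bitmap, not these names.
 */
#include <linux/device.h>
#include <linux/bitmap.h>
#include <linux/mm.h>

extern unsigned long *unaccepted_bitmap;	/* hypothetical: 1 bit per 2MB chunk */
extern unsigned long unaccepted_bitmap_bits;	/* hypothetical: bitmap size in bits */

static ssize_t unaccepted_pages_left_show(struct device *dev,
					  struct device_attribute *attr,
					  char *buf)
{
	unsigned long chunks;

	/* Count still-unaccepted 2MB chunks, then scale to 4k pages. */
	chunks = bitmap_weight(unaccepted_bitmap, unaccepted_bitmap_bits);

	return sysfs_emit(buf, "%lu\n", chunks * (PMD_SIZE / PAGE_SIZE));
}
static DEVICE_ATTR_RO(unaccepted_pages_left);
```

A real patch would hang this off the per-node devices so "falling over to a new node" is visible per node.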
On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, requiring memory to be accepted before it can be used by the
>
>            ^ require

Heh. That's wording from the spec.

> > guest. Accepting happens via a protocol specific for the Virtual Machine
> > platform.
>
>                                             ^ s/for/to
>
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> >
> > Support of such memory requires a few changes in core-mm code:
> >
> >  - memblock has to accept memory on allocation;
> >
> >  - page allocator has to accept memory on the first allocation of the
> >    page;
> >
> > Memblock change is trivial.
> >
> > The page allocator is modified to accept pages on the first allocation.
> > PageUnaccepted() is used to indicate that the page requires acceptance.
>
> Does this consume an actual page flag or is it aliased?

It is encoded as a page type in mapcount of unallocated memory. It is not
aliased with PageOffline() as I did before.

I will mention that it is a new page type.

> > Kernel only needs to accept memory once after boot, so during the boot
> > and warm up phase there will be a lot of memory acceptance. After things
> > are settled down the only price of the feature if couple of checks for
> > PageUnaccepted() in allocate and free paths. The check refers a hot
>
>                                                          ^ to
>
> ...
> > + /*
> > +  * PageUnaccepted() indicates that the page has to be "accepted" before it can
> > +  * be used. Page allocator has to call accept_page() before returning the page
> > +  * to the caller.
> > +  */
>
> Let's talk about "used" with a bit more detail.  Maybe:
>
> 	/*
> 	 * PageUnaccepted() indicates that the page has to be "accepted" before
> 	 * it can be read or written. The page allocator must to call
> 	 * accept_page() before touching the page or returning it to the caller.
> 	 */

I guess s/must to call/must call/, right?

> ...
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2db95780e003..53f4aa1c92a7 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -121,6 +121,12 @@ typedef int __bitwise fpi_t;
> >   */
> >  #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
> >
> > +/*
> > + * Check if the page needs to be marked as PageUnaccepted().
> > + * Used for the new pages added to the buddy allocator for the first time.
> > + */
> > +#define FPI_UNACCEPTED	((__force fpi_t)BIT(3))
> > +
> >  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
> >  static DEFINE_MUTEX(pcp_batch_high_lock);
> >  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> > @@ -1023,6 +1029,26 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> >  	return page_is_buddy(higher_page, higher_buddy, order + 1);
> >  }
> >
> > +static void accept_page(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +	int i;
> > +
> > +	accept_memory(start, start + (PAGE_SIZE << order));
> > +
> > +	for (i = 0; i < (1 << order); i++) {
> > +		if (PageUnaccepted(page + i))
> > +			__ClearPageUnaccepted(page + i);
> > +	}
> > +}
>
> It's probably worth a comment somewhere that this can be really slow.
>
> > +static bool page_is_unaccepted(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +
> > +	return memory_is_unaccepted(start, start + (PAGE_SIZE << order));
> > +}
> > +
> >  /*
> >   * Freeing function for a buddy system allocator.
> >   *
> > @@ -1058,6 +1084,7 @@ static inline void __free_one_page(struct page *page,
> >  	unsigned long combined_pfn;
> >  	struct page *buddy;
> >  	bool to_tail;
> > +	bool unaccepted = PageUnaccepted(page);
> >
> >  	VM_BUG_ON(!zone_is_initialized(zone));
> >  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> > @@ -1089,6 +1116,11 @@ static inline void __free_one_page(struct page *page,
> >  			clear_page_guard(zone, buddy, order, migratetype);
> >  		else
> >  			del_page_from_free_list(buddy, zone, order);
> > +
> > +		/* Mark page unaccepted if any of merged pages were unaccepted */
> > +		if (PageUnaccepted(buddy))
> > +			unaccepted = true;
>
> Naming nit: following the logic with a double-negative like !unaccepted
> is a bit hard.  Would this be more readable if it were:
>
> 	bool page_needs_acceptance = PageUnaccepted(page);
>
> and then the code below...
>
> >  		combined_pfn = buddy_pfn & pfn;
> >  		page = page + (combined_pfn - pfn);
> >  		pfn = combined_pfn;
> > @@ -1124,6 +1156,17 @@ static inline void __free_one_page(struct page *page,
> >  done_merging:
> >  	set_buddy_order(page, order);
> >
> > +	/*
> > +	 * Check if the page needs to be marked as PageUnaccepted().
> > +	 * Used for the new pages added to the buddy allocator for the first
> > +	 * time.
> > +	 */
> > +	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
> > +		unaccepted = page_is_unaccepted(page, order);
>
> 	if (page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED))
> 		page_needs_acceptance = page_is_unaccepted(page, order);
>
> > +	if (unaccepted)
> > +		__SetPageUnaccepted(page);
>
> This is getting hard for me to follow.
>
> There are:
> 1. Pages that come in here with PageUnaccepted()==1
> 2. Pages that come in here with PageUnaccepted()==0, but a buddy that
>    was PageUnaccepted()==1
>
> In either of those cases, the bitmap will be consulted to see if the
> page is *truly* unaccepted or not.  But, I'm struggling to figure out
> how a page could end up in one of those scenarios and *not* be
> page_is_unaccepted().
>
> There are three pieces of information that come in:
> 1. PageUnaccepted(page)
> 2. PageUnaccepted(buddies[])
> 3. the bitmap

1 and 2 are the same conceptually: merged-in pieces of the resulting page.

>
> and one piece of information going out:
>
> 	PageUnaccepted(page);
>
> I think I need a more coherent description of how those four things fit
> together.

The page gets marked as PageUnaccepted() if any of the merged-in pages is
PageUnaccepted().

For new pages, just being added to the buddy allocator, consult
page_is_unaccepted(). FPI_UNACCEPTED indicates that the page is new and
the page_is_unaccepted() check is required.

Avoid calling page_is_unaccepted() if it is known that the page needs
acceptance anyway. It can be costly.

Is it a good enough explanation?

FPI_UNACCEPTED is not a good name. Any help with a better one?
FPI_CHECK_UNACCEPTED?
> > 	if (fpi_flags & FPI_TO_TAIL)
> > 		to_tail = true;
> > 	else if (is_shuffle_order(order))
> > @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> >  static inline bool page_expected_state(struct page *page,
> >  					unsigned long check_flags)
> >  {
> > -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > +	    !PageUnaccepted(page))
> >  		return false;
>
> That probably deserves a comment, and maybe its own if() statement.

Own if does not work. PageUnaccepted() is encoded in _mapcount.

What about this:

	/*
	 * page->_mapcount is expected to be -1.
	 *
	 * There is an exception for PageUnaccepted(). The page type can be set
	 * for pages on free list. Page types are encoded in _mapcount.
	 *
	 * PageUnaccepted() will get cleared in post_alloc_hook().
	 */
	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
		return false;

?

> >  	if (unlikely((unsigned long)page->mapping |
> > @@ -1654,7 +1698,8 @@ void __free_pages_core(struct page *page, unsigned int order)
> >  	 * Bypass PCP and place fresh pages right to the tail, primarily
> >  	 * relevant for memory onlining.
> >  	 */
> > -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> > +	__free_pages_ok(page, order,
> > +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | FPI_UNACCEPTED);
> >  }
> >
> >  #ifdef CONFIG_NUMA
> > @@ -1807,6 +1852,7 @@ static void __init deferred_free_range(unsigned long pfn,
> >  		return;
> >  	}
> >
> > +	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
> >  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
> >  		if ((pfn & (pageblock_nr_pages - 1)) == 0)
> >  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>
> Comment, please.  I assume doing the slow accept up front is OK here
> because this is in the deferred path.  But, it would be nice to know for
> sure.

It is upfront acceptance of less than a pageblock. I will add a comment.

>
> > @@ -2266,6 +2312,10 @@ static inline void expand(struct zone *zone, struct page *page,
> >  		if (set_page_guard(zone, &page[size], high, migratetype))
> >  			continue;
> >
> > +		/* Transfer PageUnaccepted() to the newly split pages */
> > +		if (PageUnaccepted(page))
> > +			__SetPageUnaccepted(&page[size]);
>
> We don't want to just accept the page here, right?  Because we're
> holding the zone lock?  Maybe we should mention that:
>
> 		/*
> 		 * Transfer PageUnaccepted() to the newly split pages so
> 		 * they can be accepted after dropping the zone lock.
> 		 */

Okay.

> >  		add_to_free_list(&page[size], zone, high, migratetype);
> >  		set_buddy_order(&page[size], high);
> >  	}
> > @@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  	 */
> >  	kernel_unpoison_pages(page, 1 << order);
> >
> > +	if (PageUnaccepted(page))
> > +		accept_page(page, order);
> > +
> >  	/*
> >  	 * As memory initialization might be integrated into KASAN,
> >  	 * KASAN unpoisoning and memory initializion code must be
>
> Is accepted memory guaranteed to be zeroed?  Do we want to skip the
> __GFP_ZERO behavior later in this function?  Or is that just a silly
> over-optimization?

For TDX, it is true that the memory gets cleared on acceptance, but I
don't think we can say the same for any possible implementation.

I would rather leave __GFP_ZERO for peace of mind. Clearing the cache-hot
page for the second time shouldn't be a big deal compared to the
acceptance cost.
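As background for the page-type discussion above, here is a small userspace demonstration of the inverted encoding Kirill describes. The constants mirror include/linux/page-flags.h plus the PG_unaccepted value added by this patch; the harness itself is purely illustrative:

```c
#include <stdio.h>

/*
 * Page types live in the word that also holds _mapcount and are encoded
 * inverted: a freshly freed page has _mapcount == -1 (all bits set), and
 * "setting" a type *clears* its bit.
 */
#define PAGE_TYPE_BASE	0xf0000000u
#define PG_unaccepted	0x00000800u	/* added by this patch */

int main(void)
{
	unsigned int page_type = -1;	/* no type set, no mappings */

	/* __SetPageUnaccepted(): clear the type bit */
	page_type &= ~PG_unaccepted;

	/* PageType() test, as PageUnaccepted() expands to */
	printf("PageUnaccepted: %d\n",
	       (page_type & (PAGE_TYPE_BASE | PG_unaccepted)) == PAGE_TYPE_BASE);

	/* __ClearPageUnaccepted(): set the bit back */
	page_type |= PG_unaccepted;

	printf("PageUnaccepted after clear: %d\n",
	       (page_type & (PAGE_TYPE_BASE | PG_unaccepted)) == PAGE_TYPE_BASE);
	return 0;
}
```

This prints 1 and then 0: an unaccepted page deviates from -1 by exactly one cleared bit, which is what the `(_mapcount | PG_unaccepted) != -1` check later in the thread exploits.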
On Fri, Apr 08, 2022 at 12:11:58PM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > Kernel only needs to accept memory once after boot, so during the boot
> > and warm up phase there will be a lot of memory acceptance. After things
> > are settled down the only price of the feature if couple of checks for
> > PageUnaccepted() in allocate and free paths. The check refers a hot
> > variable (that also encodes PageBuddy()), so it is cheap and not visible
> > on profiles.
>
> Let's also not sugar-coat this.  Page acceptance is hideously slow.
> It's agonizingly slow.  To boot, it's done holding a global spinlock
> with interrupts disabled (see patch 6/8).  At the very, very least, each
> acceptance operation involves a couple of what are effectively ring
> transitions, a 2MB memset(), and a bunch of cache flushing.
>
> The system is going to be downright unusable during this time, right?

Well, yes. The CPU that is doing the accepting is completely blocked by
it. But other CPUs may proceed until each, in its turn, steps onto memory
accepting.

> Sure, it's *temporary* and only happens once at boot.  But, it's going
> to suck.
>
> Am I over-stating this in any way?
>
> The ACCEPT_MEMORY vmstat is good to have around.  Thanks for adding it.
> But, I think we should also write down some guidance like:
>
> 	If your TDX system seems as slow as a snail after boot, look at
> 	the "accept_memory" counter in /proc/vmstat.  If it is
> 	incrementing, then TDX memory acceptance is likely to blame.

Sure. Will add to commit message.

> Do we need anything more discrete to tell users when acceptance is over?

I can imagine setups where acceptance is never over. A VM running a
workload with a fixed dataset can have plenty of memory unaccepted.

I don't think "make it over" should be the goal.

> For instance, maybe they run something and it goes really slow, they
> watch "accept_memory" until it stops.  They rejoice at their good
> fortune!  Then, memory allocation starts falling over to a new node and
> the agony begins anew.
>
> I can think of dealing with this in two ways:
>
> 	cat /sys/.../unaccepted_pages_left
>
> which just walks the bitmap and counts the number of pages remaining. or
> something like:
>
> 	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop
>
> Which will, well, make the pain stop on node0.

Sure we can add handles. But API is hard. Maybe we should wait and see
what is actually needed. (Yes, I'm lazy.:)
On 4/9/22 08:54, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
>>> The page allocator is modified to accept pages on the first allocation.
>>> PageUnaccepted() is used to indicate that the page requires acceptance.
>>
>> Does this consume an actual page flag or is it aliased?
>
> It is encoded as a page type in mapcount of unallocated memory. It is not
> aliased with PageOffline() as I did before.
>
> I will mention that it is a new page type.

Guess I should have looked at the code. :)

Are we just increasingly using the StudlyNames() for anything to do with
pages?

>>> + /*
>>> +  * PageUnaccepted() indicates that the page has to be "accepted" before it can
>>> +  * be used. Page allocator has to call accept_page() before returning the page
>>> +  * to the caller.
>>> +  */
>>
>> Let's talk about "used" with a bit more detail.  Maybe:
>>
>> 	/*
>> 	 * PageUnaccepted() indicates that the page has to be "accepted" before
>> 	 * it can be read or written. The page allocator must to call
>> 	 * accept_page() before touching the page or returning it to the caller.
>> 	 */
>
> I guess s/must to call/must call/, right?

Yep.

...
>>> +	/*
>>> +	 * Check if the page needs to be marked as PageUnaccepted().
>>> +	 * Used for the new pages added to the buddy allocator for the first
>>> +	 * time.
>>> +	 */
>>> +	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
>>> +		unaccepted = page_is_unaccepted(page, order);
>>
>> 	if (page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED))
>> 		page_needs_acceptance = page_is_unaccepted(page, order);
>>
>>> +	if (unaccepted)
>>> +		__SetPageUnaccepted(page);
>>
>> This is getting hard for me to follow.
>>
>> There are:
>> 1. Pages that come in here with PageUnaccepted()==1
>> 2. Pages that come in here with PageUnaccepted()==0, but a buddy that
>>    was PageUnaccepted()==1
>>
>> In either of those cases, the bitmap will be consulted to see if the
>> page is *truly* unaccepted or not.  But, I'm struggling to figure out
>> how a page could end up in one of those scenarios and *not* be
>> page_is_unaccepted().
>>
>> There are three pieces of information that come in:
>> 1. PageUnaccepted(page)
>> 2. PageUnaccepted(buddies[])
>> 3. the bitmap
>
> 1 and 2 are the same conceptually: merged-in pieces of the resulting page.
>
>>
>> and one piece of information going out:
>>
>> 	PageUnaccepted(page);
>>
>> I think I need a more coherent description of how those four things fit
>> together.
>
> The page gets marked as PageUnaccepted() if any of the merged-in pages is
> PageUnaccepted().
>
> For new pages, just being added to the buddy allocator, consult
> page_is_unaccepted(). FPI_UNACCEPTED indicates that the page is new and
> the page_is_unaccepted() check is required.
>
> Avoid calling page_is_unaccepted() if it is known that the page needs
> acceptance anyway. It can be costly.
>
> Is it a good enough explanation?

Yeah, elaborating on the slow and fast paths makes a lot of sense.

> FPI_UNACCEPTED is not a good name. Any help with a better one?
> FPI_CHECK_UNACCEPTED?

Maybe even something like FPI_UNACCEPTED_SLOWPATH.  Then you can say
that when this is passed in the pages might not have PageUnaccepted()
set and the slow bitmap needs to be consulted.
>>> 	if (fpi_flags & FPI_TO_TAIL)
>>> 		to_tail = true;
>>> 	else if (is_shuffle_order(order))
>>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
>>>  static inline bool page_expected_state(struct page *page,
>>>  					unsigned long check_flags)
>>>  {
>>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
>>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
>>> +	    !PageUnaccepted(page))
>>>  		return false;
>>
>> That probably deserves a comment, and maybe its own if() statement.
>
> Own if does not work. PageUnaccepted() is encoded in _mapcount.
>
> What about this:
>
> 	/*
> 	 * page->_mapcount is expected to be -1.
> 	 *
> 	 * There is an exception for PageUnaccepted(). The page type can be set
> 	 * for pages on free list. Page types are encoded in _mapcount.
> 	 *
> 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> 	 */
> 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> 		return false;
>
> ?

That's better.  But, aren't the PG_* names usually reserved for real
page->flags bits?  That naming might be part of my confusion.

>>>  		add_to_free_list(&page[size], zone, high, migratetype);
>>>  		set_buddy_order(&page[size], high);
>>>  	}
>>> @@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>>>  	 */
>>>  	kernel_unpoison_pages(page, 1 << order);
>>>
>>> +	if (PageUnaccepted(page))
>>> +		accept_page(page, order);
>>> +
>>>  	/*
>>>  	 * As memory initialization might be integrated into KASAN,
>>>  	 * KASAN unpoisoning and memory initializion code must be
>>
>> Is accepted memory guaranteed to be zeroed?  Do we want to skip the
>> __GFP_ZERO behavior later in this function?  Or is that just a silly
>> over-optimization?
>
> For TDX, it is true that the memory gets cleared on acceptance, but I
> don't think we can say the same for any possible implementation.
>
> I would rather leave __GFP_ZERO for peace of mind. Clearing the cache-hot
> page for the second time shouldn't be a big deal compared to the
> acceptance cost.

Sure, fair enough.
On 4/9/22 10:52, Kirill A. Shutemov wrote:
> On Fri, Apr 08, 2022 at 12:11:58PM -0700, Dave Hansen wrote:
>> On 4/5/22 16:43, Kirill A. Shutemov wrote:
>>> Kernel only needs to accept memory once after boot, so during the boot
>>> and warm up phase there will be a lot of memory acceptance. After things
>>> are settled down the only price of the feature if couple of checks for
>>> PageUnaccepted() in allocate and free paths. The check refers a hot
>>> variable (that also encodes PageBuddy()), so it is cheap and not visible
>>> on profiles.
>>
>> Let's also not sugar-coat this.  Page acceptance is hideously slow.
>> It's agonizingly slow.  To boot, it's done holding a global spinlock
>> with interrupts disabled (see patch 6/8).  At the very, very least, each
>> acceptance operation involves a couple of what are effectively ring
>> transitions, a 2MB memset(), and a bunch of cache flushing.
>>
>> The system is going to be downright unusable during this time, right?

...

>> Do we need anything more discrete to tell users when acceptance is over?
>
> I can imagine setups where acceptance is never over. A VM running a
> workload with a fixed dataset can have plenty of memory unaccepted.
>
> I don't think "make it over" should be the goal.

I agree, there will be users that don't care when acceptance is over.
But, I'm also sure that there are users that will care deeply.

>> For instance, maybe they run something and it goes really slow, they
>> watch "accept_memory" until it stops.  They rejoice at their good
>> fortune!  Then, memory allocation starts falling over to a new node and
>> the agony begins anew.
>>
>> I can think of dealing with this in two ways:
>>
>> 	cat /sys/.../unaccepted_pages_left
>>
>> which just walks the bitmap and counts the number of pages remaining. or
>> something like:
>>
>> 	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop
>>
>> Which will, well, make the pain stop on node0.
>
> Sure we can add handles. But API is hard. Maybe we should wait and see
> what is actually needed. (Yes, I'm lazy.:)

Let's just call out the possible (probable?) need for new ABI here.
Maybe it will cue folks who care to speak up.
>>> 	if (fpi_flags & FPI_TO_TAIL)
>>> 		to_tail = true;
>>> 	else if (is_shuffle_order(order))
>>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
>>>  static inline bool page_expected_state(struct page *page,
>>>  					unsigned long check_flags)
>>>  {
>>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
>>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
>>> +	    !PageUnaccepted(page))
>>>  		return false;
>>
>> That probably deserves a comment, and maybe its own if() statement.
>
> Own if does not work. PageUnaccepted() is encoded in _mapcount.
>
> What about this:
>
> 	/*
> 	 * page->_mapcount is expected to be -1.
> 	 *
> 	 * There is an exception for PageUnaccepted(). The page type can be set
> 	 * for pages on free list. Page types are encoded in _mapcount.
> 	 *
> 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> 	 */
> 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> 		return false;
>
> ?

Please don't. Keep the usage of PG_* details inside page-flags.h
On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
>
> >>> 	if (fpi_flags & FPI_TO_TAIL)
> >>> 		to_tail = true;
> >>> 	else if (is_shuffle_order(order))
> >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> >>>  static inline bool page_expected_state(struct page *page,
> >>>  					unsigned long check_flags)
> >>>  {
> >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> >>> +	    !PageUnaccepted(page))
> >>>  		return false;
> >>
> >> That probably deserves a comment, and maybe its own if() statement.
> >
> > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> >
> > What about this:
> >
> > 	/*
> > 	 * page->_mapcount is expected to be -1.
> > 	 *
> > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > 	 * for pages on free list. Page types are encoded in _mapcount.
> > 	 *
> > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > 	 */
> > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))

Maybe I'm missing something, but isn't this true for any PageType?

> > 		return false;
> >
> > ?
>
> That's better.  But, aren't the PG_* names usually reserved for real
> page->flags bits?  That naming might be part of my confusion.

We use them for PageType as well, like PG_buddy, PG_offline, PG_table.
On Sun, Apr 10, 2022 at 11:41:57PM -0700, Dave Hansen wrote:
> Let's just call out the possible (probable?) need for new ABI here.
> Maybe it will cue folks who care to speak up.

Err, why would you teach the user to go poke at some arbitrary sysfs
nodes when the accepting code can simply issue a printk from time to
time

  "Guest unaccepted memory progress: XX%. This slows down operations at the moment."
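A rough sketch of what such a periodic progress printk could look like. Both `unaccepted_chunks_left()` and `total_chunks` are hypothetical names, and the 30-second ratelimit is an arbitrary choice, not anything from the series:

```c
#include <linux/ratelimit.h>
#include <linux/printk.h>

static void report_acceptance_progress(void)
{
	/* At most one message every 30 seconds. */
	static DEFINE_RATELIMIT_STATE(rs, 30 * HZ, 1);
	unsigned long left = unaccepted_chunks_left();	/* hypothetical helper */

	if (__ratelimit(&rs))
		pr_info("Guest unaccepted memory progress: %lu%%. This slows down operations at the moment.\n",
			100 - (100 * left) / total_chunks);	/* total_chunks: hypothetical */
}
```

The accepting path would call this from accept_memory(), so progress is reported exactly while the slowdown is happening.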
On 4/11/22 08:55, Borislav Petkov wrote:
> On Sun, Apr 10, 2022 at 11:41:57PM -0700, Dave Hansen wrote:
>> Let's just call out the possible (probable?) need for new ABI here.
>> Maybe it will cue folks who care to speak up.
> Err, why would you teach the user to go poke at some arbitrary sysfs
> nodes when the accepting code can simply issue a printk from time to
> time
>
> "Guest unaccepted memory progress: XX%. This slows down operations at the moment."

I guess that's not a horrible place to start.  It's also not *horribly*
different from how guests work today.  If hosts lazily allocate RAM,
they'll see largely the same kind of behavior.

What ends up determining how much memory is pre-accepted versus being
done from the guest?  Is that just a normal part of setting up a TDX
guest, like from the qemu cmdline?  Or, is there some convention with
the virtual firmware?
On 4/11/22 11:27, Dave Hansen wrote:
> On 4/11/22 08:55, Borislav Petkov wrote:
>> On Sun, Apr 10, 2022 at 11:41:57PM -0700, Dave Hansen wrote:
>>> Let's just call out the possible (probable?) need for new ABI here.
>>> Maybe it will cue folks who care to speak up.
>> Err, why would you teach the user to go poke at some arbitrary sysfs
>> nodes when the accepting code can simply issue a printk from time to
>> time
>>
>> "Guest unaccepted memory progress: XX%. This slows down operations at the moment."
>
> I guess that's not a horrible place to start.  It's also not *horribly*
> different from how guests work today.  If hosts lazily allocate RAM,
> they'll see largely the same kind of behavior.
>
> What ends up determining how much memory is pre-accepted versus being
> done from the guest?  Is that just a normal part of setting up a TDX
> guest, like from the qemu cmdline?  Or, is there some convention with
> the virtual firmware?

With SNP, some memory will be accepted as part of the LAUNCH_UPDATE
sequences that the hypervisor performs, but that is not all of the guest
memory. Once the guest is started, the (initial implementation of) OVMF
SNP support will accept (PVALIDATE) all of the remaining guest memory.
When the kernel boots, there isn't any unaccepted memory.

Once support is available in the kernel for unaccepted memory, then OVMF
could be updated to only accept a limited amount of memory and pass the
information about the unaccepted memory to the kernel through the EFI
memory map. The approaches would have to be measured to see which ends
up being the best one.

The GHCB specification allows for lots of memory to be accepted in a
single VMGEXIT (world switch) vs performing a VMGEXIT for each 2MB of
memory being accepted.

Thanks,
Tom
On 08.04.22 21:11, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
>> Kernel only needs to accept memory once after boot, so during the boot
>> and warm up phase there will be a lot of memory acceptance. After things
>> are settled down the only price of the feature if couple of checks for
>> PageUnaccepted() in allocate and free paths. The check refers a hot
>> variable (that also encodes PageBuddy()), so it is cheap and not visible
>> on profiles.
>
> Let's also not sugar-coat this.  Page acceptance is hideously slow.
> It's agonizingly slow.  To boot, it's done holding a global spinlock
> with interrupts disabled (see patch 6/8).  At the very, very least, each
> acceptance operation involves a couple of what are effectively ring
> transitions, a 2MB memset(), and a bunch of cache flushing.
>
> The system is going to be downright unusable during this time, right?
>
> Sure, it's *temporary* and only happens once at boot.  But, it's going
> to suck.
>
> Am I over-stating this in any way?
>
> The ACCEPT_MEMORY vmstat is good to have around.  Thanks for adding it.
> But, I think we should also write down some guidance like:
>
> 	If your TDX system seems as slow as a snail after boot, look at
> 	the "accept_memory" counter in /proc/vmstat.  If it is
> 	incrementing, then TDX memory acceptance is likely to blame.
>
> Do we need anything more discrete to tell users when acceptance is over?
> For instance, maybe they run something and it goes really slow, they
> watch "accept_memory" until it stops.  They rejoice at their good
> fortune!  Then, memory allocation starts falling over to a new node and
> the agony begins anew.
>
> I can think of dealing with this in two ways:
>
> 	cat /sys/.../unaccepted_pages_left
>
> which just walks the bitmap and counts the number of pages remaining. or
> something like:
>
> 	echo 1 > /sys/devices/system/node/node0/make_the_pain_stop
>
> Which will, well, make the pain stop on node0.

Either I'm missing something important or the random pain might just
take a really long time to stop?

I mean, we tend to reallocate the memory first that we freed last
(putting it to the head of the freelist when freeing and picking from
the head when allocating). So unless your kernel goes crazy and
allocates each and every page right after boot, essentially accepting
all memory, you might have random unaccepted pages lurking at the tail
of the freelists.

So if the VM is running for 355 days without significant memory
pressure, you can still run into unaccepted pages at day 356 that
results in a random delay due to acceptance of memory.

I think we most certainly want some way to make the random pain stop,
or to make the random pain go away after boot quickly. The
"unaccepted_pages_left" indicator would just be a "hey, there might be
random delays, but you cannot do anything about it".

Magic toggles like "make_the_pain_stop" are not so nice. Can we simply
automate this using a kthread or smth like that, which just traverses
the free page lists and accepts pages (similar, but different to free
page reporting)?
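A rough sketch of the shape such a background accepter could take; this is an assumption about the design being discussed, not an implementation. `take_unaccepted_page()` is hypothetical and would have to cooperate with the zone lock and the allocator, much as free page reporting does:

```c
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/gfp.h>

/* Sketch only: accept memory in the background, off the allocation path. */
static int accept_memory_thread(void *arg)
{
	struct zone *zone = arg;
	unsigned int order;
	struct page *page;

	while (!kthread_should_stop()) {
		/* hypothetical: detach one PageUnaccepted() page from the free lists */
		page = take_unaccepted_page(zone, &order);
		if (!page)
			break;	/* zone fully accepted */

		accept_page(page, order);	/* the slow part, done here instead */
		__free_pages(page, order);	/* give it back, now accepted */

		cond_resched();	/* the actual workload has priority */
	}
	return 0;
}
```

The trade-off Dave raises below is exactly about whether running something like this is worth the premature memory consumption.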
On 4/12/22 01:15, David Hildenbrand wrote:
> Can we simply automate this using a kthread or smth like that, which
> just traverses the free page lists and accepts pages (similar, but
> different to free page reporting)?

That's definitely doable.

The downside is that this will force premature consumption of physical
memory resources that the guest may never use.  That's a particular
problem on TDX systems since there is no way for a VMM to reclaim guest
memory short of killing the guest.

In other words, I can see a good argument either way:
1. The kernel should accept everything to avoid the perf nastiness
2. The kernel should accept only what it needs in order to reduce memory
   use

I'm kinda partial to #1 though, if I had to pick only one.

The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
Have the rule that everything that gets a 'struct page' must be
accepted.  If you want to do delayed acceptance, you do it via
DEFERRED_STRUCT_PAGE_INIT.
On 12.04.22 18:08, Dave Hansen wrote:
> On 4/12/22 01:15, David Hildenbrand wrote:
>> Can we simply automate this using a kthread or smth like that, which
>> just traverses the free page lists and accepts pages (similar, but
>> different to free page reporting)?
>
> That's definitely doable.
>
> The downside is that this will force premature consumption of physical
> memory resources that the guest may never use.  That's a particular
> problem on TDX systems since there is no way for a VMM to reclaim guest
> memory short of killing the guest.

IIRC, the hypervisor will usually effectively populate all guest RAM
either way right now. So yes, for hypervisors that might optimize for
that, that statement would be true. But I lost track how helpful it
would be in the near future e.g., with the fd-based private guest memory
-- maybe they already optimize for delayed acceptance of memory, turning
it into delayed population.

>
> In other words, I can see a good argument either way:
> 1. The kernel should accept everything to avoid the perf nastiness
> 2. The kernel should accept only what it needs in order to reduce memory
>    use
>
> I'm kinda partial to #1 though, if I had to pick only one.
>
> The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
> Have the rule that everything that gets a 'struct page' must be
> accepted.  If you want to do delayed acceptance, you do it via
> DEFERRED_STRUCT_PAGE_INIT.

That could also be an option, yes. At least being able to choose would be
good. But IIRC, DEFERRED_STRUCT_PAGE_INIT will still make the system get
stuck during boot and wait until everything was accepted.

I see the following variants:

1) Slow boot; after boot, all memory is already accepted.
2) Fast boot; after boot, all memory will slowly but steadily get
accepted in the background. After a while, all memory is accepted and
can be signaled to user space.
3) Fast boot; after boot, memory gets accepted on demand. This is what
we have in this series.

I somehow don't quite like 3), but with deferred population in the
hypervisor, it might just make sense.
On Wed, Apr 13, 2022 at 12:36:11PM +0200, David Hildenbrand wrote:
> On 12.04.22 18:08, Dave Hansen wrote:
> > On 4/12/22 01:15, David Hildenbrand wrote:
> >> Can we simply automate this using a kthread or smth like that, which
> >> just traverses the free page lists and accepts pages (similar, but
> >> different to free page reporting)?
> >
> > That's definitely doable.
> >
> > The downside is that this will force premature consumption of physical
> > memory resources that the guest may never use.  That's a particular
> > problem on TDX systems since there is no way for a VMM to reclaim guest
> > memory short of killing the guest.
>
> IIRC, the hypervisor will usually effectively populate all guest RAM
> either way right now.

No, it is not usual. By default QEMU/KVM uses anonymous mapping and
fault-in memory on demand.

Yes, there's an option to pre-populate guest memory, but it is not the
default.

> So yes, for hypervisors that might optimize for
> that, that statement would be true. But I lost track how helpful it
> would be in the near future e.g., with the fd-based private guest memory
> -- maybe they already optimize for delayed acceptance of memory, turning
> it into delayed population.
>
> >
> > In other words, I can see a good argument either way:
> > 1. The kernel should accept everything to avoid the perf nastiness
> > 2. The kernel should accept only what it needs in order to reduce memory
> >    use
> >
> > I'm kinda partial to #1 though, if I had to pick only one.
> >
> > The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
> > Have the rule that everything that gets a 'struct page' must be
> > accepted.  If you want to do delayed acceptance, you do it via
> > DEFERRED_STRUCT_PAGE_INIT.
>
> That could also be an option, yes. At least being able to choose would be
> good. But IIRC, DEFERRED_STRUCT_PAGE_INIT will still make the system get
> stuck during boot and wait until everything was accepted.

Right. The deferred page init has to be done before init.

> I see the following variants:
>
> 1) Slow boot; after boot, all memory is already accepted.
> 2) Fast boot; after boot, all memory will slowly but steadily get
> accepted in the background. After a while, all memory is accepted and
> can be signaled to user space.
> 3) Fast boot; after boot, memory gets accepted on demand. This is what
> we have in this series.
>
> I somehow don't quite like 3), but with deferred population in the
> hypervisor, it might just make sense.

Conceptually, 3 is not different from what happens now. The first time a
normal VM touches the page (like on handling __GFP_ZERO) the page gets
allocated on the host. It can take a very long time if it kicks in direct
reclaim on the host.

The only difference is that it is *usually* slower.

I guess we can make a case for making 1 an option to match the
pre-populated use case for normal VMs.

Frankly, I think option 2 is the worst one. You still take CPU cycles
from the workload after boot to do the job that may or may not be needed.
It is a half-measure that helps nobody.
On 13.04.22 13:30, Kirill A. Shutemov wrote:
> On Wed, Apr 13, 2022 at 12:36:11PM +0200, David Hildenbrand wrote:
>> On 12.04.22 18:08, Dave Hansen wrote:
>>> On 4/12/22 01:15, David Hildenbrand wrote:
>>>> Can we simply automate this using a kthread or smth like that, which
>>>> just traverses the free page lists and accepts pages (similar, but
>>>> different to free page reporting)?
>>>
>>> That's definitely doable.
>>>
>>> The downside is that this will force premature consumption of physical
>>> memory resources that the guest may never use.  That's a particular
>>> problem on TDX systems since there is no way for a VMM to reclaim guest
>>> memory short of killing the guest.
>>
>> IIRC, the hypervisor will usually effectively populate all guest RAM
>> either way right now.
>
> No, it is not usual. By default QEMU/KVM uses anonymous mapping and
> fault-in memory on demand.
>
> Yes, there's an option to pre-populate guest memory, but it is not the
> default.

Let me be clearer: I'm talking about the TDX/SEV world, not ordinary
unencrypted VMs. For ordinary unencrypted VMs we do have populate on
demand frequently.

For SEV we currently pin all guest memory and consequently don't have
populate on demand. For TDX, again, I did not follow how fd-based
private guest memory will behave. I thought I remembered that we will
similarly not have populate-on-demand.

Preallocation is usually used with huge pages, but I guess that's out of
scope right now for encrypted VMs.
On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> >
> > >>> 	if (fpi_flags & FPI_TO_TAIL)
> > >>> 		to_tail = true;
> > >>> 	else if (is_shuffle_order(order))
> > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > >>>  static inline bool page_expected_state(struct page *page,
> > >>>  					unsigned long check_flags)
> > >>>  {
> > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > >>> +	    !PageUnaccepted(page))
> > >>>  		return false;
> > >>
> > >> That probably deserves a comment, and maybe its own if() statement.
> > >
> > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > >
> > > What about this:
> > >
> > > 	/*
> > > 	 * page->_mapcount is expected to be -1.
> > > 	 *
> > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > 	 *
> > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > 	 */
> > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
>
> Maybe I'm missing something, but isn't this true for any PageType?
>
> > > 		return false;
> > >
> > > ?
> >
> > That's better.  But, aren't the PG_* names usually reserved for real
> > page->flags bits?  That naming might be part of my confusion.
>
> We use them for PageType as well, like PG_buddy, PG_offline, PG_table.

PG_buddy gets cleared on removal from the free list, before the check.

PG_offline and PG_table pages are never on free lists.
On Wed, Apr 13, 2022 at 12:36:11PM +0200, David Hildenbrand wrote:
> On 12.04.22 18:08, Dave Hansen wrote:
> > On 4/12/22 01:15, David Hildenbrand wrote:
> >
> > The other option might be to tie this all to DEFERRED_STRUCT_PAGE_INIT.
> > Have the rule that everything that gets a 'struct page' must be
> > accepted.  If you want to do delayed acceptance, you do it via
> > DEFERRED_STRUCT_PAGE_INIT.
>
> That could also be an option, yes. At least being able to choose would be
> good. But IIRC, DEFERRED_STRUCT_PAGE_INIT will still make the system get
> stuck during boot and wait until everything was accepted.

The deferred page init runs multithreaded, so a guest with SMP will be
stuck for less time.

> I see the following variants:
>
> 1) Slow boot; after boot, all memory is already accepted.
> 2) Fast boot; after boot, all memory will slowly but steadily get
> accepted in the background. After a while, all memory is accepted and
> can be signaled to user space.
> 3) Fast boot; after boot, memory gets accepted on demand. This is what
> we have in this series.
>
> I somehow don't quite like 3), but with deferred population in the
> hypervisor, it might just make sense.

IMHO, deferred population in the hypervisor will be way more complex than
this series with similar "visible" performance.
On Wed, Apr 13, 2022 at 02:40:01PM +0300, Kirill A. Shutemov wrote:
> On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> > On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > >
> > > >>> 	if (fpi_flags & FPI_TO_TAIL)
> > > >>> 		to_tail = true;
> > > >>> 	else if (is_shuffle_order(order))
> > > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > > >>>  static inline bool page_expected_state(struct page *page,
> > > >>>  					unsigned long check_flags)
> > > >>>  {
> > > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > > >>> +	    !PageUnaccepted(page))
> > > >>>  		return false;
> > > >>
> > > >> That probably deserves a comment, and maybe its own if() statement.
> > > >
> > > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > >
> > > > What about this:
> > > >
> > > > 	/*
> > > > 	 * page->_mapcount is expected to be -1.
> > > > 	 *
> > > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > > 	 *
> > > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > > 	 */
> > > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> >
> > Maybe I'm missing something, but isn't this true for any PageType?
>
> PG_buddy gets cleared on removal from the free list, before the check.
>
> PG_offline and PG_table pages are never on free lists.

Right, this will work 'cause PageType is inverted. I still think this
condition is hard to parse and I liked the old variant with
!PageUnaccepted() better.

Maybe if we wrap the whole construct in a helper it will be less eye
hurting.

> --
> Kirill A. Shutemov
On Wed, Apr 13, 2022 at 05:48:09PM +0300, Mike Rapoport wrote:
> On Wed, Apr 13, 2022 at 02:40:01PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> > > On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > > > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > > >
> > > > >>> 	if (fpi_flags & FPI_TO_TAIL)
> > > > >>> 		to_tail = true;
> > > > >>> 	else if (is_shuffle_order(order))
> > > > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > > > >>>  static inline bool page_expected_state(struct page *page,
> > > > >>>  					unsigned long check_flags)
> > > > >>>  {
> > > > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > > > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > > > >>> +	    !PageUnaccepted(page))
> > > > >>>  		return false;
> > > > >>
> > > > >> That probably deserves a comment, and maybe its own if() statement.
> > > > >
> > > > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > > >
> > > > > What about this:
> > > > >
> > > > > 	/*
> > > > > 	 * page->_mapcount is expected to be -1.
> > > > > 	 *
> > > > > 	 * There is an exception for PageUnaccepted(). The page type can be set
> > > > > 	 * for pages on free list. Page types are encoded in _mapcount.
> > > > > 	 *
> > > > > 	 * PageUnaccepted() will get cleared in post_alloc_hook().
> > > > > 	 */
> > > > > 	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> > >
> > > Maybe I'm missing something, but isn't this true for any PageType?
> >
> > PG_buddy gets cleared on removal from the free list, before the check.
> >
> > PG_offline and PG_table pages are never on free lists.
>
> Right, this will work 'cause PageType is inverted. I still think this
> condition is hard to parse and I liked the old variant with
> !PageUnaccepted() better.

Well, the old way to deal with PageUnaccepted() had a flaw: if the page is
PageUnaccepted() it will allow any other page type to pass here. Like
PG_unaccepted + PG_buddy will slide here.

> Maybe if we wrap the whole construct in a helper it will be less eye
> hurting.

Hm. Any suggestion for how such a helper could look? Cannot think of
anything sane.
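The flaw is easy to demonstrate with the real bit values (PG_buddy and PAGE_TYPE_BASE from page-flags.h, PG_unaccepted from this series); the userspace harness below is illustrative only:

```c
#include <stdio.h>

#define PAGE_TYPE_BASE	0xf0000000
#define PG_buddy	0x00000080
#define PG_unaccepted	0x00000800

/* PageUnaccepted(), as the PAGE_TYPE_OPS() macro would expand it */
static int page_unaccepted(int mapcount)
{
	return ((unsigned int)mapcount & (PAGE_TYPE_BASE | PG_unaccepted)) ==
		PAGE_TYPE_BASE;
}

int main(void)
{
	/* a broken page that is both PG_buddy and PG_unaccepted */
	int mapcount = -1 & ~(PG_buddy | PG_unaccepted);

	/* old check: only flags the page if it is not PageUnaccepted() */
	int old_flags_it = mapcount != -1 && !page_unaccepted(mapcount);

	/* new check: flags any deviation from -1 other than PG_unaccepted */
	int new_flags_it = (mapcount | PG_unaccepted) != -1;

	printf("old check catches it: %d\n", old_flags_it);	/* 0: slides through */
	printf("new check catches it: %d\n", new_flags_it);	/* 1: caught */
	return 0;
}
```

ORing PG_unaccepted back in forgives exactly that one cleared bit, so any other cleared page-type bit (like PG_buddy here) still makes the value differ from -1 and the bad state is caught.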
On 4/13/22 04:30, Kirill A. Shutemov wrote:
>> 2) Fast boot; after boot, all memory will slowly but steadily get
>> accepted in the background. After a while, all memory is accepted and
>> can be signaled to user space.
...
> Frankly, I think option 2 is the worst one. You still take CPU cycles
> from the workload after boot to do the job that may or may not be needed.
> It is a half-measure that helps nobody.

Let's not be too hyperbolic here.  "Worst" is entirely subjective and it
totally depends on your perspective and what you care about.

There are basically four options:

 * Accept everything in early boot
 * Accept with deferred page free
 * Accept with kthread after boot
 * Accept on demand

and four things that matter:

 * Code complexity
 * Time to a shell prompt
 * CPU/Memory waste
 * Deterministic overhead

Did I miss any?

News flash: none of the options wins on all the things that matter.
We're going to have to pick one (or maybe two).  I'm also not horribly
convinced that there's a problem here worth solving, especially one that
requires surgery in the core of the buddy allocator.

This is essentially making a performance argument: it takes too long to
boot if we go with a simpler solution.  Yet, I haven't seen any data.  I
think we need to go with the simplest approach(es) until there's some
actual data to guide us here.

Here's another way to look at it:

> https://docs.google.com/spreadsheets/d/1Fpv0Yp0CTF5_JXHR2pywvNtImTwUVGTxDMlJ5t8qiis/edit?usp=sharing
On 13.04.22 17:36, Dave Hansen wrote:
> On 4/13/22 04:30, Kirill A. Shutemov wrote:
>>> 2) Fast boot; after boot, all memory will slowly but steadily get
>>> accepted in the background. After a while, all memory is accepted and
>>> can be signaled to user space.
> ...
>> Frankly, I think option 2 is the worst one. You still take CPU cycles
>> from the workload after boot to do the job that may or may not be needed.
>> It is a half-measure that helps nobody.
>
> Let's not be too hyperbolic here. "Worst" is entirely subjective and it
> totally depends on your perspective and what you care about.

Right. Some people might want to start their workload as soon as the
pain is really over. Some might want to have a functional system before
that, others might not care.

>
> There are basically four options:
>
> * Accept everything in early boot
> * Accept with deferred page free
> * Accept with kthread after boot
> * Accept on demand
>
> and four things that matter:
>
> * Code complexity
> * Time to a shell prompt
> * CPU/Memory waste
> * Deterministic overhead
>
> Did I miss any?

Nothing that comes to mind.

>
> News flash: none of the options wins on all the things that matter.
> We're going to have to pick one (or maybe two). I'm also not horribly
> convinced that there's a problem here worth solving, especially one that
> requires surgery in the core of the buddy allocator.
>
> This is essentially making a performance argument: it takes too long to
> boot if we go with a simpler solution. Yet, I haven't seen any data. I
> think we need to go with the simplest approach(es) until there's some
> actual data to guide us here.

Simplest meaning: accept everything during early boot and don't touch
core-mm/buddy code, correct?

>
> Here's another way to look at it:
>
>> https://docs.google.com/spreadsheets/d/1Fpv0Yp0CTF5_JXHR2pywvNtImTwUVGTxDMlJ5t8qiis/edit?usp=sharing
>
On 4/13/22 09:07, David Hildenbrand wrote:
> Simplest meaning: accept everything during early boot and don't touch
> core-mm/buddy code, correct?

Yes, exactly.
On Wed, Apr 13, 2022 at 08:36:52AM -0700, Dave Hansen wrote:
> On 4/13/22 04:30, Kirill A. Shutemov wrote:
> >> 2) Fast boot; after boot, all memory will slowly but steadily get
> >> accepted in the background. After a while, all memory is accepted and
> >> can be signaled to user space.
> ...
> > Frankly, I think option 2 is the worst one. You still take CPU cycles
> > from the workload after boot to do the job that may or may not be needed.
> > It is a half-measure that helps nobody.
>
> Let's not be too hyperbolic here.  "Worst" is entirely subjective and it
> totally depends on your perspective and what you care about.
>
> There are basically four options:
>
>  * Accept everything in early boot
>  * Accept with deferred page free
>  * Accept with kthread after boot
>  * Accept on demand
>
> and four things that matter:
>
>  * Code complexity
>  * Time to a shell prompt
>  * CPU/Memory waste
>  * Deterministic overhead
>
> Did I miss any?

"Time to shell" is not equal to "time to do the job". Real workloads do
stuff beyond memory allocations. But, yes, it is harder to quantify.

> News flash: none of the options wins on all the things that matter.
> We're going to have to pick one (or maybe two).  I'm also not horribly
> convinced that there's a problem here worth solving, especially one that
> requires surgery in the core of the buddy allocator.
>
> This is essentially making a performance argument: it takes too long to
> boot if we go with a simpler solution.  Yet, I haven't seen any data.  I
> think we need to go with the simplest approach(es) until there's some
> actual data to guide us here.
>
> Here's another way to look at it:
>
> > https://docs.google.com/spreadsheets/d/1Fpv0Yp0CTF5_JXHR2pywvNtImTwUVGTxDMlJ5t8qiis/edit?usp=sharing

The link is view-only.

AFAICS, complexity of the kthread approach is on par or greater compared
to on-demand. You need coordination between the allocator and the thread.
It can be hard to hit the right balance for the kthread between being a
CPU hog and not providing enough accepted memory.
On Wed, Apr 13, 2022 at 06:15:17PM +0300, Kirill A. Shutemov wrote:
> On Wed, Apr 13, 2022 at 05:48:09PM +0300, Mike Rapoport wrote:
> > On Wed, Apr 13, 2022 at 02:40:01PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Apr 11, 2022 at 01:07:29PM +0300, Mike Rapoport wrote:
> > > > On Sun, Apr 10, 2022 at 11:38:08PM -0700, Dave Hansen wrote:
> > > > > On 4/9/22 08:54, Kirill A. Shutemov wrote:
> > > > > > On Fri, Apr 08, 2022 at 11:55:43AM -0700, Dave Hansen wrote:
> > > > > >
> > > > > >>>  	if (fpi_flags & FPI_TO_TAIL)
> > > > > >>>  		to_tail = true;
> > > > > >>>  	else if (is_shuffle_order(order))
> > > > > >>> @@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
> > > > > >>>  static inline bool page_expected_state(struct page *page,
> > > > > >>>  					unsigned long check_flags)
> > > > > >>>  {
> > > > > >>> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> > > > > >>> +	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
> > > > > >>> +	    !PageUnaccepted(page))
> > > > > >>>  		return false;
> > > > > >>
> > > > > >> That probably deserves a comment, and maybe its own if() statement.
> > > > > >
> > > > > > Own if does not work. PageUnaccepted() is encoded in _mapcount.
> > > > > >
> > > > > > What about this:
> > > > > >
> > > > > > /*
> > > > > >  * page->_mapcount is expected to be -1.
> > > > > >  *
> > > > > >  * There is an exception for PageUnaccepted(). The page type can be set
> > > > > >  * for pages on free list. Page types are encoded in _mapcount.
> > > > > >  *
> > > > > >  * PageUnaccepted() will get cleared in post_alloc_hook().
> > > > > >  */
> > > > > > if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))
> > > >
> > > > Maybe I'm missing something, but isn't this true for any PageType?
> > >
> > > PG_buddy gets cleared on removal from the free list, before the check.
> > >
> > > PG_offline and PG_table pages are never on free lists.
> >
> > Right, this will work 'cause PageType is inverted. I still think this
> > condition is hard to parse and I liked the old variant with
> > !PageUnaccepted() better.
>
> Well, the old way to deal with PageUnaccepted() had a flaw: if the page
> is PageUnaccepted() it will allow any other page type to pass here. Like
> PG_unaccepted + PG_buddy will slip through here.

It seems to me that there was an implicit assumption that page types are
exclusive and PG_unaccepted would break it.

> > Maybe if we wrap the whole construct in a helper it will be less
> > eye-hurting.
>
> Hm. Any suggestion how such a helper could look? Cannot think of
> anything sane.

Me neither :(

How about updating the comment to be

	/*
	 * The page must not be mapped to userspace and must not have
	 * a PageType other than PageUnaccepted.
	 * This means that page->_mapcount must be -1 or have only the
	 * PG_unaccepted bit cleared.
	 */
	if (unlikely((atomic_read(&page->_mapcount) | PG_unaccepted) != -1))

> --
> Kirill A. Shutemov
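
The inverted PageType encoding is what makes the single check work: a
page type is "set" by clearing its bit in page->_mapcount, which
otherwise reads as -1. The following userspace toy (not kernel code;
only the PG_ values are copied from the patch) shows exactly which
states pass the proposed check, including why PG_unaccepted + PG_buddy
no longer slips through:

	#include <stdio.h>

	#define PG_buddy	0x00000080
	#define PG_unaccepted	0x00000800

	/* passes only when no type is set (-1) or only PG_unaccepted is */
	static int page_expected_state_ok(int mapcount)
	{
		return (mapcount | PG_unaccepted) == -1;
	}

	int main(void)
	{
		int no_type = -1;
		int only_unaccepted = -1 & ~PG_unaccepted;
		int buddy = -1 & ~PG_buddy;
		int both = -1 & ~(PG_unaccepted | PG_buddy);

		printf("no type:                %d\n", page_expected_state_ok(no_type));	/* 1 */
		printf("only PG_unaccepted:     %d\n", page_expected_state_ok(only_unaccepted));	/* 1 */
		printf("PG_buddy:               %d\n", page_expected_state_ok(buddy));	/* 0 */
		printf("PG_unaccepted+PG_buddy: %d\n", page_expected_state_ok(both));	/* 0 */
		return 0;
	}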
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9d8eeaa67d05..aaaedc111092 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -928,6 +928,7 @@ static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_unaccepted	0x00000800
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -953,6 +954,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
 	page->page_type |= PG_##lname;					\
 }
 
+#define PAGE_TYPE_OPS_FALSE(uname)					\
+static __always_inline int Page##uname(struct page *page)		\
+{									\
+	return false;							\
+}									\
+static __always_inline void __SetPage##uname(struct page *page)	\
+{									\
+}									\
+static __always_inline void __ClearPage##uname(struct page *page)	\
+{									\
+}
+
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
@@ -983,6 +996,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+/*
+ * PageUnaccepted() indicates that the page has to be "accepted" before it can
+ * be used. Page allocator has to call accept_page() before returning the page
+ * to the caller.
+ */
+#ifdef CONFIG_UNACCEPTED_MEMORY
+PAGE_TYPE_OPS(Unaccepted, unaccepted)
+#else
+PAGE_TYPE_OPS_FALSE(Unaccepted)
+#endif
+
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
 extern void page_offline_begin(void);
diff --git a/mm/internal.h b/mm/internal.h
index cf16280ce132..10302fe857c4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -758,4 +758,15 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 
 DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end)
+{
+	return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index e4f03a6e8e56..a1f7f8b304d5 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 	 */
 	kmemleak_alloc_phys(found, size, 0, 0);
 
+	/*
+	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+	 * require memory to be accepted before it can be used by the
+	 * guest.
+	 *
+	 * Accept the memory of the allocated buffer.
+	 */
+	accept_memory(found, found + size);
+
 	return found;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2db95780e003..53f4aa1c92a7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -121,6 +121,12 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
 
+/*
+ * Check if the page needs to be marked as PageUnaccepted().
+ * Used for the new pages added to the buddy allocator for the first time.
+ */
+#define FPI_UNACCEPTED		((__force fpi_t)BIT(3))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1023,6 +1029,26 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
 	return page_is_buddy(higher_page, higher_buddy, order + 1);
 }
 
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	int i;
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+
+	for (i = 0; i < (1 << order); i++) {
+		if (PageUnaccepted(page + i))
+			__ClearPageUnaccepted(page + i);
+	}
+}
+
+static bool page_is_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+
+	return memory_is_unaccepted(start, start + (PAGE_SIZE << order));
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -1058,6 +1084,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_pfn;
 	struct page *buddy;
 	bool to_tail;
+	bool unaccepted = PageUnaccepted(page);
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1089,6 +1116,11 @@ static inline void __free_one_page(struct page *page,
 			clear_page_guard(zone, buddy, order, migratetype);
 		else
 			del_page_from_free_list(buddy, zone, order);
+
+		/* Mark page unaccepted if any of merged pages were unaccepted */
+		if (PageUnaccepted(buddy))
+			unaccepted = true;
+
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -1124,6 +1156,17 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_buddy_order(page, order);
 
+	/*
+	 * Check if the page needs to be marked as PageUnaccepted().
+	 * Used for the new pages added to the buddy allocator for the first
+	 * time.
+	 */
+	if (!unaccepted && (fpi_flags & FPI_UNACCEPTED))
+		unaccepted = page_is_unaccepted(page, order);
+
+	if (unaccepted)
+		__SetPageUnaccepted(page);
+
 	if (fpi_flags & FPI_TO_TAIL)
 		to_tail = true;
 	else if (is_shuffle_order(order))
@@ -1149,7 +1192,8 @@ static inline void __free_one_page(struct page *page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
+	    !PageUnaccepted(page))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
@@ -1654,7 +1698,8 @@ void __free_pages_core(struct page *page, unsigned int order)
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
 	 */
-	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
+	__free_pages_ok(page, order,
+			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | FPI_UNACCEPTED);
 }
 
 #ifdef CONFIG_NUMA
@@ -1807,6 +1852,7 @@ static void __init deferred_free_range(unsigned long pfn,
 		return;
 	}
 
+	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -2266,6 +2312,10 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
+		/* Transfer PageUnaccepted() to the newly split pages */
+		if (PageUnaccepted(page))
+			__SetPageUnaccepted(&page[size]);
+
 		add_to_free_list(&page[size], zone, high, migratetype);
 		set_buddy_order(&page[size], high);
 	}
@@ -2396,6 +2446,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	 */
 	kernel_unpoison_pages(page, 1 << order);
 
+	if (PageUnaccepted(page))
+		accept_page(page, order);
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * KASAN unpoisoning and memory initializion code must be
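
The mm/internal.h stubs in the patch above cover only the
!CONFIG_UNACCEPTED_MEMORY case. For orientation, a hedged sketch of what
a bitmap-backed implementation of the two helpers might look like when
the feature is enabled. The 2M granularity, the unaccepted_bitmap and
arch_accept_memory() names, and the locking are all assumptions for
illustration, not this series' actual code:

	#include <linux/bitmap.h>
	#include <linux/kernel.h>
	#include <linux/spinlock.h>

	#define UNACCEPTED_UNIT		(2UL << 20)	/* assumed: 2M chunks */

	static unsigned long *unaccepted_bitmap;	/* assumed: handed over by the EFI stub */
	static unsigned long unaccepted_bits;		/* number of tracked chunks */
	static DEFINE_SPINLOCK(unaccepted_lock);

	/* hypothetical hook for the TDX/SEV-SNP acceptance protocol */
	void arch_accept_memory(phys_addr_t start, phys_addr_t end);

	bool memory_is_unaccepted(phys_addr_t start, phys_addr_t end)
	{
		unsigned long first = start / UNACCEPTED_UNIT;
		unsigned long last = min(unaccepted_bits,
				(unsigned long)DIV_ROUND_UP(end, UNACCEPTED_UNIT));

		/* unaccepted if any chunk overlapping [start, end) is still set */
		return find_next_bit(unaccepted_bitmap, last, first) < last;
	}

	void accept_memory(phys_addr_t start, phys_addr_t end)
	{
		unsigned long i;

		spin_lock(&unaccepted_lock);
		for (i = start / UNACCEPTED_UNIT;
		     i < unaccepted_bits && i * UNACCEPTED_UNIT < end; i++) {
			if (test_bit(i, unaccepted_bitmap)) {
				/* slow: traps to the VMM/firmware per chunk */
				arch_accept_memory(i * UNACCEPTED_UNIT,
						   (i + 1) * UNACCEPTED_UNIT);
				clear_bit(i, unaccepted_bitmap);
			}
		}
		spin_unlock(&unaccepted_lock);
	}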