
[PATCHv7,02/14] mm: Add support for unaccepted memory

Message ID 20220614120231.48165-3-kirill.shutemov@linux.intel.com
State New
Series mm, x86/cc: Implement support for unaccepted memory

Commit Message

Kirill A. Shutemov June 14, 2022, 12:02 p.m. UTC
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    has no runtime cost once the system is booted. The downside is a
    very long boot time.

    Acceptance can be parallelized across multiple CPUs to keep it
    manageable (e.g. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
    saturate memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on first use. It requires more
    infrastructure and changes in the page allocator to make it work,
    but it provides good boot time.

    On-demand memory acceptance means latency spikes every time the
    kernel steps onto a new memory block. The spikes will go away once
    the workload data set size stabilizes or all memory gets accepted.

 3. Accept all memory in the background. Introduce a thread (or
    several) that accepts memory proactively. It minimizes the time the
    system experiences latency spikes on memory allocation while
    keeping boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires a functional scheduler, but
    the page allocator may need to tap into unaccepted memory before
    that.

    The downside of this approach is that the threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    the user experience.

Implement #2 for now. It is a reasonable default. Some workloads may
want to use #1 or #3 and they can be implemented later based on user
demand.

Supporting unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - the page allocator has to accept memory on the first allocation of
    a page.

The memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
A new page type (encoded in _mapcount) -- PageUnaccepted() -- is used to
indicate that the page requires acceptance.

An architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h | 31 +++++++++++++
 mm/internal.h              | 12 +++++
 mm/memblock.c              |  9 ++++
 mm/page_alloc.c            | 89 +++++++++++++++++++++++++++++++++++++-
 4 files changed, 139 insertions(+), 2 deletions(-)

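For reference, an architecture-side implementation of these two helpers could look roughly like the sketch below. This is not part of the patch: the 2M granularity, the bitmap, and arch_accept_range() are illustrative assumptions, and a real implementation would also need locking around the bitmap plus the platform-specific (TDX/SEV-SNP) conversion call.

/* Illustrative sketch only -- not part of this patch. */
#define UNACCEPTED_UNIT		(2UL << 20)	/* assumed 2M granularity */

static unsigned long *unaccepted_bitmap;	/* one bit per unit, set up early */

void arch_accept_range(phys_addr_t start, unsigned long size);	/* hypothetical */

bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
{
	unsigned long first = start / UNACCEPTED_UNIT;
	unsigned long last = DIV_ROUND_UP(end, UNACCEPTED_UNIT);

	return find_next_bit(unaccepted_bitmap, last, first) < last;
}

void accept_memory(phys_addr_t start, phys_addr_t end)
{
	unsigned long unit;

	for (unit = start / UNACCEPTED_UNIT;
	     unit < DIV_ROUND_UP(end, UNACCEPTED_UNIT); unit++) {
		if (test_and_clear_bit(unit, unaccepted_bitmap))
			arch_accept_range(unit * UNACCEPTED_UNIT,
					  UNACCEPTED_UNIT);
	}
}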
Comments

Tom Lendacky June 17, 2022, 7:28 p.m. UTC | #1
On 6/14/22 07:02, Kirill A. Shutemov wrote:
> [ commit message and diffstat trimmed; see above ]
> 

> diff --git a/mm/memblock.c b/mm/memblock.c
> index e4f03a6e8e56..a1f7f8b304d5 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>   		 */
>   		kmemleak_alloc_phys(found, size, 0, 0);
>   
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);

The SNP support will kmalloc a descriptor that can be used to supply the 
range to the hypervisor using the GHCB/VMGEXIT. But kmalloc won't work 
when called this early, so we are likely to need an early_accept_memory or 
some kind of flag to know whether this is an early call or not in order to 
use a static descriptor in the file.

Thanks,
Tom

> +
>   	return found;
>   }
>
Tom Lendacky June 17, 2022, 8:53 p.m. UTC | #2
On 6/17/22 14:28, Tom Lendacky wrote:
> On 6/14/22 07:02, Kirill A. Shutemov wrote:
>> [ commit message and diffstat trimmed; see above ]
> 
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index e4f03a6e8e56..a1f7f8b304d5 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -1405,6 +1405,15 @@ phys_addr_t __init 
>> memblock_alloc_range_nid(phys_addr_t size,
>>            */
>>           kmemleak_alloc_phys(found, size, 0, 0);
>> +    /*
>> +     * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
>> +     * require memory to be accepted before it can be used by the
>> +     * guest.
>> +     *
>> +     * Accept the memory of the allocated buffer.
>> +     */
>> +    accept_memory(found, found + size);
> 
> The SNP support will kmalloc a descriptor that can be used to supply the 
> range to the hypervisor using the GHCB/VMGEXIT. But kmalloc won't work 
> when called this early, so we are likely to need an early_accept_memory or 
> some kind of flag to know whether this is an early call or not in order to 
> use a static descriptor in the file.

Then again, the accept_memory() call from mm/page_alloc.c can also be 
early... maybe I can check system_state and make a determination as to 
which SNP page state change mechanism to use (GHCB vs MSR protocol). It 
might not be performance optimal, though, as the MSR protocol is one 4K 
page at a time.

Thanks,
Tom

> 
> Thanks,
> Tom
> 
>> +
>>       return found;
>>   }
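A rough sketch of the idea Tom describes (the pvalidate helpers below are hypothetical stand-ins for the SNP page state change paths; only the system_state/slab_is_available() checks are the point):

void snp_accept_memory(phys_addr_t start, phys_addr_t end)
{
	/*
	 * Sketch only: too early for kmalloc()/GHCB, fall back to the
	 * MSR protocol (one 4K page per exchange); otherwise batch the
	 * page state changes through the GHCB/VMGEXIT.
	 */
	if (system_state == SYSTEM_BOOTING || !slab_is_available())
		early_pvalidate_range(start, end);	/* hypothetical */
	else
		ghcb_pvalidate_range(start, end);	/* hypothetical */
}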
Borislav Petkov July 21, 2022, 3:14 p.m. UTC | #3
On Tue, Jun 14, 2022 at 03:02:19PM +0300, Kirill A. Shutemov wrote:
>     On-demand memory accept means latency spikes every time kernel steps
>     onto a new memory block. The spikes will go away once workload data
>     set size gets stabilized or all memory gets accepted.

What does that mean?

If we're accepting 2M pages and considering referential locality, how
are those "spikes" even noticeable?
Dave Hansen July 21, 2022, 3:49 p.m. UTC | #4
On 7/21/22 08:14, Borislav Petkov wrote:
> On Tue, Jun 14, 2022 at 03:02:19PM +0300, Kirill A. Shutemov wrote:
>>     On-demand memory accept means latency spikes every time kernel steps
>>     onto a new memory block. The spikes will go away once workload data
>>     set size gets stabilized or all memory gets accepted.
> What does that mean?
> 
> If we're accepting 2M pages and considering referential locality, how
> are those "spikes" even noticeable?

Acceptance is slow and the heavy lifting is done inside the TDX module.
 It involves flushing old aliases out of the caches and initializing the
memory integrity metadata for every cacheline.  This implementation does
acceptance in 2MB chunks while holding a global lock.

So, those (effective) 2MB clflush+memset's (plus a few thousand cycles
for the hypercall/tdcall transitions) can't happen in parallel.  They
are serialized and must wait on each other.  If you have a few hundred
CPUs all trying to allocate memory (say, doing the first kernel compile
after a reboot), this is going to be very, very painful for a while.

That said, I think this is the right place to _start_.  There is going
to need to be some kind of follow-on solution (likely background
acceptance of some kind).  But, even with that solution, *this* code is
still needed to handle the degenerate case where the background accepter
can't keep up with foreground memory needs.
Borislav Petkov July 22, 2022, 7:18 p.m. UTC | #5
On Thu, Jul 21, 2022 at 08:49:31AM -0700, Dave Hansen wrote:
> Acceptance is slow and the heavy lifting is done inside the TDX module.
>  It involves flushing old aliases out of the caches and initializing the
> memory integrity metadata for every cacheline.  This implementation does
> acceptance in 2MB chunks while holding a global lock.

Oh, fun.

> So, those (effective) 2MB clflush+memset's (plus a few thousand cycles
> for the hypercall/tdcall transitions)

So this sounds strange - page validation on AMD - judging by the
pseudocode of the PVALIDATE insn - does a bunch of sanity checks on the
gVA of the page and then installs it into the RMP and also "PVALIDATE
performs the same segmentation and paging checks as a 1-byte read.
PVALIDATE does not invalidate TLB caches."

But that still sounds a lot less work than what the TDX module needs to
do...

> can't happen in parallel. They are serialized and must wait on each
> other.

Ofc, the Intel version of the RMP table needs to be protected. :-)

> If you have a few hundred CPUs all trying to allocate memory (say,
> doing the first kernel compile after a reboot), this is going to be
> very, very painful for a while.
>
> That said, I think this is the right place to _start_. There is going
> to need to be some kind of follow-on solution (likely background
> acceptance of some kind). But, even with that solution, *this* code
> is still needed to handle the degenerate case where the background
> accepter can't keep up with foreground memory needs.

I'm still catering to the view that it should be a two-tier thing: you
validate during boot a certain amount - say 4G - a size for which the
boot delay is acceptable and you do the rest on-demand along with a
background accepter.

That should give you the best of both worlds...

Thx.
Dave Hansen July 22, 2022, 7:30 p.m. UTC | #6
On 7/22/22 12:18, Borislav Petkov wrote:
> On Thu, Jul 21, 2022 at 08:49:31AM -0700, Dave Hansen wrote:
>> So, those (effective) 2MB clflush+memset's (plus a few thousand cycles
>> for the hypercall/tdcall transitions)
> 
> So this sounds strange - page validation on AMD - judging by the
> pseudocode of the PVALIDATE insn - does a bunch of sanity checks on the
> gVA of the page and then installs it into the RMP and also "PVALIDATE
> performs the same segmentation and paging checks as a 1-byte read.
> PVALIDATE does not invalidate TLB caches."
> 
> But that still sounds a lot less work than what the TDX module needs to
> do...

Sure does...  *Something* has to manage the cache coherency so that old
physical aliases of the converted memory don't write back and clobber
new data.  But, maybe the hardware is doing that now.


>> If you have a few hundred CPUs all trying to allocate memory (say,
>> doing the first kernel compile after a reboot), this is going to be
>> very, very painful for a while.
>>
>> That said, I think this is the right place to _start_. There is going
>> to need to be some kind of follow-on solution (likely background
>> acceptance of some kind). But, even with that solution, *this* code
>> is still needed to handle the degenerate case where the background
>> accepter can't keep up with foreground memory needs.
> 
> I'm still catering to the view that it should be a two-tier thing: you
> validate during boot a certain amount - say 4G - a size for which the
> boot delay is acceptable and you do the rest on-demand along with a
> background accepter.
> 
> That should give you the best of both worlds...

Yeah, that two-tier system is the way it's happening today from what I
understand.  This whole conversation is about how to handle the >4GB memory.
Borislav Petkov July 25, 2022, 12:23 p.m. UTC | #7
On Fri, Jul 22, 2022 at 12:30:36PM -0700, Dave Hansen wrote:
> Sure does...  *Something* has to manage the cache coherency so that old
> physical aliases of the converted memory don't write back and clobber
> new data.  But, maybe the hardware is doing that now.

Let's hope.

> Yeah, that two-tier system is the way it's happening today from what
> I understand. This whole conversation is about how to handle the >4GB
> memory.

Would it be possible to pre-accept a bunch of mem - think "pre-fault" -
from userspace?

I.e., I'm thinking some huge process is going to start in the VM, VM
userspace goes and causes a chunk of memory to be pre-accepted and then
the process starts and runs more-or-less smoothly as the majority of its
memory has already been "prepared".

Or does that not make any sense from mm perspective?
David Hildenbrand July 25, 2022, 12:38 p.m. UTC | #8
On 25.07.22 14:23, Borislav Petkov wrote:
> On Fri, Jul 22, 2022 at 12:30:36PM -0700, Dave Hansen wrote:
>> Sure does...  *Something* has to manage the cache coherency so that old
>> physical aliases of the converted memory don't write back and clobber
>> new data.  But, maybe the hardware is doing that now.
> 
> Let's hope.
> 
>> Yeah, that two-tier system is the way it's happening today from what
>> I understand. This whole conversation is about how to handle the >4GB
>> memory.
> 
> Would it be possible to pre-accept a bunch of mem - think "pre-fault" -
> from userspace?
> 
> I.e., I'm thinking some huge process is going to start in the VM, VM
> userspace goes and causes a chunk of memory to be pre-accepted and then
> the process starts and runs more-or-less smoothly as the majority of its
> memory has already been "prepared".
> 
> Or does that not make any sense from mm perspective?
> 

The less core-MM code to handle unaccepted memory the better. Meaning,
that any kind of additional pre-acceptance (in addition to what we have
here) needs good justification.
Borislav Petkov July 25, 2022, 12:53 p.m. UTC | #9
On Mon, Jul 25, 2022 at 02:38:57PM +0200, David Hildenbrand wrote:
> The less core-MM code to handle unaccepted memory the better. Meaning,
> that any kind of additional pre-acceptance (in addition to what we have
> here) needs good justification.

I actually mean to have a user interface to have the core code
pre-accept. I.e., interface to what will be already there anyway.

Or is that frowned upon too?
Mike Rapoport July 25, 2022, 1 p.m. UTC | #10
On Mon, Jul 25, 2022 at 02:23:20PM +0200, Borislav Petkov wrote:
> On Fri, Jul 22, 2022 at 12:30:36PM -0700, Dave Hansen wrote:
> > Sure does...  *Something* has to manage the cache coherency so that old
> > physical aliases of the converted memory don't write back and clobber
> > new data.  But, maybe the hardware is doing that now.
> 
> Let's hope.
> 
> > Yeah, that two-tier system is the way it's happening today from what
> > I understand. This whole conversation is about how to handle the >4GB
> > memory.
> 
> Would it be possible to pre-accept a bunch of mem - think "pre-fault" -
> from userspace?
> 
> I.e., I'm thinking some huge process is going to start in the VM, VM
> userspace goes and causes a chunk of memory to be pre-accepted and then
> the process starts and runs more-or-less smoothly as the majority of its
> memory has already been "prepared".

An application in the VM can do mlock() or mmap(..., MAP_POPULATE, ...) and
this will essentially force acceptance of that memory.

But there's no sysctl or something for that.
 
> Or does that not make any sense from mm perspective?
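For example, a guest application could pre-fault (and therefore pre-accept) its working set before the latency-sensitive phase starts; a minimal userspace sketch:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;		/* pre-fault 1 GiB */
	void *p;

	/*
	 * MAP_POPULATE pre-faults the mapping, which forces the guest
	 * kernel to accept the backing memory up front rather than on
	 * first touch; mlock() would have a similar effect.
	 */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ... run the latency-sensitive workload here ... */
	return 0;
}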
Borislav Petkov July 25, 2022, 1:05 p.m. UTC | #11
On Mon, Jul 25, 2022 at 04:00:14PM +0300, Mike Rapoport wrote:
> An application in the VM can do mlock() or mmap(..., MAP_POPULATE, ...) and
> this will essentially force acceptance of that memory.

Ah, cool, that's what I meant.

> But there's no sysctl or something for that.

Yeah, no need.

I was simply wondering whether one can relocate the acceptance work to
the moment prior to starting the process so that it can run smoothly
once started and doesn't cause spikes due to on-demand acceptance.

At least not too many.

Thx.
David Hildenbrand July 26, 2022, 2:30 p.m. UTC | #12
On 25.07.22 14:53, Borislav Petkov wrote:
> On Mon, Jul 25, 2022 at 02:38:57PM +0200, David Hildenbrand wrote:
>> The less core-MM code to handle unaccepted memory the better. Meaning,
>> that any kind of additional pre-acceptance (in addition to what we have
>> here) needs good justification.
> 
> I actually mean to have a user interface to have the core code
> pre-accept. I.e., interface to what will be already there anyway.
>

Ah, got it. The "issue" with ordinary prefaulting in user space is that
you'll also end up zeroing the memory. But yeah, it should just accept a
bunch of memory.

> Or is that frowned upon too?

I was assuming you'd want some additional interface that only accepts
memory without actually temporarily allocating it.
Vlastimil Babka Aug. 5, 2022, 11:49 a.m. UTC | #13
On 6/14/22 14:02, Kirill A. Shutemov wrote:
> [ commit message trimmed; see above ]

Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
before, but it's really not great how this affects the core page allocator
paths. Wouldn't it be possible to only release pages to page allocator when
accepted, and otherwise use some new per-zone variables together with the
bitmap to track how much exactly is where to accept? Then it could be hooked
in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT -
if we fail zone_watermark_fast() and there are unaccepted pages in the zone,
accept them and continue. With a static key to flip in case we eventually
accept everything. Because this is really similar scenario to the deferred
init and that one was solved in a way that adds minimal overhead.
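Roughly something like this in get_page_from_freelist(), modelled on the deferred-init hook (has_unaccepted_memory() and try_to_accept_memory() are hypothetical helpers, not part of this patch):

	if (!zone_watermark_fast(zone, order, mark,
				 ac->highest_zoneidx, alloc_flags,
				 gfp_mask)) {
		/*
		 * Hypothetical: if the zone still holds unaccepted
		 * memory, accept a chunk and retry this zone before
		 * falling back to reclaim or the next zone.
		 */
		if (has_unaccepted_memory() &&
		    try_to_accept_memory(zone, order))
			goto try_this_zone;

		/* ... existing deferred-init and reclaim handling ... */
	}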

> ---
>  include/linux/page-flags.h | 31 +++++++++++++
>  mm/internal.h              | 12 +++++
>  mm/memblock.c              |  9 ++++
>  mm/page_alloc.c            | 89 +++++++++++++++++++++++++++++++++++++-
>  4 files changed, 139 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index e66f7aa3191d..444ba8f4bfa0 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -942,6 +942,14 @@ static inline bool is_page_hwpoison(struct page *page)
>  #define PG_offline	0x00000100
>  #define PG_table	0x00000200
>  #define PG_guard	0x00000400
> +#define PG_unaccepted	0x00000800
> +
> +/*
> + * Page types allowed at page_expected_state()
> + *
> + * PageUnaccepted() will get cleared in post_alloc_hook().
> + */
> +#define PAGE_TYPES_EXPECTED	PG_unaccepted
>  
>  #define PageType(page, flag)						\
>  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -967,6 +975,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
>  	page->page_type |= PG_##lname;					\
>  }
>  
> +#define PAGE_TYPE_OPS_FALSE(uname)					\
> +static __always_inline int Page##uname(struct page *page)		\
> +{									\
> +	return false;							\
> +}									\
> +static __always_inline void __SetPage##uname(struct page *page)		\
> +{									\
> +}									\
> +static __always_inline void __ClearPage##uname(struct page *page)	\
> +{									\
> +}
> +
>  /*
>   * PageBuddy() indicates that the page is free and in the buddy system
>   * (see mm/page_alloc.c).
> @@ -997,6 +1017,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> +/*
> + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> + * be read or written. The page allocator must call accept_page() before
> + * touching the page or returning it to the caller.
> + */
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> +#else
> +PAGE_TYPE_OPS_FALSE(Unaccepted)
> +#endif
> +
>  extern void page_offline_freeze(void);
>  extern void page_offline_thaw(void);
>  extern void page_offline_begin(void);
> diff --git a/mm/internal.h b/mm/internal.h
> index c0f8fbe0445b..7c1a653edfc8 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -861,4 +861,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
>  
>  DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> +						    phys_addr_t end)
> +{
> +	return false;
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}
> +#endif
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index e4f03a6e8e56..a1f7f8b304d5 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>  		 */
>  		kmemleak_alloc_phys(found, size, 0, 0);
>  
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);
> +
>  	return found;
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e008a3df0485..279c2746aaa8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -122,6 +122,12 @@ typedef int __bitwise fpi_t;
>   */
>  #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
>  
> +/*
> + * Check if the page needs to be marked as PageUnaccepted().
> + * Used for the new pages added to the buddy allocator for the first time.
> + */
> +#define FPI_UNACCEPTED_SLOWPATH	((__force fpi_t)BIT(3))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -993,6 +999,35 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
>  			NULL) != NULL;
>  }
>  
> +/*
> + * Page acceptance can be very slow. Do not call under critical locks.
> + */
> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	int i;
> +
> +	if (!PageUnaccepted(page))
> +		return;
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +	__ClearPageUnaccepted(page);
> +
> +	/* Assert that there is no PageUnaccepted() on tail pages */
> +	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> +		for (i = 1; i < (1 << order); i++)
> +			VM_BUG_ON_PAGE(PageUnaccepted(page + i), page + i);
> +	}
> +}
> +
> +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	phys_addr_t end = start + (PAGE_SIZE << order);
> +
> +	return range_contains_unaccepted_memory(start, end);
> +}
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -1027,6 +1062,7 @@ static inline void __free_one_page(struct page *page,
>  	unsigned long combined_pfn;
>  	struct page *buddy;
>  	bool to_tail;
> +	bool page_needs_acceptance = false;
>  
>  	VM_BUG_ON(!zone_is_initialized(zone));
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1038,6 +1074,11 @@ static inline void __free_one_page(struct page *page,
>  	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>  
> +	if (PageUnaccepted(page)) {
> +		page_needs_acceptance = true;
> +		__ClearPageUnaccepted(page);
> +	}
> +
>  	while (order < MAX_ORDER - 1) {
>  		if (compaction_capture(capc, page, order, migratetype)) {
>  			__mod_zone_freepage_state(zone, -(1 << order),
> @@ -1072,6 +1113,13 @@ static inline void __free_one_page(struct page *page,
>  			clear_page_guard(zone, buddy, order, migratetype);
>  		else
>  			del_page_from_free_list(buddy, zone, order);
> +
> +		/* Mark page unaccepted if any of merged pages were unaccepted */
> +		if (PageUnaccepted(buddy)) {
> +			page_needs_acceptance = true;
> +			__ClearPageUnaccepted(buddy);
> +		}
> +
>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
> @@ -1081,6 +1129,23 @@ static inline void __free_one_page(struct page *page,
>  done_merging:
>  	set_buddy_order(page, order);
>  
> +	/*
> +	 * The page gets marked as PageUnaccepted() if any of merged-in pages
> +	 * is PageUnaccepted().
> +	 *
> +	 * New pages, just being added to buddy allocator, do not have
> +	 * PageUnaccepted() set. FPI_UNACCEPTED_SLOWPATH indicates that the
> +	 * page is new and page_contains_unaccepted() check is required to
> +	 * determinate if acceptance is required.
> +	 *
> +	 * Avoid calling page_contains_unaccepted() if it is known that the page
> +	 * needs acceptance. It can be costly.
> +	 */
> +	if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH))
> +		page_needs_acceptance = page_contains_unaccepted(page, order);
> +	if (page_needs_acceptance)
> +		__SetPageUnaccepted(page);
> +
>  	if (fpi_flags & FPI_TO_TAIL)
>  		to_tail = true;
>  	else if (is_shuffle_order(order))
> @@ -1164,7 +1229,13 @@ int split_free_page(struct page *free_page,
>  static inline bool page_expected_state(struct page *page,
>  					unsigned long check_flags)
>  {
> -	if (unlikely(atomic_read(&page->_mapcount) != -1))
> +	/*
> +	 * The page must not be mapped to userspace and must not have
> +	 * a PageType other than listed in PAGE_TYPES_EXPECTED.
> +	 *
> +	 * Note, bit cleared means the page type is set.
> +	 */
> +	if (unlikely((atomic_read(&page->_mapcount) | PAGE_TYPES_EXPECTED) != -1))
>  		return false;
>  
>  	if (unlikely((unsigned long)page->mapping |
> @@ -1669,7 +1740,9 @@ void __free_pages_core(struct page *page, unsigned int order)
>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>  	 * relevant for memory onlining.
>  	 */
> -	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +	__free_pages_ok(page, order,
> +			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON |
> +			FPI_UNACCEPTED_SLOWPATH);
>  }
>  
>  #ifdef CONFIG_NUMA
> @@ -1822,6 +1895,9 @@ static void __init deferred_free_range(unsigned long pfn,
>  		return;
>  	}
>  
> +	/* Accept chunks smaller than page-block upfront */
> +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> +
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if ((pfn & (pageblock_nr_pages - 1)) == 0)
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -2281,6 +2357,13 @@ static inline void expand(struct zone *zone, struct page *page,
>  		if (set_page_guard(zone, &page[size], high, migratetype))
>  			continue;
>  
> +		/*
> +		 * Transfer PageUnaccepted() to the newly split pages so
> +		 * they can be accepted after dropping the zone lock.
> +		 */
> +		if (PageUnaccepted(page))
> +			__SetPageUnaccepted(&page[size]);
> +
>  		add_to_free_list(&page[size], zone, high, migratetype);
>  		set_buddy_order(&page[size], high);
>  	}
> @@ -2411,6 +2494,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	 */
>  	kernel_unpoison_pages(page, 1 << order);
>  
> +	accept_page(page, order);
> +
>  	/*
>  	 * As memory initialization might be integrated into KASAN,
>  	 * KASAN unpoisoning and memory initializion code must be
David Hildenbrand Aug. 5, 2022, 12:09 p.m. UTC | #14
On 05.08.22 13:49, Vlastimil Babka wrote:
> On 6/14/22 14:02, Kirill A. Shutemov wrote:
>> [ commit message trimmed; see above ]
> 
> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
> before, but it's really not great how this affects the core page allocator
> paths. Wouldn't it be possible to only release pages to page allocator when
> accepted, and otherwise use some new per-zone variables together with the
> bitmap to track how much exactly is where to accept? Then it could be hooked
> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT -
> if we fail zone_watermark_fast() and there are unaccepted pages in the zone,
> accept them and continue. With a static key to flip in case we eventually
> accept everything. Because this is really similar scenario to the deferred
> init and that one was solved in a way that adds minimal overhead.

I kind of like just having the memory stats being correct (e.g., free
memory) and acceptance being an internal detail to be triggered when
allocating pages -- just like the arch_alloc_page() callback.

I'm sure we could optimize for the !unaccepted memory via static keys
also in this version with some checks at the right places if we find
this to hurt performance?
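For instance (sketch only, the key name is made up): a static key that stays enabled while any unaccepted memory remains would let the PageUnaccepted() checks in the hot paths collapse to a patched-out branch once everything has been accepted:

static DEFINE_STATIC_KEY_FALSE(mem_has_unaccepted);

static inline bool page_maybe_unaccepted(struct page *page)
{
	return static_branch_unlikely(&mem_has_unaccepted) &&
	       PageUnaccepted(page);
}

The key would be flipped on when unaccepted memory is first registered and turned off with static_branch_disable() once the last range has been accepted.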
Vlastimil Babka Aug. 5, 2022, 1:38 p.m. UTC | #15
On 8/5/22 14:09, David Hildenbrand wrote:
> On 05.08.22 13:49, Vlastimil Babka wrote:
>> On 6/14/22 14:02, Kirill A. Shutemov wrote:
>>> [ commit message trimmed; see above ]
>> 
>> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
>> before, but it's really not great how this affects the core page allocator
>> paths. Wouldn't it be possible to only release pages to page allocator when
>> accepted, and otherwise use some new per-zone variables together with the
>> bitmap to track how much exactly is where to accept? Then it could be hooked
>> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT -
>> if we fail zone_watermark_fast() and there are unaccepted pages in the zone,
>> accept them and continue. With a static key to flip in case we eventually
>> accept everything. Because this is really similar scenario to the deferred
>> init and that one was solved in a way that adds minimal overhead.
> 
> I kind of like just having the memory stats being correct (e.g., free
> memory) and acceptance being an internal detail to be triggered when
> allocating pages -- just like the arch_alloc_page() callback.

Hm, good point about the stats. Could be tweaked perhaps so it appears
correct on the outside, but might be tricky.

> I'm sure we could optimize for the !unaccepted memory via static keys
> also in this version with some checks at the right places if we find
> this to hurt performance?

It would be great if we would at least somehow hit the necessary code only
when dealing with a >=pageblock size block. The bitmap approach and
accepting everything smaller upfront actually seems rather compatible. Yet
in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size
while merging.

A list that sits besides the existing free_area, contains only >=pageblock
order sizes of unaccepted pages (no migratetype distinguished) and we tap
into it approximately before __rmqueue_fallback()? There would be some
trickery around releasing zone-lock for doing accept_memory(), but should be
manageable.
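A sketch of that side list (zone->unaccepted_pages and the helper name are hypothetical; accept_page() is the one added by this patch):

/* Called with zone->lock held, IRQs disabled, e.g. before __rmqueue_fallback(). */
static bool rmqueue_unaccepted(struct zone *zone)
{
	struct page *page;

	page = list_first_entry_or_null(&zone->unaccepted_pages,
					struct page, lru);
	if (!page)
		return false;

	list_del(&page->lru);

	/* Acceptance is slow: drop the zone lock while doing it. */
	spin_unlock(&zone->lock);
	accept_page(page, pageblock_order);
	spin_lock(&zone->lock);

	/* Hand the now-accepted pageblock to the normal free lists. */
	__free_one_page(page, page_to_pfn(page), zone, pageblock_order,
			get_pageblock_migratetype(page), FPI_TO_TAIL);
	return true;
}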
David Hildenbrand Aug. 5, 2022, 2:22 p.m. UTC | #16
On 05.08.22 15:38, Vlastimil Babka wrote:
> On 8/5/22 14:09, David Hildenbrand wrote:
>> On 05.08.22 13:49, Vlastimil Babka wrote:
>>> On 6/14/22 14:02, Kirill A. Shutemov wrote:
>>>> [ commit message trimmed; see above ]
>>>
>>> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
>>> before, but it's really not great how this affects the core page allocator
>>> paths. Wouldn't it be possible to only release pages to page allocator when
>>> accepted, and otherwise use some new per-zone variables together with the
>>> bitmap to track how much exactly is where to accept? Then it could be hooked
>>> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT -
>>> if we fail zone_watermark_fast() and there are unaccepted pages in the zone,
>>> accept them and continue. With a static key to flip in case we eventually
>>> accept everything. Because this is really similar scenario to the deferred
>>> init and that one was solved in a way that adds minimal overhead.
>>
>> I kind of like just having the memory stats being correct (e.g., free
>> memory) and acceptance being an internal detail to be triggered when
>> allocating pages -- just like the arch_alloc_page() callback.
> 
> Hm, good point about the stats. Could be tweaked perhaps so it appears
> correct on the outside, but might be tricky.
> 
>> I'm sure we could optimize for the !unaccepted memory via static keys
>> also in this version with some checks at the right places if we find
>> this to hurt performance?
> 
> It would be great if we would at least somehow hit the necessary code only
> when dealing with a >=pageblock size block. The bitmap approach and
> accepting everything smaller upfront actually seems rather compatible. Yet
> in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size
> while merging.
> 
> A list that sits besides the existing free_area, contains only >=pageblock
> order sizes of unaccepted pages (no migratetype distinguished) and we tap
> into it approximately before __rmqueue_fallback()? There would be some
> trickery around releasing zone-lock for doing accept_memory(), but should be
> manageable.
> 

Just curious, do we have a microbenchmark that is able to reveal the
impact of such code changes before we start worrying?
Dave Hansen Aug. 5, 2022, 2:41 p.m. UTC | #17
On 8/5/22 06:38, Vlastimil Babka wrote:
>> I'm sure we could optimize for the !unaccepted memory via static keys
>> also in this version with some checks at the right places if we find
>> this to hurt performance?
> It would be great if we would at least somehow hit the necessary code only
> when dealing with a >=pageblock size block. The bitmap approach and
> accepting everything smaller uprofront actually seems rather compatible. Yet
> in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size
> while merging.

Needing to check PageUnaccepted() during the merge is fallout from
moving the acceptance to post_alloc_hook().  I _think_ an earlier
version of this did page acceptance under the zone lock, closer to where
the page comes off the 2M/4M lists.

But, page acceptance is horribly slow, so I asked Kirill to move it out
from under the zone lock.  Doing it in post_alloc_hook() (after the zone
lock is dropped) makes a lot of sense since we do zeroing in there and
zeroing is also nice and slow.

But, post_alloc_hook() is long after the 2M page has been split and that
means that we have to deal with potentially unaccepted pages during merges.

I think there are three basic options:

1. This patch: Do acceptance after the zone lock is dropped and deal
   with mixed-acceptance merges
2. Do acceptance under the zone lock as pages come off the 2M/4M lists,
   but before the page is split.
3. Pull the page off the 2M/4M lists, drop the zone lock, accept it,
   then put it back.

I'm not sure any of those other options are better.
Dave Hansen Aug. 5, 2022, 2:53 p.m. UTC | #18
On 8/5/22 07:22, David Hildenbrand wrote:
>> A list that sits besides the existing free_area, contains only >=pageblock
>> order sizes of unaccepted pages (no migratetype distinguished) and we tap
>> into it approximately before __rmqueue_fallback()? There would be some
>> trickery around releasing zone-lock for doing accept_memory(), but should be
>> manageable.
>>
> Just curious, do we have a microbenchmark that is able to reveal the
> impact of such code changes before we start worrying?

Nope.  I went looking to see if I could find any impact.  I think Kirill
did too.  Too bad that effort didn't make it into the changelog yet.

The merging check at least is just checking a field in a cache-hot
'struct page'.  The common case is probably three instructions:

	load to a register
	check the bit
	jump if not set

It adds a wee bit of icache pressure, but it's also the kind of thing
that should be a piece of cake for the branch predictors.

That dynamic check could easily be wrapped by a static branch.  But,
that first requires more code to go dig in the nooks and crannies of the
page allocator to make sure *ALL* pages are accepted.
Vlastimil Babka Aug. 5, 2022, 6:17 p.m. UTC | #19
On 8/5/22 16:41, Dave Hansen wrote:
> On 8/5/22 06:38, Vlastimil Babka wrote:
>>> I'm sure we could optimize for the !unaccepted memory via static keys
>>> also in this version with some checks at the right places if we find
>>> this to hurt performance?
>> It would be great if we would at least somehow hit the necessary code only
>> when dealing with a >=pageblock size block. The bitmap approach and
>> accepting everything smaller upfront actually seems rather compatible. Yet
>> in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size
>> while merging.
> 
> Needing to check PageUnaccepted() during the merge is fallout from
> moving the acceptance to post_alloc_hook().  I _think_ an earlier
> version of this did page acceptance under the zone lock, closer to where
> the page comes off the 2M/4M lists.
> 
> But, page acceptance is horribly slow, so I asked Kirill to move it out
> from under the zone lock.  Doing it in post_alloc_hook() (after the zone
> lock is dropped) makes a lot of sense since we do zeroing in there and
> zeroing is also nice and slow.
> 
> But, post_alloc_hook() is long after the 2M page has been split and that
> means that we have to deal with potentially unaccepted pages during merges.
> 
> I think there are three basic options:
> 
> 1. This patch: Do acceptance after the zone lock is dropped and deal
>    with mixed-acceptance merges
> 2. Do acceptance under the zone lock as pages come off the 2M/4M lists,
>    but before the page is split.

Rather not, as acceptance can be slow and we shouldn't hog the zone lock
while doing it.

> 3. Pull the page off the 2M/4M lists, drop the zone lock, accept it,
>    then put it back.

Worth trying, IMHO. Perhaps easier to manage if the lists are distinct from
the normal ones, as I suggested.

> I'm not sure any of those other options are better.
Mel Gorman Aug. 10, 2022, 2:19 p.m. UTC | #20
On Fri, Aug 05, 2022 at 01:49:41PM +0200, Vlastimil Babka wrote:
> On 6/14/22 14:02, Kirill A. Shutemov wrote:
> > [ commit message trimmed; see above ]
> 
> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
> before, but it's really not great how this affects the core page allocator
> paths. Wouldn't it be possible to only release pages to page allocator when
> accepted, and otherwise use some new per-zone variables together with the
> bitmap to track how much exactly is where to accept? Then it could be hooked
> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT -
> if we fail zone_watermark_fast() and there are unaccepted pages in the zone,
> accept them and continue. With a static key to flip in case we eventually
> accept everything. Because this is really similar scenario to the deferred
> init and that one was solved in a way that adds minimal overhead.
> 
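A very rough sketch of that kind of hook (hypothetical; zone->unaccepted_pages
and the other names are invented here and are not from any posted patch):

static DEFINE_STATIC_KEY_TRUE(zones_with_unaccepted_pages);

static bool try_to_accept_memory(struct zone *zone)
{
	unsigned long flags;
	struct page *page;

	/* No overhead once everything has been accepted */
	if (!static_branch_unlikely(&zones_with_unaccepted_pages))
		return false;

	spin_lock_irqsave(&zone->lock, flags);
	page = list_first_entry_or_null(&zone->unaccepted_pages,
					struct page, lru);
	if (page)
		list_del(&page->lru);
	spin_unlock_irqrestore(&zone->lock, flags);

	if (!page)
		return false;

	accept_memory(page_to_phys(page),
		      page_to_phys(page) + (PAGE_SIZE << pageblock_order));
	__free_pages_core(page, pageblock_order);

	return true;
}

/*
 * get_page_from_freelist() would then retry the zone when
 * zone_watermark_fast() fails but try_to_accept_memory() succeeds,
 * similar to how deferred struct page init grows the zone on demand.
 */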

I think it might be more straightforward to always accept pages in
pageblock-sized units. Smaller ranges shouldn't matter because they have
been accepted in deferred_free_range. In expand, if PageUnaccepted is set on
a pageblock-sized page, take it off the list, drop the zone->lock leaving
IRQs disabled, accept the memory and reacquire the lock to split the page
into the required order.

IRQs being left disabled is unfortunate but, even if the acceptance is slow,
it's presumably not so slow as to cause major problems. This would reduce,
and probably eliminate, the need for the assert check in accept_page. It
might also simplify __free_one_page if it's known that a pageblock-sized
range of pages is either fully accepted or fully unaccepted.
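
Something along these lines, as a sketch only (the helper name is invented
and the exact placement around expand()/rmqueue() would need more care):

/*
 * The pageblock has already been taken off the free list by
 * __rmqueue_smallest(), so acceptance can be done by briefly dropping
 * zone->lock while IRQs stay disabled, before expand() splits the page
 * down to the requested order.
 */
static void accept_block_before_split(struct zone *zone, struct page *page)
{
	phys_addr_t start = page_to_phys(page);

	if (!PageUnaccepted(page))
		return;

	spin_unlock(&zone->lock);	/* IRQs remain disabled */
	accept_memory(start, start + (PAGE_SIZE << pageblock_order));
	spin_lock(&zone->lock);

	__ClearPageUnaccepted(page);
}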

Lastly, the default behaviour should probably be "accept all memory at
boot", with Kconfig allowing acceptance to be deferred or overridden on the
command line. There are at least two reasons for this. Even though this
is a virtual machine, there may still be latency-sensitive applications
running early in boot using pinned vcpu->pcpu and no memory overcommit.
The unpredictable performance of the application early in boot may be
unacceptable and unavoidable. It might take a long time but it could
eventually generate bug reports about "unpredictable performance early
in boot" that will be hard to track down unless accept_memory is observed
using perf at the right time. Even when that does happen, there will need
to be an option to turn it off if the unpredictable performance cannot
be tolerated. Second, any benchmarking done early in boot is likely to
be disrupted making the series a potential bisection magnet that masks a
performance bug elsewhere in the merge window.
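
If that direction is taken, the command-line override could be as simple as
the following (hypothetical sketch only; the parameter name, enum and
variable do not exist in the posted series):

/* Hypothetical boot-time override, e.g. in mm/page_alloc.c */
enum accept_memory_mode {
	ACCEPT_MEMORY_EAGER,	/* accept everything during boot (default) */
	ACCEPT_MEMORY_LAZY,	/* accept on first allocation */
};

static enum accept_memory_mode accept_memory_mode __ro_after_init =
	ACCEPT_MEMORY_EAGER;

static int __init parse_accept_memory(char *arg)
{
	if (!arg)
		return -EINVAL;
	if (!strcmp(arg, "eager"))
		accept_memory_mode = ACCEPT_MEMORY_EAGER;
	else if (!strcmp(arg, "lazy"))
		accept_memory_mode = ACCEPT_MEMORY_LAZY;
	else
		return -EINVAL;
	return 0;
}
early_param("accept_memory", parse_accept_memory);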
Dionna Amalie Glaze Aug. 15, 2022, 9:08 p.m. UTC | #21
>
>
> The unpredictable performance of the application early in boot may be
> unacceptable and unavoidable. It might take a long time but it could
> eventually generate bug reports about "unpredictable performance early
> in boot" that will be hard to track down unless accept_memory is observed
> using perf at the right time. Even when that does happen, there will need
> to be an option to turn it off if the unpredictable performance cannot
> be tolerated. Second, any benchmarking done early in boot is likely to
> be disrupted making the series a potential bisection magnet that masks a
> performance bug elsewhere in the merge window.

I'm doing some boot performance tests now before I run some workload
memory acceptance latency tests.
Note that this testing is on AMD SEV-SNP, so this patch series is applied
on top of the AMD guest patches v12, plus a patch Brijesh Singh wrote to
define __accept_memory for SEV-SNP:
https://github.com/AMDESE/linux/commit/ecae2582666d50ce1e633975d703d2f904183ece

I was getting pretty consistent boot times, only going up slightly as
the memory size increased, but at 256GB, the VM crashes because it
touches some unaccepted memory without first accepting it. 255GB boots
fine.

The stack trace is in mm/page_alloc.c. I've done a little
investigation, but I can't account for why there's a hard cutoff of
correctness at 256GB.

[    0.065563] RIP: 0010:memmap_init_range+0x108/0x173
[    0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48
8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
ff ff
[    0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX:
0000000000000404
[    0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000
[    0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040
[    0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
[    0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000
[    0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
[    0.075285] FS:  0000000000000000(0000) GS:ffffffffadd6c000(0000)
knlGS:0000000000000000
[    0.076365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0
[    0.078209] Call Trace:
[    0.078524]  <TASK>
[    0.078887]  ? free_area_init+0x5c1/0x66c
[    0.079417]  ? zone_sizes_init+0x52/0x6c
[    0.079934]  ? setup_arch+0xa55/0xb6d
[    0.080417]  ? start_kernel+0x64/0x65a
[    0.080897]  ? secondary_startup_64_no_verify+0xd6/0xdb
[    0.081620]  </TASK>

>
> --
> Mel Gorman
> SUSE Labs
Tom Lendacky Aug. 15, 2022, 10:02 p.m. UTC | #22
On 8/15/22 16:08, Dionna Amalie Glaze wrote:
>>
>>
>> The unpredictable performance of the application early in boot may be
>> unacceptable and unavoidable. It might take a long time but it could
>> eventually generate bug reports about "unpredictable performance early
>> in boot" that will be hard to track down unless accept_memory is observed
>> using perf at the right time. Even when that does happen, there will need
>> to be an option to turn it off if the unpredictable performance cannot
>> be tolerated. Second, any benchmarking done early in boot is likely to
>> be disrupted making the series a potential bisection magnet that masks a
>> performance bug elsewhere in the merge window.
> 
> I'm doing some boot performance tests now before I run some workload
> memory acceptance latency tests.
> Note that this testing is on AMD SEV-SNP, so this patch series on top
> of the AMD guest patches v12, plus a
> patch Brijesh Singh wrote to define __accept_memory for SEV-SNP
> https://github.com/AMDESE/linux/commit/ecae2582666d50ce1e633975d703d2f904183ece

Note that there is a bug in Brijesh's version of the patch and it will 
almost exclusively use the MSR protocol. Please try the version of the 
patch that I recently sent up based on the current unaccepted memory tree 
from Kirill.

https://lore.kernel.org/lkml/cover.1660579062.git.thomas.lendacky@amd.com/

Thanks,
Tom

> 
> I was getting pretty consistent boot times, only going up slightly as
> the memory size increased, but at 256GB, the VM crashes because it
> touches some unaccepted memory without first accepting it. 255GB boots
> fine.
> 
> The stack track is in mm/page_alloc.c. I've done a little
> investigation, but I can't account for why there's a hard cutoff of
> correctness at 256GB
> 
> [    0.065563] RIP: 0010:memmap_init_range+0x108/0x173
> [    0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
> 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48
> 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
> ff ff
> [    0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX:
> 0000000000000404
> [    0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000
> [    0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040
> [    0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
> [    0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000
> [    0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
> [    0.075285] FS:  0000000000000000(0000) GS:ffffffffadd6c000(0000)
> knlGS:0000000000000000
> [    0.076365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0
> [    0.078209] Call Trace:
> [    0.078524]  <TASK>
> [    0.078887]  ? free_area_init+0x5c1/0x66c
> [    0.079417]  ? zone_sizes_init+0x52/0x6c
> [    0.079934]  ? setup_arch+0xa55/0xb6d
> [    0.080417]  ? start_kernel+0x64/0x65a
> [    0.080897]  ? secondary_startup_64_no_verify+0xd6/0xdb
> [    0.081620]  </TASK>
> 
>>
>> --
>> Mel Gorman
>> SUSE Labs
> 
> 
>
Dionna Amalie Glaze Aug. 29, 2022, 4:02 p.m. UTC | #23
> > The stack track is in mm/page_alloc.c. I've done a little
> > investigation, but I can't account for why there's a hard cutoff of
> > correctness at 256GB
> >
> > [    0.065563] RIP: 0010:memmap_init_range+0x108/0x173
> > [    0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
> > 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48
> > 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
> > ff ff
> > [    0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX:
> > 0000000000000404
> > [    0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000
> > [    0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040
> > [    0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
> > [    0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000
> > [    0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
> > [    0.075285] FS:  0000000000000000(0000) GS:ffffffffadd6c000(0000)
> > knlGS:0000000000000000
> > [    0.076365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0
> > [    0.078209] Call Trace:
> > [    0.078524]  <TASK>
> > [    0.078887]  ? free_area_init+0x5c1/0x66c
> > [    0.079417]  ? zone_sizes_init+0x52/0x6c
> > [    0.079934]  ? setup_arch+0xa55/0xb6d
> > [    0.080417]  ? start_kernel+0x64/0x65a
> > [    0.080897]  ? secondary_startup_64_no_verify+0xd6/0xdb
> > [    0.081620]  </TASK>
>
> Note that there is a bug in Brijesh's version of the patch and it will
> almost exclusively use the MSR protocol. Please try the version of the
> patch that I recently sent up based on the current unaccepted memory tree
> from Kirill.
>

I've now tested this patch set with Tom's new patch set, and it appears
that the problem with 256GB is more likely due to this unaccepted memory
patch set than to something AMD-specific.

Kirill, do you see any problems with 256GB on TDX? It seems there is some
unaccepted memory getting touched in memmap_init_range when the VM's
memory size is at least 256GB.
Dave Hansen Aug. 29, 2022, 4:19 p.m. UTC | #24
On 8/29/22 09:02, Dionna Amalie Glaze wrote:
>>> The stack track is in mm/page_alloc.c. I've done a little
>>> investigation, but I can't account for why there's a hard cutoff of
>>> correctness at 256GB
>>>
>>> [    0.065563] RIP: 0010:memmap_init_range+0x108/0x173
>>> [    0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
>>> 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48
>>> 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
>>> ff ff
>>> [    0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX:
>>> 0000000000000404
>>> [    0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000
>>> [    0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040
>>> [    0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
>>> [    0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000
>>> [    0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
>>> [    0.075285] FS:  0000000000000000(0000) GS:ffffffffadd6c000(0000)
>>> knlGS:0000000000000000
>>> [    0.076365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0
>>> [    0.078209] Call Trace:
>>> [    0.078524]  <TASK>
>>> [    0.078887]  ? free_area_init+0x5c1/0x66c
>>> [    0.079417]  ? zone_sizes_init+0x52/0x6c
>>> [    0.079934]  ? setup_arch+0xa55/0xb6d
>>> [    0.080417]  ? start_kernel+0x64/0x65a
>>> [    0.080897]  ? secondary_startup_64_no_verify+0xd6/0xdb
>>> [    0.081620]  </TASK>
>> Note that there is a bug in Brijesh's version of the patch and it will
>> almost exclusively use the MSR protocol. Please try the version of the
>> patch that I recently sent up based on the current unaccepted memory tree
>> from Kirill.
>>
> I've now tested this patch set with Tom's new patch set, and it
> appears to be that the problem with 256GB is more likely to be due to
> this unaccepted memory patch set rather than something AMD-specific.
> 
> Kirill, do you see any problems with 256GB on TDX? It seems there is
> some unaccepted memory getting touched in memmap_init_range when the
> VM's memory size is at least 256GB

It really helps this kind of stuff if you can post the *actual* error.
I assume this was a page fault, so there should have been some other
stuff before the RIP:...

Another thing that's really nice is to do the disassembly of the "Code:"
or share disassembly of memmap_init_range.  Even nicer would be to give
an faddr2line of the RIP value and track down which C code was actually
at fault.

It's *possible* to look into these things from what you posted, but it's
just slow.  I'm sure Kirill will appreciate the help.
Dionna Amalie Glaze Sept. 6, 2022, 5:50 p.m. UTC | #25
>
> It really helps this kind of stuff if you can post the *actual* error.
> I assume this was a page fault, so there should have been some other
> stuff before the RIP:...
>

I posted the error on August 15th. I was bumping the thread in my last
post since I had confirmed with Tom Lendacky that it wasn't AMD's patches
at fault.
Here's a new dump below that matches the disassembly:

[    0.043137] Faking a node at [mem 0x0000000000000000-0x000000403fffffff]
[    0.044018] NODE_DATA(0) allocated [mem 0x403fffc000-0x403fffffff]
[    0.044922] Zone ranges:
[    0.045250]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.046039]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.046828]   Normal   [mem 0x0000000100000000-0x000000403fffffff]
[    0.047657] Movable zone start for each node
[    0.048201] Early memory node ranges
[    0.048674]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
[    0.049474]   node   0: [mem 0x0000000000100000-0x000000000080cfff]
[    0.050274]   node   0: [mem 0x000000000080f000-0x00000000beceefff]
[    0.051074]   node   0: [mem 0x00000000befff000-0x00000000bfbb0fff]
[    0.051874]   node   0: [mem 0x00000000bfbb2000-0x00000000bffdbfff]
[    0.052674]   node   0: [mem 0x0000000100000000-0x000000403fffffff]
[    0.053530] Initmem setup node 0 [mem 0x0000000000001000-0x000000403fffffff]
PANIC: Unsupported exit-code 0x404 in early #VC exception (IP:
0xfffffffface0cdd0)
[    0.056667] CPU: 0 PID: 0 Comm: swapper Not tainted
5.17.0-rc6-173762-gffb12b02c6d7-dirty #1
[    0.057744] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[    0.058920] RIP: 0010:memmap_init_range+0x11d/0x188
[    0.059686] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d a4 1e 65 ff 48
8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
ff ff
[    0.062121] RSP: 0000:ffffffffac603dc0 EFLAGS: 00010082 ORIG_RAX:
0000000000000404
[    0.063087] RAX: ffffda1ac0000048 RBX: 0000000000000001 RCX: 0000000000000000
[    0.063998] RDX: 0300000000000000 RSI: 0000000000000000 RDI: ffffda1ac000004
[    0.064944] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
[    0.065873] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[    0.066782] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
[    0.067695] FS:  0000000000000000(0000) GS:ffffffffacd88000(0000)
knlGS:0000000000000000
[    0.068727] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.069488] CR2: ffffda1ac0000074 CR3: 00080020b680c000 CR4: 00000000000606b0
[    0.070397] Call Trace:
[    0.070710]  <TASK>
[    0.070976]  ? free_area_init+0x724/0x7d4
[    0.071486]  ? zone_sizes_init+0x52/0x6c
[    0.071986]  ? setup_arch+0xa55/0xb77
[    0.072453]  ? start_kernel+0x64/0x65f
[    0.072931]  ? secondary_startup_64_no_verify+0xd6/0xdb
[    0.073598]  </TASK>

Note this is a crash in SEV-SNP, but I assume we'd get the same #VE in TDX.

> Another thing that's really nice is to do the disassembly of the "Code:"
> or share disassembly of memmap_init_range.

0000000000000172 <memmap_init_range>:
 172:   41 56                   push   %r14
 174:   89 f0                   mov    %esi,%eax
 176:   45 89 ce                mov    %r9d,%r14d
 179:   41 55                   push   %r13
 17b:   4c 8d 2c 39             lea    (%rcx,%rdi,1),%r13
 17f:   41 54                   push   %r12
 181:   49 89 d4                mov    %rdx,%r12
 184:   49 8d 55 ff             lea    -0x1(%r13),%rdx
 188:   48 3b 15 00 00 00 00    cmp    0x0(%rip),%rdx        # 18f
<memmap_init_range+0x1d>
 18f:   55                      push   %rbp
 190:   53                      push   %rbx
 191:   48 89 cb                mov    %rcx,%rbx
 194:   76 07                   jbe    19d <memmap_init_range+0x2b>
 196:   48 89 15 00 00 00 00    mov    %rdx,0x0(%rip)        # 19d
<memmap_init_range+0x2b>
 19d:   4c 89 e5                mov    %r12,%rbp
 1a0:   ba 03 00 00 00          mov    $0x3,%edx
 1a5:   48 c1 e0 3a             shl    $0x3a,%rax
 1a9:   48 c1 e5 38             shl    $0x38,%rbp
 1ad:   48 c1 e2 38             shl    $0x38,%rdx
 1b1:   48 21 d5                and    %rdx,%rbp
 1b4:   48 09 c5                or     %rax,%rbp
 1b7:   49 39 dd                cmp    %rbx,%r13
 1ba:   0f 86 31 01 00 00       jbe    2f1 <memmap_init_range+0x17f>
 1c0:   45 85 f6                test   %r14d,%r14d
 1c3:   0f 85 b4 00 00 00       jne    27d <memmap_init_range+0x10b>
 1c9:   49 83 fc 03             cmp    $0x3,%r12
 1cd:   0f 94 c1                sete   %cl
 1d0:   22 0d 00 00 00 00       and    0x0(%rip),%cl        # 1d6
<memmap_init_range+0x64>
 1d6:   0f 84 a1 00 00 00       je     27d <memmap_init_range+0x10b>
 1dc:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 1e3
<memmap_init_range+0x71>
 1e3:   48 85 d2                test   %rdx,%rdx
 1e6:   74 10                   je     1f8 <memmap_init_range+0x86>
 1e8:   48 8b 42 08             mov    0x8(%rdx),%rax
 1ec:   48 03 02                add    (%rdx),%rax
 1ef:   48 c1 e8 0c             shr    $0xc,%rax
 1f3:   48 39 d8                cmp    %rbx,%rax
 1f6:   77 55                   ja     24d <memmap_init_range+0xdb>
 1f8:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 1ff
<memmap_init_range+0x8d>
 1ff:   4c 6b 05 00 00 00 00    imul   $0x18,0x0(%rip),%r8        #
207 <memmap_init_range+0x95>
 206:   18
 207:   31 f6                   xor    %esi,%esi
 209:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # 210
<memmap_init_range+0x9e>
 210:   49 01 c0                add    %rax,%r8
 213:   48 89 c7                mov    %rax,%rdi
 216:   4c 39 c0                cmp    %r8,%rax
 219:   73 26                   jae    241 <memmap_init_range+0xcf>
 21b:   48 8b 57 08             mov    0x8(%rdi),%rdx
 21f:   48 03 17                add    (%rdi),%rdx
 222:   48 83 c0 18             add    $0x18,%rax
 226:   48 c1 ea 0c             shr    $0xc,%rdx
 22a:   48 39 da                cmp    %rbx,%rdx
 22d:   76 0e                   jbe    23d <memmap_init_range+0xcb>
 22f:   40 84 f6                test   %sil,%sil
 232:   74 19                   je     24d <memmap_init_range+0xdb>
 234:   48 89 3d 00 00 00 00    mov    %rdi,0x0(%rip)        # 23b
<memmap_init_range+0xc9>
 23b:   eb 10                   jmp    24d <memmap_init_range+0xdb>
 23d:   89 ce                   mov    %ecx,%esi
 23f:   eb d2                   jmp    213 <memmap_init_range+0xa1>
 241:   40 84 f6                test   %sil,%sil
 244:   74 07                   je     24d <memmap_init_range+0xdb>
 246:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # 24d
<memmap_init_range+0xdb>
 24d:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 254
<memmap_init_range+0xe2>
 254:   48 8b 02                mov    (%rdx),%rax
 257:   48 8d 88 ff 0f 00 00    lea    0xfff(%rax),%rcx
 25e:   48 c1 e9 0c             shr    $0xc,%rcx
 262:   48 39 d9                cmp    %rbx,%rcx
 265:   77 16                   ja     27d <memmap_init_range+0x10b>
 267:   f6 42 10 02             testb  $0x2,0x10(%rdx)
 26b:   74 10                   je     27d <memmap_init_range+0x10b>
 26d:   48 03 42 08             add    0x8(%rdx),%rax
 271:   48 c1 e8 0c             shr    $0xc,%rax
 275:   48 89 c3                mov    %rax,%rbx
 278:   e9 3a ff ff ff          jmp    1b7 <memmap_init_range+0x45>
 27d:   48 89 df                mov    %rbx,%rdi
 280:   48 c1 e7 06             shl    $0x6,%rdi
 284:   48 03 3d 00 00 00 00    add    0x0(%rip),%rdi        # 28b
<memmap_init_range+0x119>
 28b:   48 8d 47 08             lea    0x8(%rdi),%rax
 28f:   c7 47 34 01 00 00 00    movl   $0x1,0x34(%rdi)    # Here's
where the crash RIP is.
 296:   48 c7 47 38 00 00 00    movq   $0x0,0x38(%rdi)
 29d:   00
 29e:   c7 47 30 ff ff ff ff    movl   $0xffffffff,0x30(%rdi)
 2a5:   48 c7 47 28 00 00 00    movq   $0x0,0x28(%rdi)
 2ac:   00
 2ad:   48 c7 47 20 00 00 00    movq   $0x0,0x20(%rdi)
 2b4:   00
 2b5:   48 c7 47 18 00 00 00    movq   $0x0,0x18(%rdi)
 2bc:   00
 2bd:   48 89 2f                mov    %rbp,(%rdi)
 2c0:   48 89 47 08             mov    %rax,0x8(%rdi)
 2c4:   48 89 47 10             mov    %rax,0x10(%rdi)
 2c8:   41 83 fe 01             cmp    $0x1,%r14d
 2cc:   75 05                   jne    2d3 <memmap_init_range+0x161>
 2ce:   48 0f ba 2f 0c          btsq   $0xc,(%rdi)
 2d3:   f7 c3 ff 01 00 00       test   $0x1ff,%ebx
 2d9:   75 0e                   jne    2e9 <memmap_init_range+0x177>
 2db:   8b 74 24 38             mov    0x38(%rsp),%esi
 2df:   e8 00 00 00 00          call   2e4 <memmap_init_range+0x172>
 2e4:   e8 00 00 00 00          call   2e9 <memmap_init_range+0x177>
 2e9:   48 ff c3                inc    %rbx
 2ec:   e9 c6 fe ff ff          jmp    1b7 <memmap_init_range+0x45>
 2f1:   5b                      pop    %rbx
 2f2:   5d                      pop    %rbp
 2f3:   41 5c                   pop    %r12
 2f5:   41 5d                   pop    %r13
 2f7:   41 5e                   pop    %r14
 2f9:   c3                      ret

> Even nicer would be to give
> an faddr2line of the RIP value and track down which C code was actually
> at fault.

arch_atomic_set
arch/x86/include/asm/atomic.h:41

of INIT_LIST_HEAD in __init_single_page, called from memmap_init_range.
--
-Dionna Glaze, PhD (she/her)
Mike Rapoport Sept. 8, 2022, 12:11 p.m. UTC | #26
On Tue, Sep 06, 2022 at 10:50:42AM -0700, Dionna Amalie Glaze wrote:
> >
> > It really helps this kind of stuff if you can post the *actual* error.
> > I assume this was a page fault, so there should have been some other
> > stuff before the RIP:...
> >
> 
> I posted the error on August 15th.  I was bumping in my last post
> since I confirmed with Tom Lendacky that it wasn't AMD's patches at
> fault.
> Here's a new dump below that matches the disassembly:
> 
> [    0.043137] Faking a node at [mem 0x0000000000000000-0x000000403fffffff]
> [    0.044018] NODE_DATA(0) allocated [mem 0x403fffc000-0x403fffffff]
> [    0.044922] Zone ranges:
> [    0.045250]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.046039]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [    0.046828]   Normal   [mem 0x0000000100000000-0x000000403fffffff]
> [    0.047657] Movable zone start for each node
> [    0.048201] Early memory node ranges
> [    0.048674]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
> [    0.049474]   node   0: [mem 0x0000000000100000-0x000000000080cfff
> [    0.050274]   node   0: [mem 0x000000000080f000-0x00000000beceefff]
> [    0.051074]   node   0: [mem 0x00000000befff000-0x00000000bfbb0fff]
> [    0.051874]   node   0: [mem 0x00000000bfbb2000-0x00000000bffdbfff]
> [    0.052674]   node   0: [mem 0x0000000100000000-0x000000403fffffff]
> [    0.053530] Initmem setup node 0 [mem 0x0000000000001000-0x000000403fffffff]
> PANIC: Unsupported exit-code 0x404 in early #VC exception (IP:
> 0xfffffffface0cdd0)
> [    0.056667] CPU: 0 PID: 0 Comm: swapper Not tainted
> 5.17.0-rc6-173762-gffb12b02c6d7-dirty #1
> [    0.057744] Hardware name: Google Google Compute Engine/Google
> Compute Engine, BIOS Google 01/01/2011
> [    0.058920] RIP: 0010:memmap_init_range+0x11d/0x188
> [    0.059686] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c
> 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d a4 1e 65 ff 48
> 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff
> ff ff
> [    0.062121] RSP: 0000:ffffffffac603dc0 EFLAGS: 00010082 ORIG_RAX:
> 0000000000000404
> [    0.063087] RAX: ffffda1ac0000048 RBX: 0000000000000001 RCX: 0000000000000000
> [    0.063998] RDX: 0300000000000000 RSI: 0000000000000000 RDI: ffffda1ac000004
> [    0.064944] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000
> [    0.065873] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [    0.066782] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
> [    0.067695] FS:  0000000000000000(0000) GS:ffffffffacd88000(0000)
> knlGS:0000000000000000
> [    0.068727] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.069488] CR2: ffffda1ac0000074 CR3: 00080020b680c000 CR4: 00000000000606b0
> [    0.070397] Call Trace:
> [    0.070710]  <TASK>
> [    0.070976]  ? free_area_init+0x724/0x7d4
> [    0.071486]  ? zone_sizes_init+0x52/0x6c
> [    0.071986]  ? setup_arch+0xa55/0xb77
> [    0.072453]  ? start_kernel+0x64/0x65f
> [    0.072931]  ? secondary_startup_64_no_verify+0xd6/0xdb
> [    0.073598]  </TASK>
> 
> Note this is a crash in SEV-SNP, but I assume we'd get the same #VE in TDX.
> 
> > Another thing that's really nice is to do the disassembly of the "Code:"
> > or share disassembly of memmap_init_range.
> 
> 0000000000000172 <memmap_init_range>:
>  172:   41 56                   push   %r14
>  174:   89 f0                   mov    %esi,%eax
>  176:   45 89 ce                mov    %r9d,%r14d
>  179:   41 55                   push   %r13
>  17b:   4c 8d 2c 39             lea    (%rcx,%rdi,1),%r13
>  17f:   41 54                   push   %r12
>  181:   49 89 d4                mov    %rdx,%r12
>  184:   49 8d 55 ff             lea    -0x1(%r13),%rdx
>  188:   48 3b 15 00 00 00 00    cmp    0x0(%rip),%rdx        # 18f
> <memmap_init_range+0x1d>
>  18f:   55                      push   %rbp
>  190:   53                      push   %rbx
>  191:   48 89 cb                mov    %rcx,%rbx
>  194:   76 07                   jbe    19d <memmap_init_range+0x2b>
>  196:   48 89 15 00 00 00 00    mov    %rdx,0x0(%rip)        # 19d
> <memmap_init_range+0x2b>
>  19d:   4c 89 e5                mov    %r12,%rbp
>  1a0:   ba 03 00 00 00          mov    $0x3,%edx
>  1a5:   48 c1 e0 3a             shl    $0x3a,%rax
>  1a9:   48 c1 e5 38             shl    $0x38,%rbp
>  1ad:   48 c1 e2 38             shl    $0x38,%rdx
>  1b1:   48 21 d5                and    %rdx,%rbp
>  1b4:   48 09 c5                or     %rax,%rbp
>  1b7:   49 39 dd                cmp    %rbx,%r13
>  1ba:   0f 86 31 01 00 00       jbe    2f1 <memmap_init_range+0x17f>
>  1c0:   45 85 f6                test   %r14d,%r14d
>  1c3:   0f 85 b4 00 00 00       jne    27d <memmap_init_range+0x10b>
>  1c9:   49 83 fc 03             cmp    $0x3,%r12
>  1cd:   0f 94 c1                sete   %cl
>  1d0:   22 0d 00 00 00 00       and    0x0(%rip),%cl        # 1d6
> <memmap_init_range+0x64>
>  1d6:   0f 84 a1 00 00 00       je     27d <memmap_init_range+0x10b>
>  1dc:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 1e3
> <memmap_init_range+0x71>
>  1e3:   48 85 d2                test   %rdx,%rdx
>  1e6:   74 10                   je     1f8 <memmap_init_range+0x86>
>  1e8:   48 8b 42 08             mov    0x8(%rdx),%rax
>  1ec:   48 03 02                add    (%rdx),%rax
>  1ef:   48 c1 e8 0c             shr    $0xc,%rax
>  1f3:   48 39 d8                cmp    %rbx,%rax
>  1f6:   77 55                   ja     24d <memmap_init_range+0xdb>
>  1f8:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 1ff
> <memmap_init_range+0x8d>
>  1ff:   4c 6b 05 00 00 00 00    imul   $0x18,0x0(%rip),%r8        #
> 207 <memmap_init_range+0x95>
>  206:   18
>  207:   31 f6                   xor    %esi,%esi
>  209:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # 210
> <memmap_init_range+0x9e>
>  210:   49 01 c0                add    %rax,%r8
>  213:   48 89 c7                mov    %rax,%rdi
>  216:   4c 39 c0                cmp    %r8,%rax
>  219:   73 26                   jae    241 <memmap_init_range+0xcf>
>  21b:   48 8b 57 08             mov    0x8(%rdi),%rdx
>  21f:   48 03 17                add    (%rdi),%rdx
>  222:   48 83 c0 18             add    $0x18,%rax
>  226:   48 c1 ea 0c             shr    $0xc,%rdx
>  22a:   48 39 da                cmp    %rbx,%rdx
>  22d:   76 0e                   jbe    23d <memmap_init_range+0xcb>
>  22f:   40 84 f6                test   %sil,%sil
>  232:   74 19                   je     24d <memmap_init_range+0xdb>
>  234:   48 89 3d 00 00 00 00    mov    %rdi,0x0(%rip)        # 23b
> <memmap_init_range+0xc9>
>  23b:   eb 10                   jmp    24d <memmap_init_range+0xdb>
>  23d:   89 ce                   mov    %ecx,%esi
>  23f:   eb d2                   jmp    213 <memmap_init_range+0xa1>
>  241:   40 84 f6                test   %sil,%sil
>  244:   74 07                   je     24d <memmap_init_range+0xdb>
>  246:   48 89 05 00 00 00 00    mov    %rax,0x0(%rip)        # 24d
> <memmap_init_range+0xdb>
>  24d:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 254
> <memmap_init_range+0xe2>
>  254:   48 8b 02                mov    (%rdx),%rax
>  257:   48 8d 88 ff 0f 00 00    lea    0xfff(%rax),%rcx
>  25e:   48 c1 e9 0c             shr    $0xc,%rcx
>  262:   48 39 d9                cmp    %rbx,%rcx
>  265:   77 16                   ja     27d <memmap_init_range+0x10b>
>  267:   f6 42 10 02             testb  $0x2,0x10(%rdx)
>  26b:   74 10                   je     27d <memmap_init_range+0x10b>
>  26d:   48 03 42 08             add    0x8(%rdx),%rax
>  271:   48 c1 e8 0c             shr    $0xc,%rax
>  275:   48 89 c3                mov    %rax,%rbx
>  278:   e9 3a ff ff ff          jmp    1b7 <memmap_init_range+0x45>
>  27d:   48 89 df                mov    %rbx,%rdi
>  280:   48 c1 e7 06             shl    $0x6,%rdi
>  284:   48 03 3d 00 00 00 00    add    0x0(%rip),%rdi        # 28b
> <memmap_init_range+0x119>
>  28b:   48 8d 47 08             lea    0x8(%rdi),%rax
>  28f:   c7 47 34 01 00 00 00    movl   $0x1,0x34(%rdi)    # Here's
> where the crash RIP is.
>  296:   48 c7 47 38 00 00 00    movq   $0x0,0x38(%rdi)
>  29d:   00
>  29e:   c7 47 30 ff ff ff ff    movl   $0xffffffff,0x30(%rdi)
>  2a5:   48 c7 47 28 00 00 00    movq   $0x0,0x28(%rdi)
>  2ac:   00
>  2ad:   48 c7 47 20 00 00 00    movq   $0x0,0x20(%rdi)
>  2b4:   00
>  2b5:   48 c7 47 18 00 00 00    movq   $0x0,0x18(%rdi)
>  2bc:   00
>  2bd:   48 89 2f                mov    %rbp,(%rdi)
>  2c0:   48 89 47 08             mov    %rax,0x8(%rdi)
>  2c4:   48 89 47 10             mov    %rax,0x10(%rdi)
>  2c8:   41 83 fe 01             cmp    $0x1,%r14d
>  2cc:   75 05                   jne    2d3 <memmap_init_range+0x161>
>  2ce:   48 0f ba 2f 0c          btsq   $0xc,(%rdi)
>  2d3:   f7 c3 ff 01 00 00       test   $0x1ff,%ebx
>  2d9:   75 0e                   jne    2e9 <memmap_init_range+0x177>
>  2db:   8b 74 24 38             mov    0x38(%rsp),%esi
>  2df:   e8 00 00 00 00          call   2e4 <memmap_init_range+0x172>
>  2e4:   e8 00 00 00 00          call   2e9 <memmap_init_range+0x177>
>  2e9:   48 ff c3                inc    %rbx
>  2ec:   e9 c6 fe ff ff          jmp    1b7 <memmap_init_range+0x45>
>  2f1:   5b                      pop    %rbx
>  2f2:   5d                      pop    %rbp
>  2f3:   41 5c                   pop    %r12
>  2f5:   41 5d                   pop    %r13
>  2f7:   41 5e                   pop    %r14
>  2f9:   c3                      ret
> 
> > Even nicer would be to give
> > an faddr2line of the RIP value and track down which C code was actually
> > at fault.
> 
> arch_atomic_set
> arch/x86/include/asm/atomic.h:41
> 
> of INIT_LIST_HEAD in __init_single_page, called from memmap_init_range.

Looks like the first access to the memory map fails, although I think
it's not in INIT_LIST_HEAD() but rather in init_page_count().

I'd start with making sure that page_alloc::memmap_alloc() actually returns
accepted memory. If you build the kernel with CONFIG_DEBUG_VM=y, the memory
map will be poisoned in this function, so my guess is it'd crash there.

> --
> -Dionna Glaze, PhD (she/her)
Dionna Amalie Glaze Sept. 8, 2022, 4:23 p.m. UTC | #27
>
> Looks like the first access to the memory map fails, although I think
> it's not in INIT_LIST_HEAD() but rather in init_page_count().
>
> I'd start with making sure that page_alloc::memmap_alloc() actually returns
> accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map
> will poisoned in this function, so my guess is it'd crash there.
>

That's a wonderful hint, thank you! I did not run this test with
CONFIG_DEBUG_VM set, but do you think it's possible the problem could
still be here?

> --
> Sincerely yours,
> Mike.
Mike Rapoport Sept. 8, 2022, 7:28 p.m. UTC | #28
On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> >
> > Looks like the first access to the memory map fails, although I think
> > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> >
> > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map
> > will poisoned in this function, so my guess is it'd crash there.
> >
> 
> That's a wonderful hint, thank you! I did not run this test
> CONFIG_DEBUG_VM set, but you think it's possible it could still be
> here?

It depends on how you configured your kernel. Say, defconfig does not set
it.
Tom Lendacky Sept. 22, 2022, 2:31 p.m. UTC | #29
On 9/8/22 14:28, Mike Rapoport wrote:
> On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
>>>
>>> Looks like the first access to the memory map fails, although I think
>>> it's not in INIT_LIST_HEAD() but rather in init_page_count().
>>>
>>> I'd start with making sure that page_alloc::memmap_alloc() actually returns
>>> accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map
>>> will poisoned in this function, so my guess is it'd crash there.
>>>
>>
>> That's a wonderful hint, thank you! I did not run this test
>> CONFIG_DEBUG_VM set, but you think it's possible it could still be
>> here?
> 
> It depends on how you configured your kernel. Say, defconfig does not set
> it.
> 

I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP 
and fails in memmap_init_range() when attempting to add the first PFN. It 
looks like the underlying page that is backing the vmemmap has not been 
accepted (I receive a #VC 0x404 => page not validated).

Kirill, is this a path that you've looked at? It would appear that 
somewhere in the vmemmap_populate_hugepages() path, some memory acceptance 
needs to be done for the pages that are used to back vmemmap. I'm not very 
familiar with this code, so I'm not sure why everything works for a guest 
with 255GB of memory, but then fails for a guest with 256GB of memory.

Thanks,
Tom
Kirill A. Shutemov Sept. 24, 2022, 1:03 a.m. UTC | #30
On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
> On 9/8/22 14:28, Mike Rapoport wrote:
> > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> > > > 
> > > > Looks like the first access to the memory map fails, although I think
> > > > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> > > > 
> > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > > > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map
> > > > will poisoned in this function, so my guess is it'd crash there.
> > > > 
> > > 
> > > That's a wonderful hint, thank you! I did not run this test
> > > CONFIG_DEBUG_VM set, but you think it's possible it could still be
> > > here?
> > 
> > It depends on how you configured your kernel. Say, defconfig does not set
> > it.
> > 
> 
> I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
> and fails in memmap_init_range() when attempting to add the first PFN. It
> looks like the underlying page that is backing the vmemmap has not been
> accepted (I receive a #VC 0x404 => page not validated).
> 
> Kirill, is this a path that you've looked at? It would appear that somewhere
> in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
> done for the pages that are used to back vmemmap. I'm not very familiar with
> this code, so I'm not sure why everything works for a guest with 255GB of
> memory, but then fails for a guest with 256GB of memory.

Hm. I don't have a machine that large at hand at the moment, and I have not
looked at the code path before.

I will try to look into the issue.
Mike Rapoport Sept. 24, 2022, 9:36 a.m. UTC | #31
On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
> > On 9/8/22 14:28, Mike Rapoport wrote:
> > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> > > > > 
> > > > > Looks like the first access to the memory map fails, although I think
> > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> > > > > 
> > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > > > > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map
> > > > > will poisoned in this function, so my guess is it'd crash there.
> > > > > 
> > > > 
> > > > That's a wonderful hint, thank you! I did not run this test
> > > > CONFIG_DEBUG_VM set, but you think it's possible it could still be
> > > > here?
> > > 
> > > It depends on how you configured your kernel. Say, defconfig does not set
> > > it.
> > > 
> > 
> > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
> > and fails in memmap_init_range() when attempting to add the first PFN. It
> > looks like the underlying page that is backing the vmemmap has not been
> > accepted (I receive a #VC 0x404 => page not validated).
> > 
> > Kirill, is this a path that you've looked at? It would appear that somewhere
> > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
> > done for the pages that are used to back vmemmap. I'm not very familiar with
> > this code, so I'm not sure why everything works for a guest with 255GB of
> > memory, but then fails for a guest with 256GB of memory.
> 
> Hm. I don't have machine that large at hands at the moment. And I have not
> looked at the codepath before.
> 
> I will try to look into the issue.

I'd add some printks to verify we actually try to accept the allocated
memory. E.g. something like the patch below:

diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 9ec2304272dc..8f00639facc4 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -22,6 +22,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
 	if (!boot_params.unaccepted_memory)
 		return;
 
+	if (system_state == SYSTEM_BOOTING)
+		pr_info("%s: start: %pa end: %pa %pS\n", __func__, &start, &end, (void *)_RET_IP_);
+
 	bitmap = __va(boot_params.unaccepted_memory);
 	range_start = start / PMD_SIZE;
 
diff --git a/mm/memblock.c b/mm/memblock.c
index a1f7f8b304d5..029dd520102d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1535,7 +1535,7 @@ void * __init memblock_alloc_exact_nid_raw(
 			phys_addr_t min_addr, phys_addr_t max_addr,
 			int nid)
 {
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n",
+	pr_info("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n",
 		     __func__, (u64)size, (u64)align, nid, &min_addr,
 		     &max_addr, (void *)_RET_IP_);
 
@@ -1567,7 +1567,7 @@ void * __init memblock_alloc_try_nid_raw(
 			phys_addr_t min_addr, phys_addr_t max_addr,
 			int nid)
 {
-	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n",
+	pr_info("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n",
 		     __func__, (u64)size, (u64)align, nid, &min_addr,
 		     &max_addr, (void *)_RET_IP_);
 
 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
>
Kirill A. Shutemov Sept. 26, 2022, 12:10 p.m. UTC | #32
On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
> > On 9/8/22 14:28, Mike Rapoport wrote:
> > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> > > > > 
> > > > > Looks like the first access to the memory map fails, although I think
> > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> > > > > 
> > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > > > > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map
> > > > > will poisoned in this function, so my guess is it'd crash there.
> > > > > 
> > > > 
> > > > That's a wonderful hint, thank you! I did not run this test
> > > > CONFIG_DEBUG_VM set, but you think it's possible it could still be
> > > > here?
> > > 
> > > It depends on how you configured your kernel. Say, defconfig does not set
> > > it.
> > > 
> > 
> > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
> > and fails in memmap_init_range() when attempting to add the first PFN. It
> > looks like the underlying page that is backing the vmemmap has not been
> > accepted (I receive a #VC 0x404 => page not validated).
> > 
> > Kirill, is this a path that you've looked at? It would appear that somewhere
> > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
> > done for the pages that are used to back vmemmap. I'm not very familiar with
> > this code, so I'm not sure why everything works for a guest with 255GB of
> > memory, but then fails for a guest with 256GB of memory.
> 
> Hm. I don't have machine that large at hands at the moment. And I have not
> looked at the codepath before.
> 
> I will try to look into the issue.

I'm not able to trigger the bug.

With the help of vm.overcommit_memory=1, I managed to boot a TDX guest to a
shell with 256G and 1T of guest memory just fine.

Any chance it is SEV-SNP specific?

Or maybe there is some difference in kernel config? Could you share yours?
Tom Lendacky Sept. 26, 2022, 1:38 p.m. UTC | #33
On 9/26/22 07:10, Kirill A. Shutemov wrote:
> On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
>> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
>>> On 9/8/22 14:28, Mike Rapoport wrote:
>>>> On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
>>>>>>
>>>>>> Looks like the first access to the memory map fails, although I think
>>>>>> it's not in INIT_LIST_HEAD() but rather in init_page_count().
>>>>>>
>>>>>> I'd start with making sure that page_alloc::memmap_alloc() actually returns
>>>>>> accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map
>>>>>> will poisoned in this function, so my guess is it'd crash there.
>>>>>>
>>>>>
>>>>> That's a wonderful hint, thank you! I did not run this test
>>>>> CONFIG_DEBUG_VM set, but you think it's possible it could still be
>>>>> here?
>>>>
>>>> It depends on how you configured your kernel. Say, defconfig does not set
>>>> it.
>>>>
>>>
>>> I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
>>> and fails in memmap_init_range() when attempting to add the first PFN. It
>>> looks like the underlying page that is backing the vmemmap has not been
>>> accepted (I receive a #VC 0x404 => page not validated).
>>>
>>> Kirill, is this a path that you've looked at? It would appear that somewhere
>>> in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
>>> done for the pages that are used to back vmemmap. I'm not very familiar with
>>> this code, so I'm not sure why everything works for a guest with 255GB of
>>> memory, but then fails for a guest with 256GB of memory.
>>
>> Hm. I don't have machine that large at hands at the moment. And I have not
>> looked at the codepath before.
>>
>> I will try to look into the issue.
> 
> I'm not able to trigger the bug.
> 
> With help of vm.overcommit_memory=1, I was managed boot TDX guest to shell
> with 256G and 1T of guest memory just fine.
> 
> Any chance it is SEV-SNP specific?

There's always a chance. I'll do some more tracing and see what I can find 
to try and be certain.

> 
> Or maybe there some difference in kernel config? Could you share yours?

Yes, I'll send that to you off-list.

Thanks,
Tom

>
Kirill A. Shutemov Sept. 26, 2022, 3:42 p.m. UTC | #34
On Mon, Sep 26, 2022 at 08:38:34AM -0500, Tom Lendacky wrote:
> On 9/26/22 07:10, Kirill A. Shutemov wrote:
> > On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
> > > > On 9/8/22 14:28, Mike Rapoport wrote:
> > > > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
> > > > > > > 
> > > > > > > Looks like the first access to the memory map fails, although I think
> > > > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count().
> > > > > > > 
> > > > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns
> > > > > > > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map
> > > > > > > will poisoned in this function, so my guess is it'd crash there.
> > > > > > > 
> > > > > > 
> > > > > > That's a wonderful hint, thank you! I did not run this test
> > > > > > CONFIG_DEBUG_VM set, but you think it's possible it could still be
> > > > > > here?
> > > > > 
> > > > > It depends on how you configured your kernel. Say, defconfig does not set
> > > > > it.
> > > > > 
> > > > 
> > > > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP
> > > > and fails in memmap_init_range() when attempting to add the first PFN. It
> > > > looks like the underlying page that is backing the vmemmap has not been
> > > > accepted (I receive a #VC 0x404 => page not validated).
> > > > 
> > > > Kirill, is this a path that you've looked at? It would appear that somewhere
> > > > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be
> > > > done for the pages that are used to back vmemmap. I'm not very familiar with
> > > > this code, so I'm not sure why everything works for a guest with 255GB of
> > > > memory, but then fails for a guest with 256GB of memory.
> > > 
> > > Hm. I don't have machine that large at hands at the moment. And I have not
> > > looked at the codepath before.
> > > 
> > > I will try to look into the issue.
> > 
> > I'm not able to trigger the bug.
> > 
> > With help of vm.overcommit_memory=1, I was managed boot TDX guest to shell
> > with 256G and 1T of guest memory just fine.
> > 
> > Any chance it is SEV-SNP specific?
> 
> There's always a chance. I'll do some more tracing and see what I can find
> to try and be certain.
> 
> > 
> > Or maybe there some difference in kernel config? Could you share yours?
> 
> Yes, I'll send that to you off-list.

Still nothing with your config :/
Tom Lendacky Sept. 26, 2022, 3:42 p.m. UTC | #35
On 9/26/22 08:38, Tom Lendacky wrote:
> On 9/26/22 07:10, Kirill A. Shutemov wrote:
>> On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote:
>>>> On 9/8/22 14:28, Mike Rapoport wrote:
>>>>> On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote:
>>>>>>>
>>>>>>> Looks like the first access to the memory map fails, although I think
>>>>>>> it's not in INIT_LIST_HEAD() but rather in init_page_count().
>>>>>>>
>>>>>>> I'd start with making sure that page_alloc::memmap_alloc() actually 
>>>>>>> returns
>>>>>>> accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the 
>>>>>>> memory map
>>>>>>> will poisoned in this function, so my guess is it'd crash there.
>>>>>>>
>>>>>>
>>>>>> That's a wonderful hint, thank you! I did not run this test
>>>>>> CONFIG_DEBUG_VM set, but you think it's possible it could still be
>>>>>> here?
>>>>>
>>>>> It depends on how you configured your kernel. Say, defconfig does not 
>>>>> set
>>>>> it.
>>>>>
>>>>
>>>> I also hit the issue at 256GB. My config is using 
>>>> CONFIG_SPARSEMEM_VMEMMAP
>>>> and fails in memmap_init_range() when attempting to add the first PFN. It
>>>> looks like the underlying page that is backing the vmemmap has not been
>>>> accepted (I receive a #VC 0x404 => page not validated).
>>>>
>>>> Kirill, is this a path that you've looked at? It would appear that 
>>>> somewhere
>>>> in the vmemmap_populate_hugepages() path, some memory acceptance needs 
>>>> to be
>>>> done for the pages that are used to back vmemmap. I'm not very 
>>>> familiar with
>>>> this code, so I'm not sure why everything works for a guest with 255GB of
>>>> memory, but then fails for a guest with 256GB of memory.
>>>
>>> Hm. I don't have machine that large at hands at the moment. And I have not
>>> looked at the codepath before.
>>>
>>> I will try to look into the issue.
>>
>> I'm not able to trigger the bug.
>>
>> With help of vm.overcommit_memory=1, I was managed boot TDX guest to shell
>> with 256G and 1T of guest memory just fine.
>>
>> Any chance it is SEV-SNP specific?
> 
> There's always a chance. I'll do some more tracing and see what I can find 
> to try and be certain.

Indeed, it was an issue in the SNP path: the number of pages, held in an
unsigned int, was shifted left by PAGE_SHIFT and added to an unsigned long.
The value being shifted was 0x100000, and the 32-bit shift wrapped the
result to 0. Changing the number of pages to an unsigned long fixed it.
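
The wrap-around is easy to demonstrate in isolation (stand-alone demo,
variable names made up; 0x100000 pages is the 4GB chunk from the report
above):

#include <stdio.h>

int main(void)
{
	unsigned int npages = 0x100000;	/* 4 GiB worth of 4 KiB pages */

	/* Shift is done in 32 bits and wraps to 0 before widening. */
	unsigned long bad = (unsigned long)(npages << 12);

	/* Widening the page count first gives the intended 0x100000000. */
	unsigned long good = (unsigned long)npages << 12;

	printf("bad=%#lx good=%#lx\n", bad, good);
	return 0;
}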

There are a few places where that is being done, so I'll address those 
with some pre-patches as fixes.

Thanks for verifying it was working for you.

Tom

> 
>>
>> Or maybe there some difference in kernel config? Could you share yours?
> 
> Yes, I'll send that to you off-list.
> 
> Thanks,
> Tom
> 
>>
diff mbox series

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e66f7aa3191d..444ba8f4bfa0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -942,6 +942,14 @@  static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_unaccepted	0x00000800
+
+/*
+ * Page types allowed at page_expected_state()
+ *
+ * PageUnaccepted() will get cleared in post_alloc_hook().
+ */
+#define PAGE_TYPES_EXPECTED	PG_unaccepted
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -967,6 +975,18 @@  static __always_inline void __ClearPage##uname(struct page *page)	\
 	page->page_type |= PG_##lname;					\
 }
 
+#define PAGE_TYPE_OPS_FALSE(uname)					\
+static __always_inline int Page##uname(struct page *page)		\
+{									\
+	return false;							\
+}									\
+static __always_inline void __SetPage##uname(struct page *page)		\
+{									\
+}									\
+static __always_inline void __ClearPage##uname(struct page *page)	\
+{									\
+}
+
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
@@ -997,6 +1017,17 @@  PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+/*
+ * PageUnaccepted() indicates that the page has to be "accepted" before it can
+ * be read or written. The page allocator must call accept_page() before
+ * touching the page or returning it to the caller.
+ */
+#ifdef CONFIG_UNACCEPTED_MEMORY
+PAGE_TYPE_OPS(Unaccepted, unaccepted)
+#else
+PAGE_TYPE_OPS_FALSE(Unaccepted)
+#endif
+
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
 extern void page_offline_begin(void);
diff --git a/mm/internal.h b/mm/internal.h
index c0f8fbe0445b..7c1a653edfc8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -861,4 +861,16 @@  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 
 DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline bool range_contains_unaccepted_memory(phys_addr_t start,
+						    phys_addr_t end)
+{
+	return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index e4f03a6e8e56..a1f7f8b304d5 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1405,6 +1405,15 @@  phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		 */
 		kmemleak_alloc_phys(found, size, 0, 0);
 
+	/*
+	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+	 * require memory to be accepted before it can be used by the
+	 * guest.
+	 *
+	 * Accept the memory of the allocated buffer.
+	 */
+	accept_memory(found, found + size);
+
 	return found;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e008a3df0485..279c2746aaa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -122,6 +122,12 @@  typedef int __bitwise fpi_t;
  */
 #define FPI_SKIP_KASAN_POISON	((__force fpi_t)BIT(2))
 
+/*
+ * Check if the page needs to be marked as PageUnaccepted().
+ * Used for the new pages added to the buddy allocator for the first time.
+ */
+#define FPI_UNACCEPTED_SLOWPATH	((__force fpi_t)BIT(3))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -993,6 +999,35 @@  buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
 			NULL) != NULL;
 }
 
+/*
+ * Page acceptance can be very slow. Do not call under critical locks.
+ */
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	int i;
+
+	if (!PageUnaccepted(page))
+		return;
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+	__ClearPageUnaccepted(page);
+
+	/* Assert that there is no PageUnaccepted() on tail pages */
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		for (i = 1; i < (1 << order); i++)
+			VM_BUG_ON_PAGE(PageUnaccepted(page + i), page + i);
+	}
+}
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	phys_addr_t end = start + (PAGE_SIZE << order);
+
+	return range_contains_unaccepted_memory(start, end);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -1027,6 +1062,7 @@  static inline void __free_one_page(struct page *page,
 	unsigned long combined_pfn;
 	struct page *buddy;
 	bool to_tail;
+	bool page_needs_acceptance = false;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1038,6 +1074,11 @@  static inline void __free_one_page(struct page *page,
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
+	if (PageUnaccepted(page)) {
+		page_needs_acceptance = true;
+		__ClearPageUnaccepted(page);
+	}
+
 	while (order < MAX_ORDER - 1) {
 		if (compaction_capture(capc, page, order, migratetype)) {
 			__mod_zone_freepage_state(zone, -(1 << order),
@@ -1072,6 +1113,13 @@  static inline void __free_one_page(struct page *page,
 			clear_page_guard(zone, buddy, order, migratetype);
 		else
 			del_page_from_free_list(buddy, zone, order);
+
+		/* Mark the page unaccepted if any of the merged pages was unaccepted */
+		if (PageUnaccepted(buddy)) {
+			page_needs_acceptance = true;
+			__ClearPageUnaccepted(buddy);
+		}
+
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -1081,6 +1129,23 @@  static inline void __free_one_page(struct page *page,
 done_merging:
 	set_buddy_order(page, order);
 
+	/*
+	 * The page gets marked as PageUnaccepted() if any of the merged-in pages
+	 * is PageUnaccepted().
+	 *
+	 * New pages, just added to the buddy allocator, do not have
+	 * PageUnaccepted() set. FPI_UNACCEPTED_SLOWPATH indicates that the
+	 * page is new and a page_contains_unaccepted() check is required to
+	 * determine whether acceptance is needed.
+	 *
+	 * Avoid calling page_contains_unaccepted() when it is already known that
+	 * the page needs acceptance: the check can be costly.
+	 */
+	if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH))
+		page_needs_acceptance = page_contains_unaccepted(page, order);
+	if (page_needs_acceptance)
+		__SetPageUnaccepted(page);
+
 	if (fpi_flags & FPI_TO_TAIL)
 		to_tail = true;
 	else if (is_shuffle_order(order))
@@ -1164,7 +1229,13 @@  int split_free_page(struct page *free_page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	/*
+	 * The page must not be mapped to userspace and must not have
+	 * a PageType other than those listed in PAGE_TYPES_EXPECTED.
+	 *
+	 * Note: a cleared bit means the page type is set.
+	 */
+	if (unlikely((atomic_read(&page->_mapcount) | PAGE_TYPES_EXPECTED) != -1))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
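
The bit trick in this hunk is easy to misread, so here is a stand-alone demonstration of why OR-ing PAGE_TYPES_EXPECTED into _mapcount works: a set page type clears its bit, so the OR restores -1 exactly when the only types present are the tolerated ones. The mask values and the PAGE_TYPES_EXPECTED definition below are illustrative assumptions, not taken from the patch.

/* Stand-alone model of the page_expected_state() type check (masks are illustrative). */
#include <stdio.h>

#define PG_unaccepted		0x00000008	/* assumed bit for PageUnaccepted */
#define PG_table		0x00000200	/* assumed bit for a non-tolerated type */
#define PAGE_TYPES_EXPECTED	PG_unaccepted	/* types tolerated on a page being freed */

static int expected(int mapcount)
{
	/* A set type clears its bit, so OR-ing the tolerated bits back restores -1. */
	return (mapcount | PAGE_TYPES_EXPECTED) == -1;
}

int main(void)
{
	int plain = -1;				/* no page type, not mapped */
	int unaccepted = -1 & ~PG_unaccepted;	/* PageUnaccepted set */
	int table = -1 & ~PG_table;		/* some other page type set */
	int mapped = 0;				/* mapped to userspace once */

	printf("plain:      %d\n", expected(plain));		/* 1 */
	printf("unaccepted: %d\n", expected(unaccepted));	/* 1 */
	printf("table:      %d\n", expected(table));		/* 0 */
	printf("mapped:     %d\n", expected(mapped));		/* 0 */
	return 0;
}
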
@@ -1669,7 +1740,9 @@  void __free_pages_core(struct page *page, unsigned int order)
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
 	 */
-	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
+	__free_pages_ok(page, order,
+			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON |
+			FPI_UNACCEPTED_SLOWPATH);
 }
 
 #ifdef CONFIG_NUMA
@@ -1822,6 +1895,9 @@  static void __init deferred_free_range(unsigned long pfn,
 		return;
 	}
 
+	/* Accept chunks smaller than a pageblock upfront */
+	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
+
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -2281,6 +2357,13 @@  static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
+		/*
+		 * Transfer PageUnaccepted() to the newly split pages so
+		 * they can be accepted after dropping the zone lock.
+		 */
+		if (PageUnaccepted(page))
+			__SetPageUnaccepted(&page[size]);
+
 		add_to_free_list(&page[size], zone, high, migratetype);
 		set_buddy_order(&page[size], high);
 	}
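
When a large unaccepted page is split to satisfy a smaller request, every piece placed back on the free lists has to keep the marker, otherwise a later allocation of that piece would skip acceptance. A stripped-down model of that transfer follows; the structure and names are invented for the sketch and only mirror the expand() loop above.

/* Minimal model of transferring the "unaccepted" marker while splitting a page. */
#include <stdbool.h>
#include <stdio.h>

#define TOP_ORDER 4

struct fake_page { bool unaccepted; };

/* Split a page of order @high down to order @low, keeping page[0] for the caller. */
static void expand(struct fake_page *page, int low, int high)
{
	int size = 1 << high;

	while (high > low) {
		size >>= 1;
		high--;

		/* The split-off upper half goes back on the free list... */
		if (page->unaccepted)
			page[size].unaccepted = true;	/* ...and must keep the marker */
	}
}

int main(void)
{
	struct fake_page pages[1 << TOP_ORDER] = { [0] = { .unaccepted = true } };

	expand(pages, 0, TOP_ORDER);	/* order-4 unaccepted page split down to order 0 */

	/* Only the head page of each split-off chunk carries the marker. */
	printf("caller's page: %d, freed heads: %d %d %d %d\n",
	       pages[0].unaccepted, pages[1].unaccepted, pages[2].unaccepted,
	       pages[4].unaccepted, pages[8].unaccepted);
	return 0;
}
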
@@ -2411,6 +2494,8 @@  inline void post_alloc_hook(struct page *page, unsigned int order,
 	 */
 	kernel_unpoison_pages(page, 1 << order);
 
+	accept_page(page, order);
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * KASAN unpoisoning and memory initializion code must be