
[Part2,v5,08/45] x86/fault: Add support to handle the RMP fault for user address

Message ID 20210820155918.7518-9-brijesh.singh@amd.com
State New
Series [Part2,v5,01/45] x86/cpufeatures: Add SEV-SNP CPU feature

Commit Message

Brijesh Singh Aug. 20, 2021, 3:58 p.m. UTC
When SEV-SNP is enabled globally, a write from the host goes through the
RMP check. When the host writes to pages, hardware checks the following
conditions at the end of the page walk:

1. The Assigned bit in the RMP table is zero (i.e. the page is shared).
2. If the page table entry that gives the sPA indicates that the target
   page size is a large page, then all RMP entries for the 4KB
   constituent pages of the target must have the Assigned bit 0.
3. The Immutable bit in the RMP table is not zero.

The hardware raises a page fault if one of the above conditions is not
met. Try resolving the fault instead of taking the fault again and again.
If the host attempts to write to guest private memory, then send a SIGBUS
signal to kill the process. If the page level of the host mapping does not
match the RMP entry, then split the host mapping to keep the RMP and host
page levels in sync.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/mm/fault.c | 66 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h  |  6 ++++-
 mm/memory.c         | 13 +++++++++
 3 files changed, 84 insertions(+), 1 deletion(-)

Comments

Dave Hansen Aug. 23, 2021, 2:20 p.m. UTC | #1
On 8/20/21 8:58 AM, Brijesh Singh wrote:
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> +	if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
> +		return VM_FAULT_SIGBUS;
> +
> +	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> +	return 0;
> +}

We had a whole conversation the last time this was posted about huge
page types *other* than THP.  I don't see any comprehension of those
types or what would happen if one of those was used with SEV-SNP.

What was the result of those review comments?

I'm still worried that hugetlbfs (and others) are not properly handled
by this series.
Brijesh Singh Aug. 23, 2021, 2:36 p.m. UTC | #2
Hi Dave,

On 8/23/21 9:20 AM, Dave Hansen wrote:
> On 8/20/21 8:58 AM, Brijesh Singh wrote:
>> +static int handle_split_page_fault(struct vm_fault *vmf)
>> +{
>> +	if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
>> +		return VM_FAULT_SIGBUS;
>> +
>> +	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>> +	return 0;
>> +}
> 
> We had a whole conversation the last time this was posted about huge
> page types *other* than THP.  I don't see any comprehension of those
> types or what would happen if one of those was used with SEV-SNP.
> 
> What was the result of those review comments?

Based on previous review comments, Sean was not keen on having KVM
perform this detection and abort the SEV-SNP guest VM launch. So I
didn't implement the check and am waiting for more discussion before
going at it.

An SEV-SNP guest requires the VMM to register the guest backing pages
before the VM launch. Personally, I would prefer KVM to check the
backing page type during the registration and fail the registration if
it is hugetlbfs (or similar), to avoid getting into a situation where
we cannot split the hugepage.

thanks

> I'm still worried that hugetlbfs (and others) are not properly handled
> by this series.
Dave Hansen Aug. 23, 2021, 2:50 p.m. UTC | #3
On 8/23/21 7:36 AM, Brijesh Singh wrote:
> On 8/23/21 9:20 AM, Dave Hansen wrote:
>> On 8/20/21 8:58 AM, Brijesh Singh wrote:
>>> +static int handle_split_page_fault(struct vm_fault *vmf)
>>> +{
>>> +    if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
>>> +        return VM_FAULT_SIGBUS;
>>> +
>>> +    __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>>> +    return 0;
>>> +}
>>
>> We had a whole conversation the last time this was posted about huge
>> page types *other* than THP.  I don't see any comprehension of those
>> types or what would happen if one of those was used with SEV-SNP.
>>
>> What was the result of those review comments?
> 
> Based on previous review comments, Sean was not keen on having KVM
> perform this detection and abort the SEV-SNP guest VM launch. So I
> didn't implement the check and am waiting for more discussion before
> going at it.

OK.  But, you need to *acknowledge* the situation somewhere.  Maybe the
cover letter of the series, maybe in this changelog.

As it stands, it looks like you're simply ignoring _some_ reviewer feedback.

> An SEV-SNP guest requires the VMM to register the guest backing pages
> before the VM launch. Personally, I would prefer KVM to check the
> backing page type during the registration and fail the registration if
> it is hugetlbfs (or similar), to avoid getting into a situation where
> we cannot split the hugepage.

It *has* to be done in KVM, IMNHO.

The core kernel really doesn't know much about SEV.  It *really* doesn't
know when its memory is being exposed to a virtualization architecture
that doesn't know how to split TLBs like every single one before it.

This essentially *must* be done at the time that the KVM code realizes
that it's being asked to shove a non-splittable page mapping into the
SEV hardware structures.

The only other alternative is raising a signal from the fault handler
when the page can't be split.  That's a *LOT* nastier because it's so
much later in the process.

It's either that, or figure out a way to split hugetlbfs (and DAX)
mappings in a failsafe way.
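
For illustration only, the fault-handler alternative would amount to
something like this (a sketch, not code from the series): have
handle_split_page_fault() refuse the mappings the kernel cannot split,
so the process gets the signal only at fault time:

	static int handle_split_page_fault(struct vm_fault *vmf)
	{
		if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
			return VM_FAULT_SIGBUS;

		/* No generic way to split hugetlbfs or DAX mappings today: give up. */
		if (is_vm_hugetlb_page(vmf->vma) || vma_is_dax(vmf->vma))
			return VM_FAULT_SIGBUS;

		__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
		return 0;
	}

That is exactly the "much later in the process" problem: the failure
only shows up once the guest is already running.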
Joerg Roedel Aug. 24, 2021, 4:42 p.m. UTC | #4
On Mon, Aug 23, 2021 at 07:50:22AM -0700, Dave Hansen wrote:
> It *has* to be done in KVM, IMNHO.
> 
> The core kernel really doesn't know much about SEV.  It *really* doesn't
> know when its memory is being exposed to a virtualization architecture
> that doesn't know how to split TLBs like every single one before it.
> 
> This essentially *must* be done at the time that the KVM code realizes
> that it's being asked to shove a non-splittable page mapping into the
> SEV hardware structures.
> 
> The only other alternative is raising a signal from the fault handler
> when the page can't be split.  That's a *LOT* nastier because it's so
> much later in the process.
> 
> It's either that, or figure out a way to split hugetlbfs (and DAX)
> mappings in a failsafe way.

Yes, I agree with that. KVM needs a check to disallow HugeTLB pages in
SEV-SNP guests, at least as a temporary workaround. When HugeTLBfs
mappings can be split into smaller pages the check can be removed.

Regards,

	Joerg
Vlastimil Babka Aug. 25, 2021, 9:16 a.m. UTC | #5
On 8/24/21 18:42, Joerg Roedel wrote:
> On Mon, Aug 23, 2021 at 07:50:22AM -0700, Dave Hansen wrote:
>> It *has* to be done in KVM, IMNHO.
>>
>> The core kernel really doesn't know much about SEV.  It *really* doesn't
>> know when its memory is being exposed to a virtualization architecture
>> that doesn't know how to split TLBs like every single one before it.
>>
>> This essentially *must* be done at the time that the KVM code realizes
>> that it's being asked to shove a non-splittable page mapping into the
>> SEV hardware structures.
>>
>> The only other alternative is raising a signal from the fault handler
>> when the page can't be split.  That's a *LOT* nastier because it's so
>> much later in the process.
>>
>> It's either that, or figure out a way to split hugetlbfs (and DAX)
>> mappings in a failsafe way.
> 
> Yes, I agree with that. KVM needs a check to disallow HugeTLB pages in
> SEV-SNP guests, at least as a temporary workaround. When HugeTLBfs
> mappings can be split into smaller pages the check can be removed.

FTR, this is Sean's reply with concerns in v4:
https://lore.kernel.org/linux-coco/YPCuTiNET%2FhJHqOY@google.com/

I think there are two main arguments there:
- it's not KVM's business to decide
- the guest may do all page state changes with 2MB granularity, so it might
  be fine with hugetlb

The latter might become true, but I think it's more probable that sooner
hugetlbfs will learn to split the mappings to base pages - I know people plan
to work on that. At that point qemu will have to recognize if the host kernel
is the new one that can do this splitting vs an older one that can't.
Preferably without relying on the kernel version number, as backports exist.
Thus, trying to register a hugetlbfs range that either is rejected (kernel
can't split) or passes (kernel can split) seems like a straightforward way.
So I'm also in favor of adding that, hopefully temporary, check.

Vlastimil

> Regards,
> 
> 	Joerg
Tom Lendacky Aug. 25, 2021, 1:50 p.m. UTC | #6
On 8/25/21 4:16 AM, Vlastimil Babka wrote:
> On 8/24/21 18:42, Joerg Roedel wrote:
>> On Mon, Aug 23, 2021 at 07:50:22AM -0700, Dave Hansen wrote:
>>> It *has* to be done in KVM, IMNHO.
>>>
>>> The core kernel really doesn't know much about SEV.  It *really* doesn't
>>> know when its memory is being exposed to a virtualization architecture
>>> that doesn't know how to split TLBs like every single one before it.
>>>
>>> This essentially *must* be done at the time that the KVM code realizes
>>> that it's being asked to shove a non-splittable page mapping into the
>>> SEV hardware structures.
>>>
>>> The only other alternative is raising a signal from the fault handler
>>> when the page can't be split.  That's a *LOT* nastier because it's so
>>> much later in the process.
>>>
>>> It's either that, or figure out a way to split hugetlbfs (and DAX)
>>> mappings in a failsafe way.
>>
>> Yes, I agree with that. KVM needs a check to disallow HugeTLB pages in
>> SEV-SNP guests, at least as a temporary workaround. When HugeTLBfs
>> mappings can be split into smaller pages the check can be removed.
> 
> FTR, this is Sean's reply with concerns in v4:
> https://lore.kernel.org/linux-coco/YPCuTiNET%2FhJHqOY@google.com/
> 
> I think there are two main arguments there:
> - it's not KVM's business to decide
> - the guest may do all page state changes with 2MB granularity, so it might
>   be fine with hugetlb
> 
> The latter might become true, but I think it's more probable that sooner
> hugetlbfs will learn to split the mappings to base pages - I know people plan
> to work on that. At that point qemu will have to recognize if the host kernel
> is the new one that can do this splitting vs an older one that can't.
> Preferably without relying on the kernel version number, as backports exist.
> Thus, trying to register a hugetlbfs range that either is rejected (kernel
> can't split) or passes (kernel can split) seems like a straightforward way.
> So I'm also in favor of adding that, hopefully temporary, check.

If that's the direction taken, I think we'd be able to use a KVM_CAP_
value that can be queried by the VMM to make the determination.
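
Something along these lines on the VMM side, where the capability name
and number are purely illustrative placeholders, not an existing KVM cap:

	#include <stdbool.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Placeholder capability number; the real value would come from linux/kvm.h. */
	#define KVM_CAP_SNP_SPLITTABLE_BACKING	999

	static bool kvm_can_split_hugetlb_backing(int kvm_fd)
	{
		/* KVM_CHECK_EXTENSION returns > 0 when the capability is supported. */
		return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SNP_SPLITTABLE_BACKING) > 0;
	}

That way qemu can decide up front whether hugetlbfs backing is usable,
without guessing from the kernel version.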

Thanks,
Tom

> Vlastimil
> 
>> Regards,
>> 
>> 	Joerg
Borislav Petkov Sept. 29, 2021, 6:19 p.m. UTC | #7
On Fri, Aug 20, 2021 at 10:58:41AM -0500, Brijesh Singh wrote:
> +static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
> +				      unsigned long address)
> +{

#ifdef CONFIG_AMD_MEM_ENCRYPT

> +	int rmp_level, level;
> +	pte_t *pte;
> +	u64 pfn;
> +
> +	pte = lookup_address_in_mm(current->mm, address, &level);
> +
> +	/*
> +	 * This can happen if there was a race between an unmap event and
> +	 * the RMP fault delivery.
> +	 */
> +	if (!pte || !pte_present(*pte))
> +		return 1;
> +
> +	pfn = pte_pfn(*pte);
> +
> +	/* If it's a large page then calculate the fault pfn */
> +	if (level > PG_LEVEL_4K) {
> +		unsigned long mask;
> +
> +		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);

Just use two helper variables named properly instead of this oneliner:

		pages_level      = page_level_size(level) / PAGE_SIZE;
		pages_prev_level = page_level_size(level - 1) / PAGE_SIZE;

> +		pfn |= (address >> PAGE_SHIFT) & mask;
> +	}
> +
> +	/*
> +	 * If it's a guest private page, then the fault cannot be resolved.
> +	 * Send a SIGBUS to terminate the process.
> +	 */
> +	if (snp_lookup_rmpentry(pfn, &rmp_level)) {
> +		do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
> +		return 1;
> +	}
> +
> +	/*
> +	 * The backing page level is higher than the RMP page level, request
> +	 * to split the page.
> +	 */
> +	if (level > rmp_level)
> +		return 0;
> +
> +	return 1;

#else
	WARN_ON_ONCE(1);
	return -1;
#endif

and also handle that -1 negative value at the call site.
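
For reference, a rough sketch (not the submitted patch) of how the
function might look with both suggestions folded in, i.e. the
CONFIG_AMD_MEM_ENCRYPT guard plus the named helper variables; the -1
case would still need to be handled in do_user_addr_fault():

	static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
					      unsigned long address)
	{
	#ifdef CONFIG_AMD_MEM_ENCRYPT
		int rmp_level, level;
		pte_t *pte;
		u64 pfn;

		pte = lookup_address_in_mm(current->mm, address, &level);

		/* Racing with an unmap event: let the fault be retried. */
		if (!pte || !pte_present(*pte))
			return 1;

		pfn = pte_pfn(*pte);

		/* For a large mapping, compute the pfn of the 4KB page that faulted. */
		if (level > PG_LEVEL_4K) {
			unsigned long pages_level      = page_level_size(level) / PAGE_SIZE;
			unsigned long pages_prev_level = page_level_size(level - 1) / PAGE_SIZE;

			pfn |= (address >> PAGE_SHIFT) & (pages_level - pages_prev_level);
		}

		/* Guest private page: the write can never succeed, terminate the process. */
		if (snp_lookup_rmpentry(pfn, &rmp_level)) {
			do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
			return 1;
		}

		/* Host mapping is larger than the RMP entry: ask the caller to split it. */
		if (level > rmp_level)
			return 0;

		return 1;
	#else
		WARN_ON_ONCE(1);
		return -1;
	#endif
	}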

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Patch

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8b7a5757440e..f2d543b92f43 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@ 
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_crash_gracefully_on_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/sev.h>			/* snp_lookup_rmpentry()	*/
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1202,6 +1203,60 @@  do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 }
 NOKPROBE_SYMBOL(do_kern_addr_fault);
 
+static inline size_t pages_per_hpage(int level)
+{
+	return page_level_size(level) / PAGE_SIZE;
+}
+
+/*
+ * Return 1 if the caller needs to retry, 0 if the address needs to be split
+ * in order to resolve the fault.
+ */
+static int handle_user_rmp_page_fault(struct pt_regs *regs, unsigned long error_code,
+				      unsigned long address)
+{
+	int rmp_level, level;
+	pte_t *pte;
+	u64 pfn;
+
+	pte = lookup_address_in_mm(current->mm, address, &level);
+
+	/*
+	 * This can happen if there was a race between an unmap event and
+	 * the RMP fault delivery.
+	 */
+	if (!pte || !pte_present(*pte))
+		return 1;
+
+	pfn = pte_pfn(*pte);
+
+	/* If it's a large page then calculate the fault pfn */
+	if (level > PG_LEVEL_4K) {
+		unsigned long mask;
+
+		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
+		pfn |= (address >> PAGE_SHIFT) & mask;
+	}
+
+	/*
+	 * If it's a guest private page, then the fault cannot be resolved.
+	 * Send a SIGBUS to terminate the process.
+	 */
+	if (snp_lookup_rmpentry(pfn, &rmp_level)) {
+		do_sigbus(regs, error_code, address, VM_FAULT_SIGBUS);
+		return 1;
+	}
+
+	/*
+	 * The backing page level is higher than the RMP page level, request
+	 * to split the page.
+	 */
+	if (level > rmp_level)
+		return 0;
+
+	return 1;
+}
+
 /*
  * Handle faults in the user portion of the address space.  Nothing in here
  * should check X86_PF_USER without a specific justification: for almost
@@ -1299,6 +1354,17 @@  void do_user_addr_fault(struct pt_regs *regs,
 	if (error_code & X86_PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
 
+	/*
+	 * If it's an RMP violation, try resolving it.
+	 */
+	if (error_code & X86_PF_RMP) {
+		if (handle_user_rmp_page_fault(regs, error_code, address))
+			return;
+
+		/* Ask to split the page */
+		flags |= FAULT_FLAG_PAGE_SPLIT;
+	}
+
 #ifdef CONFIG_X86_64
 	/*
 	 * Faults in the vsyscall page might need emulation.  The
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..74a53c146365 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -447,6 +447,8 @@  extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PAGE_SPLIT: The fault was due to a page size mismatch; split the
+ *  region to a smaller page size and retry.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -478,6 +480,7 @@  enum fault_flag {
 	FAULT_FLAG_REMOTE =		1 << 7,
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
+	FAULT_FLAG_PAGE_SPLIT =		1 << 10,
 };
 
 /*
@@ -517,7 +520,8 @@  static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
 	{ FAULT_FLAG_USER,		"USER" }, \
 	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
 	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
-	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
+	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
+	{ FAULT_FLAG_PAGE_SPLIT,	"PAGESPLIT" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/mm/memory.c b/mm/memory.c
index 747a01d495f2..27e6ccec3fc1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4589,6 +4589,15 @@  static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+static int handle_split_page_fault(struct vm_fault *vmf)
+{
+	if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
+		return VM_FAULT_SIGBUS;
+
+	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
+	return 0;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
@@ -4666,6 +4675,10 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
 		}
+
+		if (flags & FAULT_FLAG_PAGE_SPLIT)
+			return handle_split_page_fault(&vmf);
+
 		if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
 			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf);