mbox series

[V3,00/13] x86/Hyper-V: Add Hyper-V Isolation VM support

Message ID 20210809175620.720923-1-ltykernel@gmail.com
Headers show
Series x86/Hyper-V: Add Hyper-V Isolation VM support | expand

Message

Tianyu Lan Aug. 9, 2021, 5:56 p.m. UTC
From: Tianyu Lan <Tianyu.Lan@microsoft.com>

Hyper-V provides two kinds of Isolation VMs: VBS (Virtualization-Based
Security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset
adds support for these Isolation VMs in Linux.

The memory of these VMs is encrypted and the host can't access guest
memory directly. Hyper-V provides a new host visibility hvcall, and
the guest needs to call this hvcall to mark memory visible to the host
before sharing it with the host. For security, network/storage stack
memory should not be shared with the host directly, so bounce buffers
are required.

The VMbus channel ring buffer already plays the bounce buffer role
because all data to/from the host is copied between the ring buffer
and IO stack memory. So mark the VMbus channel ring buffer visible to
the host.

There are two exceptions: packets sent by vmbus_sendpacket_
pagebuffer() and vmbus_sendpacket_mpb_desc(). These packets
contain IO stack memory addresses that the host will access directly.
So add bounce buffer allocation support in VMbus for these packets.

For an SNP Isolation VM, the guest needs to access the shared memory
via an extra address space specified by the Hyper-V CPUID leaf
HYPERV_CPUID_ISOLATION_CONFIG. The physical address used to access the
shared memory is the bounce buffer memory GPA plus the
shared_gpa_boundary reported by that CPUID leaf.


Changes since V2:
       - Drop the x86_set_memory_enc static call and use a platform check
         in __set_memory_enc_dec() to run the platform callback that sets
         memory encrypted or decrypted.

Changes since V1:
       - Introduce an x86_set_memory_enc static call so that platforms can
         override __set_memory_enc_dec() with their own implementation
       - Introduce sev_es_ghcb_hv_call_simple() and share code
         between SEV and Hyper-V code.
       - Do not remap monitor pages in the non-SNP Isolation VM
       - Make swiotlb_init_io_tlb_mem() return an error code and return
         an error when dma_map_decrypted() fails.

Changes since RFC V4:
       - Introduce a dma map decrypted function to remap the bounce buffer
         and provide dma map decrypted ops for platforms to hook a callback.
       - Split the swiotlb and dma map decrypted changes into two patches
       - Replace vstart with vaddr in the swiotlb changes.

Changes since RFC v3:
       - Add interface set_memory_decrypted_map() to decrypt memory and
         map the bounce buffer in the extra address space
       - Remove the swiotlb remap function and store the remap address
         returned by set_memory_decrypted_map() in the swiotlb mem data structure.
       - Introduce hv_set_mem_enc() to make the code in __set_memory_enc_dec()
         more readable.

Changes since RFC v2:
       - Remove the UIO driver change from the Isolation VM patch
       - Use vmap_pfn() to replace the ioremap_page_range() function in
         order to avoid exposing the symbol ioremap_page_range()
       - Call the Hyper-V set memory host visibility hvcall in
         set_memory_encrypted()/decrypted()
       - Enable swiotlb force mode instead of adding Hyper-V dma map/unmap hooks
       - Fix code style


Tianyu Lan (13):
  x86/HV: Initialize GHCB page in Isolation VM
  x86/HV: Initialize shared memory boundary in the Isolation VM.
  x86/HV: Add new hvcall guest address host visibility support
  HV: Mark vmbus ring buffer visible to host in Isolation VM
  HV: Add Write/Read MSR registers via ghcb page
  HV: Add ghcb hvcall support for SNP VM
  HV/Vmbus: Add SNP support for VMbus channel initiate message
  HV/Vmbus: Initialize VMbus ring buffer for Isolation VM
  DMA: Add dma_map_decrypted/dma_unmap_encrypted() function
  x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
  HV/IOMMU: Enable swiotlb bounce buffer for Isolation VM
  HV/Netvsc: Add Isolation VM support for netvsc driver
  HV/Storvsc: Add Isolation VM support for storvsc driver

 arch/x86/hyperv/Makefile           |   2 +-
 arch/x86/hyperv/hv_init.c          |  75 ++++++--
 arch/x86/hyperv/ivm.c              | 295 +++++++++++++++++++++++++++++
 arch/x86/include/asm/hyperv-tlfs.h |  20 ++
 arch/x86/include/asm/mshyperv.h    |  87 ++++++++-
 arch/x86/include/asm/sev.h         |   3 +
 arch/x86/kernel/cpu/mshyperv.c     |   5 +
 arch/x86/kernel/sev-shared.c       |  63 +++---
 arch/x86/mm/pat/set_memory.c       |  19 +-
 arch/x86/xen/pci-swiotlb-xen.c     |   3 +-
 drivers/hv/Kconfig                 |   1 +
 drivers/hv/channel.c               |  54 +++++-
 drivers/hv/connection.c            |  71 ++++++-
 drivers/hv/hv.c                    | 129 +++++++++----
 drivers/hv/hyperv_vmbus.h          |   3 +
 drivers/hv/ring_buffer.c           |  84 ++++++--
 drivers/hv/vmbus_drv.c             |   3 +
 drivers/iommu/hyperv-iommu.c       |  65 +++++++
 drivers/net/hyperv/hyperv_net.h    |   6 +
 drivers/net/hyperv/netvsc.c        | 144 +++++++++++++-
 drivers/net/hyperv/rndis_filter.c  |   2 +
 drivers/scsi/storvsc_drv.c         |  68 ++++++-
 include/asm-generic/hyperv-tlfs.h  |   1 +
 include/asm-generic/mshyperv.h     |  54 +++++-
 include/linux/dma-map-ops.h        |   9 +
 include/linux/hyperv.h             |  17 ++
 include/linux/swiotlb.h            |   4 +
 kernel/dma/mapping.c               |  22 +++
 kernel/dma/swiotlb.c               |  32 +++-
 29 files changed, 1212 insertions(+), 129 deletions(-)
 create mode 100644 arch/x86/hyperv/ivm.c

Comments

Wei Liu Aug. 10, 2021, 10:56 a.m. UTC | #1
On Mon, Aug 09, 2021 at 01:56:05PM -0400, Tianyu Lan wrote:
[...]
>  static int hv_cpu_init(unsigned int cpu)
>  {
>  	union hv_vp_assist_msr_contents msr = { 0 };
> @@ -85,6 +111,8 @@ static int hv_cpu_init(unsigned int cpu)
>  		}
>  	}
>  
> +	hyperv_init_ghcb();
> +

Why is the return value not checked here? If that's not required, can
you leave a comment?

Wei.
Tianyu Lan Aug. 10, 2021, 12:17 p.m. UTC | #2
Hi Wei:
       Thanks for the review.

On 8/10/2021 6:56 PM, Wei Liu wrote:
> On Mon, Aug 09, 2021 at 01:56:05PM -0400, Tianyu Lan wrote:
> [...]
>>   static int hv_cpu_init(unsigned int cpu)
>>   {
>>   	union hv_vp_assist_msr_contents msr = { 0 };
>> @@ -85,6 +111,8 @@ static int hv_cpu_init(unsigned int cpu)
>>   		}
>>   	}
>>   
>> +	hyperv_init_ghcb();
>> +
> 
> Why is the return value not checked here? If that's not required, can
> you leave a comment?
> 


The check is necessary here. Will update in the next version.
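
For reference, a minimal sketch of the shape of that check in
hv_cpu_init() (an illustration only, not the final patch):

	ret = hyperv_init_ghcb();
	if (ret)
		return ret;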

Thanks.
Christoph Hellwig Aug. 12, 2021, 12:26 p.m. UTC | #3
On Mon, Aug 09, 2021 at 01:56:13PM -0400, Tianyu Lan wrote:
> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> 
> In Hyper-V Isolation VM with AMD SEV, swiotlb boucne buffer
> needs to be mapped into address space above vTOM and so
> introduce dma_map_decrypted/dma_unmap_encrypted() to map/unmap
> bounce buffer memory. The platform can populate man/unmap callback
> in the dma memory decrypted ops.

Nothing here looks actually DMA related, and the names are horribly
confusing vs the actual dma_map_* calls.
Tianyu Lan Aug. 12, 2021, 3:38 p.m. UTC | #4
On 8/12/2021 8:26 PM, Christoph Hellwig wrote:
> On Mon, Aug 09, 2021 at 01:56:13PM -0400, Tianyu Lan wrote:
>> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
>>
>> In Hyper-V Isolation VM with AMD SEV, swiotlb boucne buffer
>> needs to be mapped into address space above vTOM and so
>> introduce dma_map_decrypted/dma_unmap_encrypted() to map/unmap
>> bounce buffer memory. The platform can populate man/unmap callback
>> in the dma memory decrypted ops.
> 
> Nothing here looks actually DMA related, and the names are horribly
> confusing vs the actual dma_map_* calls.
> 

OK. So this change still needs to be kept in the set_memory.c file.
Michael Kelley Aug. 12, 2021, 7:14 p.m. UTC | #5
From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM
> Subject: [PATCH V3 01/13] x86/HV: Initialize GHCB page in Isolation VM

The subject line tag on patches under arch/x86/hyperv is generally "x86/hyperv:".
There's some variation in the spelling of "hyperv", but let's go with the all
lowercase "hyperv".

> 
> Hyper-V exposes GHCB page via SEV ES GHCB MSR for SNP guest
> to communicate with hypervisor. Map GHCB page for all
> cpus to read/write MSR register and submit hvcall request
> via GHCB.
> 
> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> ---
>  arch/x86/hyperv/hv_init.c       | 66 +++++++++++++++++++++++++++++++--
>  arch/x86/include/asm/mshyperv.h |  2 +
>  include/asm-generic/mshyperv.h  |  2 +
>  3 files changed, 66 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 708a2712a516..0bb4d9ca7a55 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -20,6 +20,7 @@
>  #include <linux/kexec.h>
>  #include <linux/version.h>
>  #include <linux/vmalloc.h>
> +#include <linux/io.h>
>  #include <linux/mm.h>
>  #include <linux/hyperv.h>
>  #include <linux/slab.h>
> @@ -42,6 +43,31 @@ static void *hv_hypercall_pg_saved;
>  struct hv_vp_assist_page **hv_vp_assist_page;
>  EXPORT_SYMBOL_GPL(hv_vp_assist_page);
> 
> +static int hyperv_init_ghcb(void)
> +{
> +	u64 ghcb_gpa;
> +	void *ghcb_va;
> +	void **ghcb_base;
> +
> +	if (!ms_hyperv.ghcb_base)
> +		return -EINVAL;
> +
> +	/*
> +	 * GHCB page is allocated by paravisor. The address
> +	 * returned by MSR_AMD64_SEV_ES_GHCB is above shared
> +	 * ghcb boundary and map it here.
> +	 */
> +	rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);
> +	ghcb_va = memremap(ghcb_gpa, HV_HYP_PAGE_SIZE, MEMREMAP_WB);
> +	if (!ghcb_va)
> +		return -ENOMEM;
> +
> +	ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
> +	*ghcb_base = ghcb_va;
> +
> +	return 0;
> +}
> +
>  static int hv_cpu_init(unsigned int cpu)
>  {
>  	union hv_vp_assist_msr_contents msr = { 0 };
> @@ -85,6 +111,8 @@ static int hv_cpu_init(unsigned int cpu)
>  		}
>  	}
> 
> +	hyperv_init_ghcb();
> +
>  	return 0;
>  }
> 
> @@ -177,6 +205,14 @@ static int hv_cpu_die(unsigned int cpu)
>  {
>  	struct hv_reenlightenment_control re_ctrl;
>  	unsigned int new_cpu;
> +	void **ghcb_va = NULL;

I'm not seeing any reason why this needs to be initialized.

> +
> +	if (ms_hyperv.ghcb_base) {
> +		ghcb_va = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);
> +		if (*ghcb_va)
> +			memunmap(*ghcb_va);
> +		*ghcb_va = NULL;
> +	}
> 
>  	hv_common_cpu_die(cpu);
> 
> @@ -383,9 +419,19 @@ void __init hyperv_init(void)
>  			VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
>  			VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
>  			__builtin_return_address(0));
> -	if (hv_hypercall_pg == NULL) {
> -		wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
> -		goto remove_cpuhp_state;
> +	if (hv_hypercall_pg == NULL)
> +		goto clean_guest_os_id;
> +
> +	if (hv_isolation_type_snp()) {
> +		ms_hyperv.ghcb_base = alloc_percpu(void *);
> +		if (!ms_hyperv.ghcb_base)
> +			goto clean_guest_os_id;
> +
> +		if (hyperv_init_ghcb()) {
> +			free_percpu(ms_hyperv.ghcb_base);
> +			ms_hyperv.ghcb_base = NULL;
> +			goto clean_guest_os_id;
> +		}

Having the GHCB setup code here splits the hypercall page setup into
two parts, which is unexpected.  First the memory is allocated
for the hypercall page, then the GHCB stuff is done, then the hypercall
MSR is setup.  Is there a need to do this split?  Also, if the GHCB stuff
fails and you goto clean_guest_os_id, the memory allocated for the
hypercall page is never freed.

It's also unexpected to have hyperv_init_ghcb() called here and called
in hv_cpu_init().  Wouldn't it be possible to setup ghcb_base *before*
cpu_setup_state() is called, so that hv_cpu_init() would take care of
calling hyperv_init_ghcb() for the boot CPU?  That's the pattern used
by the VP assist page, the percpu input page, etc.
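A rough sketch of that ordering in hyperv_init() (illustrative only;
error handling is abbreviated and the hotplug state name follows the
existing hyperv_init() code):

	if (hv_isolation_type_snp()) {
		/*
		 * Allocate the per-cpu GHCB pointer before registering the
		 * hotplug callbacks, so that hv_cpu_init() maps the boot
		 * CPU's GHCB page just as it does for the other CPUs.
		 */
		ms_hyperv.ghcb_base = alloc_percpu(void *);
		if (!ms_hyperv.ghcb_base)
			goto free_vp_assist_page;
	}

	cpuhp = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
				  hv_cpu_init, hv_cpu_die);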

>  	}
> 
>  	rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> @@ -456,7 +502,8 @@ void __init hyperv_init(void)
>  	hv_query_ext_cap(0);
>  	return;
> 
> -remove_cpuhp_state:
> +clean_guest_os_id:
> +	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
>  	cpuhp_remove_state(cpuhp);
>  free_vp_assist_page:
>  	kfree(hv_vp_assist_page);
> @@ -484,6 +531,9 @@ void hyperv_cleanup(void)
>  	 */
>  	hv_hypercall_pg = NULL;
> 
> +	if (ms_hyperv.ghcb_base)
> +		free_percpu(ms_hyperv.ghcb_base);
> +

I don't think this cleanup is necessary.  The primary purpose of
hyperv_cleanup() is to ensure that things like overlay pages are
properly reset in Hyper-V before doing a kexec(), or before
panic'ing and running the kdump kernel.  There's no need to do
general memory free'ing in Linux.  Doing so just adds to the risk
that the panic path could itself fail.

>  	/* Reset the hypercall page */
>  	hypercall_msr.as_uint64 = 0;
>  	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> @@ -559,3 +609,11 @@ bool hv_is_isolation_supported(void)
>  {
>  	return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;
>  }
> +
> +DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
> +
> +bool hv_isolation_type_snp(void)
> +{
> +	return static_branch_unlikely(&isolation_type_snp);
> +}
> +EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index adccbc209169..6627cfd2bfba 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -11,6 +11,8 @@
>  #include <asm/paravirt.h>
>  #include <asm/mshyperv.h>
> 
> +DECLARE_STATIC_KEY_FALSE(isolation_type_snp);
> +
>  typedef int (*hyperv_fill_flush_list_func)(
>  		struct hv_guest_mapping_flush_list *flush,
>  		void *data);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index c1ab6a6e72b5..4269f3174e58 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -36,6 +36,7 @@ struct ms_hyperv_info {
>  	u32 max_lp_index;
>  	u32 isolation_config_a;
>  	u32 isolation_config_b;
> +	void  __percpu **ghcb_base;

This doesn't feel like the right place to put this pointer.  The other
fields in the ms_hyperv_info structure are just fixed values obtained
from the CPUID instruction.   The existing patterns similar to ghcb_base
are the VP assist page and the percpu input and output args.  They are
all based on standalone global variables.  It would be more consistent
to do the same with the ghcb_base.
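For example, a sketch of that pattern (the variable name here is
illustrative only):

	/* Standalone per-cpu GHCB pointer, analogous to hv_vp_assist_page. */
	void __percpu **hv_ghcb_pg;

	...
	hv_ghcb_pg = alloc_percpu(void *);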

>  };
>  extern struct ms_hyperv_info ms_hyperv;
> 
> @@ -237,6 +238,7 @@ bool hv_is_hyperv_initialized(void);
>  bool hv_is_hibernation_supported(void);
>  enum hv_isolation_type hv_get_isolation_type(void);
>  bool hv_is_isolation_supported(void);
> +bool hv_isolation_type_snp(void);
>  void hyperv_cleanup(void);
>  bool hv_query_ext_cap(u64 cap_query);
>  #else /* CONFIG_HYPERV */
> --
> 2.25.1
Michael Kelley Aug. 12, 2021, 7:18 p.m. UTC | #6
From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

> Subject: [PATCH V3 02/13] x86/HV: Initialize shared memory boundary in the Isolation VM.


As with Patch 1, use the "x86/hyperv:" tag in the Subject line.

> 

> From: Tianyu Lan <Tianyu.Lan@microsoft.com>

> 

> Hyper-V exposes shared memory boundary via cpuid

> HYPERV_CPUID_ISOLATION_CONFIG and store it in the

> shared_gpa_boundary of ms_hyperv struct. This prepares

> to share memory with host for SNP guest.

> 

> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>

> ---

>  arch/x86/kernel/cpu/mshyperv.c |  2 ++

>  include/asm-generic/mshyperv.h | 12 +++++++++++-

>  2 files changed, 13 insertions(+), 1 deletion(-)

> 

> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c

> index 6b5835a087a3..2b7f396ef1a5 100644

> --- a/arch/x86/kernel/cpu/mshyperv.c

> +++ b/arch/x86/kernel/cpu/mshyperv.c

> @@ -313,6 +313,8 @@ static void __init ms_hyperv_init_platform(void)

>  	if (ms_hyperv.priv_high & HV_ISOLATION) {

>  		ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG);

>  		ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG);

> +		ms_hyperv.shared_gpa_boundary =

> +			(u64)1 << ms_hyperv.shared_gpa_boundary_bits;


You could use BIT_ULL() here, but it's kind of a shrug.
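That would look like (sketch only):

	ms_hyperv.shared_gpa_boundary =
		BIT_ULL(ms_hyperv.shared_gpa_boundary_bits);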

> 

>  		pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",

>  			ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);

> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h

> index 4269f3174e58..aa26d24a5ca9 100644

> --- a/include/asm-generic/mshyperv.h

> +++ b/include/asm-generic/mshyperv.h

> @@ -35,8 +35,18 @@ struct ms_hyperv_info {

>  	u32 max_vp_index;

>  	u32 max_lp_index;

>  	u32 isolation_config_a;

> -	u32 isolation_config_b;

> +	union {

> +		u32 isolation_config_b;

> +		struct {

> +			u32 cvm_type : 4;

> +			u32 Reserved11 : 1;

> +			u32 shared_gpa_boundary_active : 1;

> +			u32 shared_gpa_boundary_bits : 6;

> +			u32 Reserved12 : 20;


Any reason to name the reserved fields as "11" and "12"?  It
just looks a bit unusual.  And I'd suggest lowercase "r".

> +		};

> +	};

>  	void  __percpu **ghcb_base;

> +	u64 shared_gpa_boundary;

>  };

>  extern struct ms_hyperv_info ms_hyperv;

> 

> --

> 2.25.1
Tianyu Lan Aug. 13, 2021, 3:46 p.m. UTC | #7
Hi Michael:
      Thanks for your review.

On 8/13/2021 3:14 AM, Michael Kelley wrote:
> From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

>> Subject: [PATCH V3 01/13] x86/HV: Initialize GHCB page in Isolation VM

> 

> The subject line tag on patches under arch/x86/hyperv is generally "x86/hyperv:".

> There's some variation in the spelling of "hyperv", but let's go with the all

> lowercase "hyperv".


OK. Will update.

> 

>>

>> Hyper-V exposes GHCB page via SEV ES GHCB MSR for SNP guest

>> to communicate with hypervisor. Map GHCB page for all

>> cpus to read/write MSR register and submit hvcall request

>> via GHCB.

>>

>> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>

>> ---

>>   arch/x86/hyperv/hv_init.c       | 66 +++++++++++++++++++++++++++++++--

>>   arch/x86/include/asm/mshyperv.h |  2 +

>>   include/asm-generic/mshyperv.h  |  2 +

>>   3 files changed, 66 insertions(+), 4 deletions(-)

>>

>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c

>> index 708a2712a516..0bb4d9ca7a55 100644

>> --- a/arch/x86/hyperv/hv_init.c

>> +++ b/arch/x86/hyperv/hv_init.c

>> @@ -20,6 +20,7 @@

>>   #include <linux/kexec.h>

>>   #include <linux/version.h>

>>   #include <linux/vmalloc.h>

>> +#include <linux/io.h>

>>   #include <linux/mm.h>

>>   #include <linux/hyperv.h>

>>   #include <linux/slab.h>

>> @@ -42,6 +43,31 @@ static void *hv_hypercall_pg_saved;

>>   struct hv_vp_assist_page **hv_vp_assist_page;

>>   EXPORT_SYMBOL_GPL(hv_vp_assist_page);

>>

>> +static int hyperv_init_ghcb(void)

>> +{

>> +	u64 ghcb_gpa;

>> +	void *ghcb_va;

>> +	void **ghcb_base;

>> +

>> +	if (!ms_hyperv.ghcb_base)

>> +		return -EINVAL;

>> +

>> +	/*

>> +	 * GHCB page is allocated by paravisor. The address

>> +	 * returned by MSR_AMD64_SEV_ES_GHCB is above shared

>> +	 * ghcb boundary and map it here.

>> +	 */

>> +	rdmsrl(MSR_AMD64_SEV_ES_GHCB, ghcb_gpa);

>> +	ghcb_va = memremap(ghcb_gpa, HV_HYP_PAGE_SIZE, MEMREMAP_WB);

>> +	if (!ghcb_va)

>> +		return -ENOMEM;

>> +

>> +	ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);

>> +	*ghcb_base = ghcb_va;

>> +

>> +	return 0;

>> +}

>> +

>>   static int hv_cpu_init(unsigned int cpu)

>>   {

>>   	union hv_vp_assist_msr_contents msr = { 0 };

>> @@ -85,6 +111,8 @@ static int hv_cpu_init(unsigned int cpu)

>>   		}

>>   	}

>>

>> +	hyperv_init_ghcb();

>> +

>>   	return 0;

>>   }

>>

>> @@ -177,6 +205,14 @@ static int hv_cpu_die(unsigned int cpu)

>>   {

>>   	struct hv_reenlightenment_control re_ctrl;

>>   	unsigned int new_cpu;

>> +	void **ghcb_va = NULL;

> 

> I'm not seeing any reason why this needs to be initialized.

> 

>> +

>> +	if (ms_hyperv.ghcb_base) {

>> +		ghcb_va = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);

>> +		if (*ghcb_va)

>> +			memunmap(*ghcb_va);

>> +		*ghcb_va = NULL;

>> +	}

>>

>>   	hv_common_cpu_die(cpu);

>>

>> @@ -383,9 +419,19 @@ void __init hyperv_init(void)

>>   			VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,

>>   			VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,

>>   			__builtin_return_address(0));

>> -	if (hv_hypercall_pg == NULL) {

>> -		wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);

>> -		goto remove_cpuhp_state;

>> +	if (hv_hypercall_pg == NULL)

>> +		goto clean_guest_os_id;

>> +

>> +	if (hv_isolation_type_snp()) {

>> +		ms_hyperv.ghcb_base = alloc_percpu(void *);

>> +		if (!ms_hyperv.ghcb_base)

>> +			goto clean_guest_os_id;

>> +

>> +		if (hyperv_init_ghcb()) {

>> +			free_percpu(ms_hyperv.ghcb_base);

>> +			ms_hyperv.ghcb_base = NULL;

>> +			goto clean_guest_os_id;

>> +		}

> 

> Having the GHCB setup code here splits the hypercall page setup into

> two parts, which is unexpected.  First the memory is allocated

> for the hypercall page, then the GHCB stuff is done, then the hypercall

> MSR is setup.  Is there a need to do this split?  Also, if the GHCB stuff

> fails and you goto clean_guest_os_id, the memory allocated for the

> hypercall page is never freed.



The hypercall is intentionally not enabled when GHCB setup fails;
otherwise we would need to disable the hypercall in that failure code path.

Yes, the hypercall page should be freed in the clean_guest_os_id path.

> 

> It's also unexpected to have hyperv_init_ghcb() called here and called

> in hv_cpu_init().  Wouldn't it be possible to setup ghcb_base *before*

> cpu_setup_state() is called, so that hv_cpu_init() would take care of

> calling hyperv_init_ghcb() for the boot CPU?  That's the pattern used

> by the VP assist page, the percpu input page, etc.


I will give it a try and report back. Thanks for the suggestion.

> 

>>   	}

>>

>>   	rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

>> @@ -456,7 +502,8 @@ void __init hyperv_init(void)

>>   	hv_query_ext_cap(0);

>>   	return;

>>

>> -remove_cpuhp_state:

>> +clean_guest_os_id:

>> +	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);

>>   	cpuhp_remove_state(cpuhp);

>>   free_vp_assist_page:

>>   	kfree(hv_vp_assist_page);

>> @@ -484,6 +531,9 @@ void hyperv_cleanup(void)

>>   	 */

>>   	hv_hypercall_pg = NULL;

>>

>> +	if (ms_hyperv.ghcb_base)

>> +		free_percpu(ms_hyperv.ghcb_base);

>> +

> 

> I don't think this cleanup is necessary.  The primary purpose of

> hyperv_cleanup() is to ensure that things like overlay pages are

> properly reset in Hyper-V before doing a kexec(), or before

> panic'ing and running the kdump kernel.  There's no need to do

> general memory free'ing in Linux.  Doing so just adds to the risk

> that the panic path could itself fail.


Nice catch. I will remove this.

> 

>>   	/* Reset the hypercall page */

>>   	hypercall_msr.as_uint64 = 0;

>>   	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

>> @@ -559,3 +609,11 @@ bool hv_is_isolation_supported(void)

>>   {

>>   	return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;

>>   }

>> +

>> +DEFINE_STATIC_KEY_FALSE(isolation_type_snp);

>> +

>> +bool hv_isolation_type_snp(void)

>> +{

>> +	return static_branch_unlikely(&isolation_type_snp);

>> +}

>> +EXPORT_SYMBOL_GPL(hv_isolation_type_snp);

>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h

>> index adccbc209169..6627cfd2bfba 100644

>> --- a/arch/x86/include/asm/mshyperv.h

>> +++ b/arch/x86/include/asm/mshyperv.h

>> @@ -11,6 +11,8 @@

>>   #include <asm/paravirt.h>

>>   #include <asm/mshyperv.h>

>>

>> +DECLARE_STATIC_KEY_FALSE(isolation_type_snp);

>> +

>>   typedef int (*hyperv_fill_flush_list_func)(

>>   		struct hv_guest_mapping_flush_list *flush,

>>   		void *data);

>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h

>> index c1ab6a6e72b5..4269f3174e58 100644

>> --- a/include/asm-generic/mshyperv.h

>> +++ b/include/asm-generic/mshyperv.h

>> @@ -36,6 +36,7 @@ struct ms_hyperv_info {

>>   	u32 max_lp_index;

>>   	u32 isolation_config_a;

>>   	u32 isolation_config_b;

>> +	void  __percpu **ghcb_base;

> 

> This doesn't feel like the right place to put this pointer.  The other

> fields in the ms_hyperv_info structure are just fixed values obtained

> from the CPUID instruction.   The existing patterns similar to ghcb_base

> are the VP assist page and the percpu input and output args.  They are

> all based on standalone global variables.  It would be more consistent

> to do the same with the ghcb_base.


OK. I will update in the next version.

> 

>>   };

>>   extern struct ms_hyperv_info ms_hyperv;

>>

>> @@ -237,6 +238,7 @@ bool hv_is_hyperv_initialized(void);

>>   bool hv_is_hibernation_supported(void);

>>   enum hv_isolation_type hv_get_isolation_type(void);

>>   bool hv_is_isolation_supported(void);

>> +bool hv_isolation_type_snp(void);

>>   void hyperv_cleanup(void);

>>   bool hv_query_ext_cap(u64 cap_query);

>>   #else /* CONFIG_HYPERV */

>> --

>> 2.25.1

>
Michael Kelley Aug. 13, 2021, 7:31 p.m. UTC | #8
From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

> Subject: [PATCH V3 05/13] HV: Add Write/Read MSR registers via ghcb page


See previous comments about tag in the Subject line.

> Hyper-V provides GHCB protocol to write Synthetic Interrupt

> Controller MSR registers in Isolation VM with AMD SEV SNP

> and these registers are emulated by hypervisor directly.

> Hyper-V requires to write SINTx MSR registers twice. First

> writes MSR via GHCB page to communicate with hypervisor

> and then writes wrmsr instruction to talk with paravisor

> which runs in VMPL0. Guest OS ID MSR also needs to be set

> via GHCB.

> 

> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>

> ---

> Change since v1:

>          * Introduce sev_es_ghcb_hv_call_simple() and share code

>            between SEV and Hyper-V code.

> ---

>  arch/x86/hyperv/hv_init.c       |  33 ++-------

>  arch/x86/hyperv/ivm.c           | 110 +++++++++++++++++++++++++++++

>  arch/x86/include/asm/mshyperv.h |  78 +++++++++++++++++++-

>  arch/x86/include/asm/sev.h      |   3 +

>  arch/x86/kernel/cpu/mshyperv.c  |   3 +

>  arch/x86/kernel/sev-shared.c    |  63 ++++++++++-------

>  drivers/hv/hv.c                 | 121 ++++++++++++++++++++++----------

>  include/asm-generic/mshyperv.h  |  12 +++-

>  8 files changed, 329 insertions(+), 94 deletions(-)

> 

> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c

> index b3683083208a..ab0b33f621e7 100644

> --- a/arch/x86/hyperv/hv_init.c

> +++ b/arch/x86/hyperv/hv_init.c

> @@ -423,7 +423,7 @@ void __init hyperv_init(void)

>  		goto clean_guest_os_id;

> 

>  	if (hv_isolation_type_snp()) {

> -		ms_hyperv.ghcb_base = alloc_percpu(void *);

> +		ms_hyperv.ghcb_base = alloc_percpu(union hv_ghcb __percpu *);


union hv_ghcb isn't defined.  It is not added until patch 6 of the series.

>  		if (!ms_hyperv.ghcb_base)

>  			goto clean_guest_os_id;

> 

> @@ -432,6 +432,9 @@ void __init hyperv_init(void)

>  			ms_hyperv.ghcb_base = NULL;

>  			goto clean_guest_os_id;

>  		}

> +

> +		/* Hyper-V requires to write guest os id via ghcb in SNP IVM. */

> +		hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);

>  	}

> 

>  	rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

> @@ -523,6 +526,7 @@ void hyperv_cleanup(void)

> 

>  	/* Reset our OS id */

>  	wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);

> +	hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);

> 

>  	/*

>  	 * Reset hypercall page reference before reset the page,

> @@ -596,30 +600,3 @@ bool hv_is_hyperv_initialized(void)

>  	return hypercall_msr.enable;

>  }

>  EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);

> -

> -enum hv_isolation_type hv_get_isolation_type(void)

> -{

> -	if (!(ms_hyperv.priv_high & HV_ISOLATION))

> -		return HV_ISOLATION_TYPE_NONE;

> -	return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);

> -}

> -EXPORT_SYMBOL_GPL(hv_get_isolation_type);

> -

> -bool hv_is_isolation_supported(void)

> -{

> -	if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))

> -		return 0;

> -

> -	if (!hypervisor_is_type(X86_HYPER_MS_HYPERV))

> -		return 0;

> -

> -	return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;

> -}

> -

> -DEFINE_STATIC_KEY_FALSE(isolation_type_snp);

> -

> -bool hv_isolation_type_snp(void)

> -{

> -	return static_branch_unlikely(&isolation_type_snp);

> -}

> -EXPORT_SYMBOL_GPL(hv_isolation_type_snp);

> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c

> index 8c905ffdba7f..ec0e5c259740 100644

> --- a/arch/x86/hyperv/ivm.c

> +++ b/arch/x86/hyperv/ivm.c

> @@ -6,6 +6,8 @@

>   *  Tianyu Lan <Tianyu.Lan@microsoft.com>

>   */

> 

> +#include <linux/types.h>

> +#include <linux/bitfield.h>

>  #include <linux/hyperv.h>

>  #include <linux/types.h>

>  #include <linux/bitfield.h>

> @@ -13,6 +15,114 @@

>  #include <asm/io.h>

>  #include <asm/mshyperv.h>

> 

> +void hv_ghcb_msr_write(u64 msr, u64 value)

> +{

> +	union hv_ghcb *hv_ghcb;

> +	void **ghcb_base;

> +	unsigned long flags;

> +

> +	if (!ms_hyperv.ghcb_base)

> +		return;

> +

> +	WARN_ON(in_nmi());

> +

> +	local_irq_save(flags);

> +	ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);

> +	hv_ghcb = (union hv_ghcb *)*ghcb_base;

> +	if (!hv_ghcb) {

> +		local_irq_restore(flags);

> +		return;

> +	}

> +

> +	ghcb_set_rcx(&hv_ghcb->ghcb, msr);

> +	ghcb_set_rax(&hv_ghcb->ghcb, lower_32_bits(value));

> +	ghcb_set_rdx(&hv_ghcb->ghcb, value >> 32);


Having used lower_32_bits() in the previous line, perhaps use
upper_32_bits() here?
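i.e., something along the lines of (sketch):

	ghcb_set_rdx(&hv_ghcb->ghcb, upper_32_bits(value));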

> +

> +	if (sev_es_ghcb_hv_call_simple(&hv_ghcb->ghcb, SVM_EXIT_MSR, 1, 0))

> +		pr_warn("Fail to write msr via ghcb %llx.\n", msr);

> +

> +	local_irq_restore(flags);

> +}

> +

> +void hv_ghcb_msr_read(u64 msr, u64 *value)

> +{

> +	union hv_ghcb *hv_ghcb;

> +	void **ghcb_base;

> +	unsigned long flags;

> +

> +	if (!ms_hyperv.ghcb_base)

> +		return;

> +

> +	WARN_ON(in_nmi());

> +

> +	local_irq_save(flags);

> +	ghcb_base = (void **)this_cpu_ptr(ms_hyperv.ghcb_base);

> +	hv_ghcb = (union hv_ghcb *)*ghcb_base;

> +	if (!hv_ghcb) {

> +		local_irq_restore(flags);

> +		return;

> +	}

> +

> +	ghcb_set_rcx(&hv_ghcb->ghcb, msr);

> +	if (sev_es_ghcb_hv_call_simple(&hv_ghcb->ghcb, SVM_EXIT_MSR, 0, 0))

> +		pr_warn("Fail to read msr via ghcb %llx.\n", msr);

> +	else

> +		*value = (u64)lower_32_bits(hv_ghcb->ghcb.save.rax)

> +			| ((u64)lower_32_bits(hv_ghcb->ghcb.save.rdx) << 32);

> +	local_irq_restore(flags);

> +}

> +

> +void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value)

> +{

> +	hv_ghcb_msr_read(msr, value);

> +}

> +EXPORT_SYMBOL_GPL(hv_sint_rdmsrl_ghcb);

> +

> +void hv_sint_wrmsrl_ghcb(u64 msr, u64 value)

> +{

> +	hv_ghcb_msr_write(msr, value);

> +

> +	/* Write proxy bit vua wrmsrl instruction. */


s/vua/via/

> +	if (msr >= HV_X64_MSR_SINT0 && msr <= HV_X64_MSR_SINT15)

> +		wrmsrl(msr, value | 1 << 20);

> +}

> +EXPORT_SYMBOL_GPL(hv_sint_wrmsrl_ghcb);

> +

> +void hv_signal_eom_ghcb(void)

> +{

> +	hv_sint_wrmsrl_ghcb(HV_X64_MSR_EOM, 0);

> +}

> +EXPORT_SYMBOL_GPL(hv_signal_eom_ghcb);


Is there a reason for this to be a function instead of a #define?
Seemingly parallel calls such as  hv_set_synic_state_ghcb()
are #defines.
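E.g. (sketch, mirroring the parallel macros):

	#define hv_signal_eom_ghcb() hv_sint_wrmsrl_ghcb(HV_X64_MSR_EOM, 0)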

> +

> +enum hv_isolation_type hv_get_isolation_type(void)

> +{

> +	if (!(ms_hyperv.priv_high & HV_ISOLATION))

> +		return HV_ISOLATION_TYPE_NONE;

> +	return FIELD_GET(HV_ISOLATION_TYPE, ms_hyperv.isolation_config_b);

> +}

> +EXPORT_SYMBOL_GPL(hv_get_isolation_type);

> +

> +/*

> + * hv_is_isolation_supported - Check system runs in the Hyper-V

> + * isolation VM.

> + */

> +bool hv_is_isolation_supported(void)

> +{

> +	return hv_get_isolation_type() != HV_ISOLATION_TYPE_NONE;

> +}

> +

> +DEFINE_STATIC_KEY_FALSE(isolation_type_snp);

> +

> +/*

> + * hv_isolation_type_snp - Check system runs in the AMD SEV-SNP based

> + * isolation VM.

> + */

> +bool hv_isolation_type_snp(void)

> +{

> +	return static_branch_unlikely(&isolation_type_snp);

> +}

> +EXPORT_SYMBOL_GPL(hv_isolation_type_snp);

> +


hv_isolation_type_snp() is implemented here in a file under arch/x86,
but it is called from architecture independent code in drivers/hv, so it
needs to do the right thing on ARM64 as well as x86.  For an example,
see the handling of hv_is_isolation_supported() in the latest linux-next
tree.

>  /*

>   * hv_mark_gpa_visibility - Set pages visible to host via hvcall.

>   *

> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h

> index 87a386fa97f7..730985676ea3 100644

> --- a/arch/x86/include/asm/mshyperv.h

> +++ b/arch/x86/include/asm/mshyperv.h

> @@ -30,6 +30,63 @@ static inline u64 hv_get_register(unsigned int reg)

>  	return value;

>  }

> 

> +#define hv_get_sint_reg(val, reg) {		\

> +	if (hv_isolation_type_snp())		\

> +		hv_get_##reg##_ghcb(&val);	\

> +	else					\

> +		rdmsrl(HV_X64_MSR_##reg, val);	\

> +	}

> +

> +#define hv_set_sint_reg(val, reg) {		\

> +	if (hv_isolation_type_snp())		\

> +		hv_set_##reg##_ghcb(val);	\

> +	else					\

> +		wrmsrl(HV_X64_MSR_##reg, val);	\

> +	}

> +

> +

> +#define hv_get_simp(val) hv_get_sint_reg(val, SIMP)

> +#define hv_get_siefp(val) hv_get_sint_reg(val, SIEFP)

> +

> +#define hv_set_simp(val) hv_set_sint_reg(val, SIMP)

> +#define hv_set_siefp(val) hv_set_sint_reg(val, SIEFP)

> +

> +#define hv_get_synic_state(val) {			\

> +	if (hv_isolation_type_snp())			\

> +		hv_get_synic_state_ghcb(&val);		\

> +	else						\

> +		rdmsrl(HV_X64_MSR_SCONTROL, val);	\


Many of these registers that exist on x86 and ARM64 architectures
have new names without the "X64_MSR" portion.  For
example, include/asm-generic/hyperv-tlfs.h defines
HV_REGISTER_SCONTROL.  The x86-specific version of 
hyperv-tlfs.h currently has a #define for HV_X64_MSR_SCONTROL,
but we would like to get rid of these temporary aliases.
So prefer to use HV_REGISTER_SCONTROL.

Same comment applies several places in this code for other
similar registers.
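In the macros above that would be, e.g. (sketch, relying on the existing
HV_REGISTER_SCONTROL definition):

	rdmsrl(HV_REGISTER_SCONTROL, val);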

> +	}

> +#define hv_set_synic_state(val) {			\

> +	if (hv_isolation_type_snp())			\

> +		hv_set_synic_state_ghcb(val);		\

> +	else						\

> +		wrmsrl(HV_X64_MSR_SCONTROL, val);	\

> +	}

> +

> +#define hv_get_vp_index(index) rdmsrl(HV_X64_MSR_VP_INDEX, index)

> +

> +#define hv_signal_eom() {			 \

> +	if (hv_isolation_type_snp() &&		 \

> +	    old_msg_type != HVMSG_TIMER_EXPIRED) \


Why is a test of the old_msg_type embedded in this macro?
And given that old_msg_type isn't a parameter of the
macro, its use is really weird.

> +		hv_signal_eom_ghcb();		 \

> +	else					 \

> +		wrmsrl(HV_X64_MSR_EOM, 0);	 \

> +	}

> +

> +#define hv_get_synint_state(int_num, val) {		\

> +	if (hv_isolation_type_snp())			\

> +		hv_get_synint_state_ghcb(int_num, &val);\

> +	else						\

> +		rdmsrl(HV_X64_MSR_SINT0 + int_num, val);\

> +	}

> +#define hv_set_synint_state(int_num, val) {		\

> +	if (hv_isolation_type_snp())			\

> +		hv_set_synint_state_ghcb(int_num, val);	\

> +	else						\

> +		wrmsrl(HV_X64_MSR_SINT0 + int_num, val);\

> +	}

> +

>  #define hv_get_raw_timer() rdtsc_ordered()

> 

>  void hyperv_vector_handler(struct pt_regs *regs);

> @@ -193,6 +250,25 @@ int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);

>  int hv_mark_gpa_visibility(u16 count, const u64 pfn[],

>  			   enum hv_mem_host_visibility visibility);

>  int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible);

> +void hv_sint_wrmsrl_ghcb(u64 msr, u64 value);

> +void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value);

> +void hv_signal_eom_ghcb(void);

> +void hv_ghcb_msr_write(u64 msr, u64 value);

> +void hv_ghcb_msr_read(u64 msr, u64 *value);

> +

> +#define hv_get_synint_state_ghcb(int_num, val)			\

> +	hv_sint_rdmsrl_ghcb(HV_X64_MSR_SINT0 + int_num, val)

> +#define hv_set_synint_state_ghcb(int_num, val) \

> +	hv_sint_wrmsrl_ghcb(HV_X64_MSR_SINT0 + int_num, val)

> +

> +#define hv_get_SIMP_ghcb(val) hv_sint_rdmsrl_ghcb(HV_X64_MSR_SIMP, val)

> +#define hv_set_SIMP_ghcb(val) hv_sint_wrmsrl_ghcb(HV_X64_MSR_SIMP, val)

> +

> +#define hv_get_SIEFP_ghcb(val) hv_sint_rdmsrl_ghcb(HV_X64_MSR_SIEFP, val)

> +#define hv_set_SIEFP_ghcb(val) hv_sint_wrmsrl_ghcb(HV_X64_MSR_SIEFP, val)

> +

> +#define hv_get_synic_state_ghcb(val) hv_sint_rdmsrl_ghcb(HV_X64_MSR_SCONTROL, val)

> +#define hv_set_synic_state_ghcb(val) hv_sint_wrmsrl_ghcb(HV_X64_MSR_SCONTROL, val)

>  #else /* CONFIG_HYPERV */

>  static inline void hyperv_init(void) {}

>  static inline void hyperv_setup_mmu_ops(void) {}

> @@ -209,9 +285,9 @@ static inline int hyperv_flush_guest_mapping_range(u64 as,

>  {

>  	return -1;

>  }

> +static inline void hv_signal_eom_ghcb(void) { };

>  #endif /* CONFIG_HYPERV */

> 

> -

>  #include <asm-generic/mshyperv.h>

> 

>  #endif

> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h

> index fa5cd05d3b5b..81beb2a8031b 100644

> --- a/arch/x86/include/asm/sev.h

> +++ b/arch/x86/include/asm/sev.h

> @@ -81,6 +81,9 @@ static __always_inline void sev_es_nmi_complete(void)

>  		__sev_es_nmi_complete();

>  }

>  extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);

> +extern enum es_result sev_es_ghcb_hv_call_simple(struct ghcb *ghcb,

> +				   u64 exit_code, u64 exit_info_1,

> +				   u64 exit_info_2);

>  #else

>  static inline void sev_es_ist_enter(struct pt_regs *regs) { }

>  static inline void sev_es_ist_exit(void) { }

> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c

> index 2b7f396ef1a5..3633f871ac1e 100644

> --- a/arch/x86/kernel/cpu/mshyperv.c

> +++ b/arch/x86/kernel/cpu/mshyperv.c

> @@ -318,6 +318,9 @@ static void __init ms_hyperv_init_platform(void)

> 

>  		pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",

>  			ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);

> +

> +		if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)

> +			static_branch_enable(&isolation_type_snp);

>  	}

> 

>  	if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {

> diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c

> index 9f90f460a28c..dd7f37de640b 100644

> --- a/arch/x86/kernel/sev-shared.c

> +++ b/arch/x86/kernel/sev-shared.c

> @@ -94,10 +94,9 @@ static void vc_finish_insn(struct es_em_ctxt *ctxt)

>  	ctxt->regs->ip += ctxt->insn.length;

>  }

> 

> -static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,

> -					  struct es_em_ctxt *ctxt,

> -					  u64 exit_code, u64 exit_info_1,

> -					  u64 exit_info_2)

> +enum es_result sev_es_ghcb_hv_call_simple(struct ghcb *ghcb,

> +				   u64 exit_code, u64 exit_info_1,

> +				   u64 exit_info_2)

>  {

>  	enum es_result ret;

> 

> @@ -109,29 +108,45 @@ static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,

>  	ghcb_set_sw_exit_info_1(ghcb, exit_info_1);

>  	ghcb_set_sw_exit_info_2(ghcb, exit_info_2);

> 

> -	sev_es_wr_ghcb_msr(__pa(ghcb));

>  	VMGEXIT();

> 

> -	if ((ghcb->save.sw_exit_info_1 & 0xffffffff) == 1) {

> -		u64 info = ghcb->save.sw_exit_info_2;

> -		unsigned long v;

> -

> -		info = ghcb->save.sw_exit_info_2;

> -		v = info & SVM_EVTINJ_VEC_MASK;

> -

> -		/* Check if exception information from hypervisor is sane. */

> -		if ((info & SVM_EVTINJ_VALID) &&

> -		    ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&

> -		    ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {

> -			ctxt->fi.vector = v;

> -			if (info & SVM_EVTINJ_VALID_ERR)

> -				ctxt->fi.error_code = info >> 32;

> -			ret = ES_EXCEPTION;

> -		} else {

> -			ret = ES_VMM_ERROR;

> -		}

> -	} else {

> +	if ((ghcb->save.sw_exit_info_1 & 0xffffffff) == 1)

> +		ret = ES_VMM_ERROR;

> +	else

>  		ret = ES_OK;

> +

> +	return ret;

> +}

> +

> +static enum es_result sev_es_ghcb_hv_call(struct ghcb *ghcb,

> +				   struct es_em_ctxt *ctxt,

> +				   u64 exit_code, u64 exit_info_1,

> +				   u64 exit_info_2)

> +{

> +	unsigned long v;

> +	enum es_result ret;

> +	u64 info;

> +

> +	sev_es_wr_ghcb_msr(__pa(ghcb));

> +

> +	ret = sev_es_ghcb_hv_call_simple(ghcb, exit_code, exit_info_1,

> +					 exit_info_2);

> +	if (ret == ES_OK)

> +		return ret;

> +

> +	info = ghcb->save.sw_exit_info_2;

> +	v = info & SVM_EVTINJ_VEC_MASK;

> +

> +	/* Check if exception information from hypervisor is sane. */

> +	if ((info & SVM_EVTINJ_VALID) &&

> +	    ((v == X86_TRAP_GP) || (v == X86_TRAP_UD)) &&

> +	    ((info & SVM_EVTINJ_TYPE_MASK) == SVM_EVTINJ_TYPE_EXEPT)) {

> +		ctxt->fi.vector = v;

> +		if (info & SVM_EVTINJ_VALID_ERR)

> +			ctxt->fi.error_code = info >> 32;

> +		ret = ES_EXCEPTION;

> +	} else {

> +		ret = ES_VMM_ERROR;

>  	}

> 

>  	return ret;

> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c

> index e83507f49676..59f7173c4d9f 100644

> --- a/drivers/hv/hv.c

> +++ b/drivers/hv/hv.c

> @@ -8,6 +8,7 @@

>   */

>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

> 

> +#include <linux/io.h>

>  #include <linux/kernel.h>

>  #include <linux/mm.h>

>  #include <linux/slab.h>

> @@ -136,17 +137,24 @@ int hv_synic_alloc(void)

>  		tasklet_init(&hv_cpu->msg_dpc,

>  			     vmbus_on_msg_dpc, (unsigned long) hv_cpu);

> 

> -		hv_cpu->synic_message_page =

> -			(void *)get_zeroed_page(GFP_ATOMIC);

> -		if (hv_cpu->synic_message_page == NULL) {

> -			pr_err("Unable to allocate SYNIC message page\n");

> -			goto err;

> -		}

> +		/*

> +		 * Synic message and event pages are allocated by paravisor.

> +		 * Skip these pages allocation here.

> +		 */

> +		if (!hv_isolation_type_snp()) {

> +			hv_cpu->synic_message_page =

> +				(void *)get_zeroed_page(GFP_ATOMIC);

> +			if (hv_cpu->synic_message_page == NULL) {

> +				pr_err("Unable to allocate SYNIC message page\n");

> +				goto err;

> +			}

> 

> -		hv_cpu->synic_event_page = (void *)get_zeroed_page(GFP_ATOMIC);

> -		if (hv_cpu->synic_event_page == NULL) {

> -			pr_err("Unable to allocate SYNIC event page\n");

> -			goto err;

> +			hv_cpu->synic_event_page =

> +				(void *)get_zeroed_page(GFP_ATOMIC);

> +			if (hv_cpu->synic_event_page == NULL) {

> +				pr_err("Unable to allocate SYNIC event page\n");

> +				goto err;

> +			}

>  		}

> 

>  		hv_cpu->post_msg_page = (void *)get_zeroed_page(GFP_ATOMIC);

> @@ -173,10 +181,17 @@ void hv_synic_free(void)

>  	for_each_present_cpu(cpu) {

>  		struct hv_per_cpu_context *hv_cpu

>  			= per_cpu_ptr(hv_context.cpu_context, cpu);

> +		free_page((unsigned long)hv_cpu->post_msg_page);

> +

> +		/*

> +		 * Synic message and event pages are allocated by paravisor.

> +		 * Skip free these pages here.

> +		 */

> +		if (hv_isolation_type_snp())

> +			continue;

> 

>  		free_page((unsigned long)hv_cpu->synic_event_page);

>  		free_page((unsigned long)hv_cpu->synic_message_page);

> -		free_page((unsigned long)hv_cpu->post_msg_page);


You could skip making these changes to hv_synic_free().  If the message
and event pages aren't allocated, the pointers will be NULL and
free_page() will happily do nothing.

>  	}

> 

>  	kfree(hv_context.hv_numa_map);

> @@ -199,26 +214,43 @@ void hv_synic_enable_regs(unsigned int cpu)

>  	union hv_synic_scontrol sctrl;

> 

>  	/* Setup the Synic's message page */

> -	simp.as_uint64 = hv_get_register(HV_REGISTER_SIMP);

> +	hv_get_simp(simp.as_uint64);


This code is intended to be architecture independent and builds for
x86 and for ARM64.  Changing the use of hv_get_register() and hv_set_register()
will fail badly when built for ARM64.   I haven't completely thought through
what the best solution might be, but the current set of mappings from
hv_get_simp() down to hv_ghcb_msr_read() isn't going to work on ARM64.

Is it possible to hide all the x86-side complexity in the implementation of
hv_get_register()?   Certain MSRs would have to be special-cased when
SNP isolation is enabled, but that might be easier than trying to no-op out
the ghcb machinery on the ARM64 side. 
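One possible shape for that (rough sketch; hv_is_synic_msr() is a
hypothetical helper name, not something defined in this series):

	static inline u64 hv_get_register(unsigned int reg)
	{
		u64 value;

		/* Special-case SynIC MSRs that must go through the GHCB in SNP VMs. */
		if (hv_isolation_type_snp() && hv_is_synic_msr(reg))
			hv_ghcb_msr_read(reg, &value);
		else
			rdmsrl(reg, value);
		return value;
	}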

>  	simp.simp_enabled = 1;

> -	simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)

> -		>> HV_HYP_PAGE_SHIFT;

> 

> -	hv_set_register(HV_REGISTER_SIMP, simp.as_uint64);

> +	if (hv_isolation_type_snp()) {

> +		hv_cpu->synic_message_page

> +			= memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,

> +				   HV_HYP_PAGE_SIZE, MEMREMAP_WB);

> +		if (!hv_cpu->synic_message_page)

> +			pr_err("Fail to map syinc message page.\n");

> +	} else {

> +		simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)

> +			>> HV_HYP_PAGE_SHIFT;

> +	}

> +

> +	hv_set_simp(simp.as_uint64);

> 

>  	/* Setup the Synic's event page */

> -	siefp.as_uint64 = hv_get_register(HV_REGISTER_SIEFP);

> +	hv_get_siefp(siefp.as_uint64);

>  	siefp.siefp_enabled = 1;

> -	siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)

> -		>> HV_HYP_PAGE_SHIFT;

> 

> -	hv_set_register(HV_REGISTER_SIEFP, siefp.as_uint64);

> +	if (hv_isolation_type_snp()) {

> +		hv_cpu->synic_event_page =

> +			memremap(siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT,

> +				 HV_HYP_PAGE_SIZE, MEMREMAP_WB);

> +

> +		if (!hv_cpu->synic_event_page)

> +			pr_err("Fail to map syinc event page.\n");

> +	} else {

> +		siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)

> +			>> HV_HYP_PAGE_SHIFT;

> +	}

> +	hv_set_siefp(siefp.as_uint64);

> 

>  	/* Setup the shared SINT. */

>  	if (vmbus_irq != -1)

>  		enable_percpu_irq(vmbus_irq, 0);

> -	shared_sint.as_uint64 = hv_get_register(HV_REGISTER_SINT0 +

> -					VMBUS_MESSAGE_SINT);

> +	hv_get_synint_state(VMBUS_MESSAGE_SINT, shared_sint.as_uint64);

> 

>  	shared_sint.vector = vmbus_interrupt;

>  	shared_sint.masked = false;

> @@ -233,14 +265,12 @@ void hv_synic_enable_regs(unsigned int cpu)

>  #else

>  	shared_sint.auto_eoi = 0;

>  #endif

> -	hv_set_register(HV_REGISTER_SINT0 + VMBUS_MESSAGE_SINT,

> -				shared_sint.as_uint64);

> +	hv_set_synint_state(VMBUS_MESSAGE_SINT, shared_sint.as_uint64);

> 

>  	/* Enable the global synic bit */

> -	sctrl.as_uint64 = hv_get_register(HV_REGISTER_SCONTROL);

> +	hv_get_synic_state(sctrl.as_uint64);

>  	sctrl.enable = 1;

> -

> -	hv_set_register(HV_REGISTER_SCONTROL, sctrl.as_uint64);

> +	hv_set_synic_state(sctrl.as_uint64);

>  }

> 

>  int hv_synic_init(unsigned int cpu)

> @@ -257,37 +287,50 @@ int hv_synic_init(unsigned int cpu)

>   */

>  void hv_synic_disable_regs(unsigned int cpu)

>  {

> +	struct hv_per_cpu_context *hv_cpu

> +		= per_cpu_ptr(hv_context.cpu_context, cpu);

>  	union hv_synic_sint shared_sint;

>  	union hv_synic_simp simp;

>  	union hv_synic_siefp siefp;

>  	union hv_synic_scontrol sctrl;

> 

> -	shared_sint.as_uint64 = hv_get_register(HV_REGISTER_SINT0 +

> -					VMBUS_MESSAGE_SINT);

> -

> +	hv_get_synint_state(VMBUS_MESSAGE_SINT, shared_sint.as_uint64);

>  	shared_sint.masked = 1;

> +	hv_set_synint_state(VMBUS_MESSAGE_SINT, shared_sint.as_uint64);

> +

> 

>  	/* Need to correctly cleanup in the case of SMP!!! */

>  	/* Disable the interrupt */

> -	hv_set_register(HV_REGISTER_SINT0 + VMBUS_MESSAGE_SINT,

> -				shared_sint.as_uint64);

> +	hv_get_simp(simp.as_uint64);

> 

> -	simp.as_uint64 = hv_get_register(HV_REGISTER_SIMP);

> +	/*

> +	 * In Isolation VM, sim and sief pages are allocated by

> +	 * paravisor. These pages also will be used by kdump

> +	 * kernel. So just reset enable bit here and keep page

> +	 * addresses.

> +	 */

>  	simp.simp_enabled = 0;

> -	simp.base_simp_gpa = 0;

> +	if (hv_isolation_type_snp())

> +		memunmap(hv_cpu->synic_message_page);

> +	else

> +		simp.base_simp_gpa = 0;

> 

> -	hv_set_register(HV_REGISTER_SIMP, simp.as_uint64);

> +	hv_set_simp(simp.as_uint64);

> 

> -	siefp.as_uint64 = hv_get_register(HV_REGISTER_SIEFP);

> +	hv_get_siefp(siefp.as_uint64);

>  	siefp.siefp_enabled = 0;

> -	siefp.base_siefp_gpa = 0;

> 

> -	hv_set_register(HV_REGISTER_SIEFP, siefp.as_uint64);

> +	if (hv_isolation_type_snp())

> +		memunmap(hv_cpu->synic_event_page);

> +	else

> +		siefp.base_siefp_gpa = 0;

> +

> +	hv_set_siefp(siefp.as_uint64);

> 

>  	/* Disable the global synic bit */

> -	sctrl.as_uint64 = hv_get_register(HV_REGISTER_SCONTROL);

> +	hv_get_synic_state(sctrl.as_uint64);

>  	sctrl.enable = 0;

> -	hv_set_register(HV_REGISTER_SCONTROL, sctrl.as_uint64);

> +	hv_set_synic_state(sctrl.as_uint64);

> 

>  	if (vmbus_irq != -1)

>  		disable_percpu_irq(vmbus_irq);

> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h

> index 079988ed45b9..90dac369a2dc 100644

> --- a/include/asm-generic/mshyperv.h

> +++ b/include/asm-generic/mshyperv.h

> @@ -23,9 +23,16 @@

>  #include <linux/bitops.h>

>  #include <linux/cpumask.h>

>  #include <linux/nmi.h>

> +#include <asm/svm.h>

> +#include <asm/sev.h>

>  #include <asm/ptrace.h>

> +#include <asm/mshyperv.h>

>  #include <asm/hyperv-tlfs.h>

> 

> +union hv_ghcb {

> +	struct ghcb ghcb;

> +} __packed __aligned(PAGE_SIZE);

> +

>  struct ms_hyperv_info {

>  	u32 features;

>  	u32 priv_high;

> @@ -45,7 +52,7 @@ struct ms_hyperv_info {

>  			u32 Reserved12 : 20;

>  		};

>  	};

> -	void  __percpu **ghcb_base;

> +	union hv_ghcb __percpu **ghcb_base;

>  	u64 shared_gpa_boundary;

>  };

>  extern struct ms_hyperv_info ms_hyperv;

> @@ -55,6 +62,7 @@ extern void  __percpu  **hyperv_pcpu_output_arg;

> 

>  extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);

>  extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);

> +extern bool hv_isolation_type_snp(void);

> 

>  /* Helper functions that provide a consistent pattern for checking Hyper-V hypercall status. */

>  static inline int hv_result(u64 status)

> @@ -149,7 +157,7 @@ static inline void vmbus_signal_eom(struct hv_message *msg, u32 old_msg_type)

>  		 * possibly deliver another msg from the

>  		 * hypervisor

>  		 */

> -		hv_set_register(HV_REGISTER_EOM, 0);

> +		hv_signal_eom();

>  	}

>  }

> 

> --

> 2.25.1
Michael Kelley Aug. 13, 2021, 8:26 p.m. UTC | #9
From: Michael Kelley <mikelley@microsoft.com> Sent: Friday, August 13, 2021 12:31 PM

> To: Tianyu Lan <ltykernel@gmail.com>; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;

> Stephen Hemminger <sthemmin@microsoft.com>; wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>;

> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de; x86@kernel.org; hpa@zytor.com; dave.hansen@linux.intel.com;

> luto@kernel.org; peterz@infradead.org; konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; jgross@suse.com;

> sstabellini@kernel.org; joro@8bytes.org; will@kernel.org; davem@davemloft.net; kuba@kernel.org; jejb@linux.ibm.com;

> martin.petersen@oracle.com; arnd@arndb.de; hch@lst.de; m.szyprowski@samsung.com; robin.murphy@arm.com;

> thomas.lendacky@amd.com; brijesh.singh@amd.com; ardb@kernel.org; Tianyu Lan <Tianyu.Lan@microsoft.com>;

> pgonda@google.com; martin.b.radev@gmail.com; akpm@linux-foundation.org; kirill.shutemov@linux.intel.com;

> rppt@kernel.org; sfr@canb.auug.org.au; saravanand@fb.com; krish.sadhukhan@oracle.com;

> aneesh.kumar@linux.ibm.com; xen-devel@lists.xenproject.org; rientjes@google.com; hannes@cmpxchg.org;

> tj@kernel.org

> Cc: iommu@lists.linux-foundation.org; linux-arch@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-

> kernel@vger.kernel.org; linux-scsi@vger.kernel.org; netdev@vger.kernel.org; vkuznets <vkuznets@redhat.com>;

> parri.andrea@gmail.com; dave.hansen@intel.com

> Subject: RE: [PATCH V3 05/13] HV: Add Write/Read MSR registers via ghcb page

> 

> From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

> > Subject: [PATCH V3 05/13] HV: Add Write/Read MSR registers via ghcb page

> 

> See previous comments about tag in the Subject line.

> 

> > Hyper-V provides GHCB protocol to write Synthetic Interrupt

> > Controller MSR registers in Isolation VM with AMD SEV SNP

> > and these registers are emulated by hypervisor directly.

> > Hyper-V requires to write SINTx MSR registers twice. First

> > writes MSR via GHCB page to communicate with hypervisor

> > and then writes wrmsr instruction to talk with paravisor

> > which runs in VMPL0. Guest OS ID MSR also needs to be set

> > via GHCB.

> >

> > Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>

> > ---

> > Change since v1:

> >          * Introduce sev_es_ghcb_hv_call_simple() and share code

> >            between SEV and Hyper-V code.

> > ---

> >  arch/x86/hyperv/hv_init.c       |  33 ++-------

> >  arch/x86/hyperv/ivm.c           | 110 +++++++++++++++++++++++++++++

> >  arch/x86/include/asm/mshyperv.h |  78 +++++++++++++++++++-

> >  arch/x86/include/asm/sev.h      |   3 +

> >  arch/x86/kernel/cpu/mshyperv.c  |   3 +

> >  arch/x86/kernel/sev-shared.c    |  63 ++++++++++-------

> >  drivers/hv/hv.c                 | 121 ++++++++++++++++++++++----------

> >  include/asm-generic/mshyperv.h  |  12 +++-

> >  8 files changed, 329 insertions(+), 94 deletions(-)

> >

> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c

> > index b3683083208a..ab0b33f621e7 100644

> > --- a/arch/x86/hyperv/hv_init.c

> > +++ b/arch/x86/hyperv/hv_init.c

> > @@ -423,7 +423,7 @@ void __init hyperv_init(void)

> >  		goto clean_guest_os_id;

> >

> >  	if (hv_isolation_type_snp()) {

> > -		ms_hyperv.ghcb_base = alloc_percpu(void *);

> > +		ms_hyperv.ghcb_base = alloc_percpu(union hv_ghcb __percpu *);

> 

> union hv_ghcb isn't defined.  It is not added until patch 6 of the series.

> 


Ignore this comment.  My mistake.

Michael
Tianyu Lan Aug. 14, 2021, 1:32 p.m. UTC | #10
On 8/13/2021 3:18 AM, Michael Kelley wrote:
> From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

>> Subject: [PATCH V3 02/13] x86/HV: Initialize shared memory boundary in the Isolation VM.

> 

> As with Patch 1, use the "x86/hyperv:" tag in the Subject line.

> 

>>

>> From: Tianyu Lan <Tianyu.Lan@microsoft.com>

>>

>> Hyper-V exposes shared memory boundary via cpuid

>> HYPERV_CPUID_ISOLATION_CONFIG and store it in the

>> shared_gpa_boundary of ms_hyperv struct. This prepares

>> to share memory with host for SNP guest.

>>

>> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>

>> ---

>>   arch/x86/kernel/cpu/mshyperv.c |  2 ++

>>   include/asm-generic/mshyperv.h | 12 +++++++++++-

>>   2 files changed, 13 insertions(+), 1 deletion(-)

>>

>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c

>> index 6b5835a087a3..2b7f396ef1a5 100644

>> --- a/arch/x86/kernel/cpu/mshyperv.c

>> +++ b/arch/x86/kernel/cpu/mshyperv.c

>> @@ -313,6 +313,8 @@ static void __init ms_hyperv_init_platform(void)

>>   	if (ms_hyperv.priv_high & HV_ISOLATION) {

>>   		ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG);

>>   		ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG);

>> +		ms_hyperv.shared_gpa_boundary =

>> +			(u64)1 << ms_hyperv.shared_gpa_boundary_bits;

> 

> You could use BIT_ULL() here, but it's kind of a shrug.



Good suggestion. Thanks.

> 

>>

>>   		pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",

>>   			ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);

>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h

>> index 4269f3174e58..aa26d24a5ca9 100644

>> --- a/include/asm-generic/mshyperv.h

>> +++ b/include/asm-generic/mshyperv.h

>> @@ -35,8 +35,18 @@ struct ms_hyperv_info {

>>   	u32 max_vp_index;

>>   	u32 max_lp_index;

>>   	u32 isolation_config_a;

>> -	u32 isolation_config_b;

>> +	union {

>> +		u32 isolation_config_b;

>> +		struct {

>> +			u32 cvm_type : 4;

>> +			u32 Reserved11 : 1;

>> +			u32 shared_gpa_boundary_active : 1;

>> +			u32 shared_gpa_boundary_bits : 6;

>> +			u32 Reserved12 : 20;

> 

> Any reason to name the reserved fields as "11" and "12"?  It

> just looks a bit unusual.  And I'd suggest lowercase "r".

> 


Yes, will update in the next version.

>> +		};

>> +	};

>>   	void  __percpu **ghcb_base;

>> +	u64 shared_gpa_boundary;

>>   };

>>   extern struct ms_hyperv_info ms_hyperv;

>>

>> --

>> 2.25.1

>
Michael Kelley Aug. 16, 2021, 2:55 p.m. UTC | #11
From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

> 

> Hyper-V provides two kinds of Isolation VMs. VBS(Virtualization-based

> security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset

> is to add support for these Isolation VM support in Linux.

> 


A general comment about this series:  I have not seen any statements
made about whether either type of Isolated VM is supported for 32-bit
Linux guests.   arch/x86/Kconfig has CONFIG_AMD_MEM_ENCRYPT as
64-bit only, so evidently SEV-SNP Isolated VMs would be 64-bit only.
But I don't know if VBS VMs are any different.

I didn't track down what happens if a 32-bit Linux is booted in
a VM that supports SEV-SNP.  Presumably some kind of message
is output that no encryption is being done.  But at a slightly
higher level, the Hyper-V initialization path should probably
also check for 32-bit and output a clear message that no isolation
is being provided.  At that point, I don't know if it is possible to
continue in non-isolated mode or whether the only choice is to
panic.  Continuing in non-isolated mode might be a bad idea
anyway since presumably the user has explicitly requested an
Isolated VM.

Related, I noticed usage of "unsigned long" for holding physical
addresses, which works when running 64-bit, but not when running
32-bit.  But even if Isolated VMs are always 64-bit, it would still be
better to clean this up and use phys_addr_t instead.  Unfortunately,
more generic functions like set_memory_encrypted() and
set_memory_decrypted() have physical address arguments that
are of type unsigned long.

Michael
Michael Kelley Aug. 16, 2021, 5:28 p.m. UTC | #12
From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

> 

> VMbus ring buffer are shared with host and it's need to


s/it's need/it needs/

> be accessed via extra address space of Isolation VM with

> SNP support. This patch is to map the ring buffer

> address in extra address space via ioremap(). HV host


It's actually using vmap_pfn(), not ioremap().

> visibility hvcall smears data in the ring buffer and

> so reset the ring buffer memory to zero after calling

> visibility hvcall.

> 

> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>

> ---

>  drivers/hv/Kconfig        |  1 +

>  drivers/hv/channel.c      | 10 +++++

>  drivers/hv/hyperv_vmbus.h |  2 +

>  drivers/hv/ring_buffer.c  | 84 ++++++++++++++++++++++++++++++---------

>  4 files changed, 79 insertions(+), 18 deletions(-)

> 

> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig

> index d1123ceb38f3..dd12af20e467 100644

> --- a/drivers/hv/Kconfig

> +++ b/drivers/hv/Kconfig

> @@ -8,6 +8,7 @@ config HYPERV

>  		|| (ARM64 && !CPU_BIG_ENDIAN))

>  	select PARAVIRT

>  	select X86_HV_CALLBACK_VECTOR if X86

> +	select VMAP_PFN

>  	help

>  	  Select this option to run Linux as a Hyper-V client operating

>  	  system.

> diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c

> index 4c4717c26240..60ef881a700c 100644

> --- a/drivers/hv/channel.c

> +++ b/drivers/hv/channel.c

> @@ -712,6 +712,16 @@ static int __vmbus_open(struct vmbus_channel *newchannel,

>  	if (err)

>  		goto error_clean_ring;

> 

> +	err = hv_ringbuffer_post_init(&newchannel->outbound,

> +				      page, send_pages);

> +	if (err)

> +		goto error_free_gpadl;

> +

> +	err = hv_ringbuffer_post_init(&newchannel->inbound,

> +				      &page[send_pages], recv_pages);

> +	if (err)

> +		goto error_free_gpadl;

> +

>  	/* Create and init the channel open message */

>  	open_info = kzalloc(sizeof(*open_info) +

>  			   sizeof(struct vmbus_channel_open_channel),

> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h

> index 40bc0eff6665..15cd23a561f3 100644

> --- a/drivers/hv/hyperv_vmbus.h

> +++ b/drivers/hv/hyperv_vmbus.h

> @@ -172,6 +172,8 @@ extern int hv_synic_cleanup(unsigned int cpu);

>  /* Interface */

> 

>  void hv_ringbuffer_pre_init(struct vmbus_channel *channel);

> +int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,

> +		struct page *pages, u32 page_cnt);

> 

>  int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,

>  		       struct page *pages, u32 pagecnt, u32 max_pkt_size);

> diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c

> index 2aee356840a2..d4f93fca1108 100644

> --- a/drivers/hv/ring_buffer.c

> +++ b/drivers/hv/ring_buffer.c

> @@ -17,6 +17,8 @@

>  #include <linux/vmalloc.h>

>  #include <linux/slab.h>

>  #include <linux/prefetch.h>

> +#include <linux/io.h>

> +#include <asm/mshyperv.h>

> 

>  #include "hyperv_vmbus.h"

> 

> @@ -179,43 +181,89 @@ void hv_ringbuffer_pre_init(struct vmbus_channel *channel)

>  	mutex_init(&channel->outbound.ring_buffer_mutex);

>  }

> 

> -/* Initialize the ring buffer. */

> -int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,

> -		       struct page *pages, u32 page_cnt, u32 max_pkt_size)

> +int hv_ringbuffer_post_init(struct hv_ring_buffer_info *ring_info,

> +		       struct page *pages, u32 page_cnt)

>  {

> +	u64 physic_addr = page_to_pfn(pages) << PAGE_SHIFT;

> +	unsigned long *pfns_wraparound;

> +	void *vaddr;

>  	int i;

> -	struct page **pages_wraparound;

> 

> -	BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));

> +	if (!hv_isolation_type_snp())

> +		return 0;

> +

> +	physic_addr += ms_hyperv.shared_gpa_boundary;

> 

>  	/*

>  	 * First page holds struct hv_ring_buffer, do wraparound mapping for

>  	 * the rest.

>  	 */

> -	pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),

> +	pfns_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(unsigned long),

>  				   GFP_KERNEL);

> -	if (!pages_wraparound)

> +	if (!pfns_wraparound)

>  		return -ENOMEM;

> 

> -	pages_wraparound[0] = pages;

> +	pfns_wraparound[0] = physic_addr >> PAGE_SHIFT;

>  	for (i = 0; i < 2 * (page_cnt - 1); i++)

> -		pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];

> -

> -	ring_info->ring_buffer = (struct hv_ring_buffer *)

> -		vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);

> -

> -	kfree(pages_wraparound);

> +		pfns_wraparound[i + 1] = (physic_addr >> PAGE_SHIFT) +

> +			i % (page_cnt - 1) + 1;

> 

> -

> -	if (!ring_info->ring_buffer)

> +	vaddr = vmap_pfn(pfns_wraparound, page_cnt * 2 - 1, PAGE_KERNEL_IO);

> +	kfree(pfns_wraparound);

> +	if (!vaddr)

>  		return -ENOMEM;

> 

> -	ring_info->ring_buffer->read_index =

> -		ring_info->ring_buffer->write_index = 0;

> +	/* Clean memory after setting host visibility. */

> +	memset((void *)vaddr, 0x00, page_cnt * PAGE_SIZE);

> +

> +	ring_info->ring_buffer = (struct hv_ring_buffer *)vaddr;

> +	ring_info->ring_buffer->read_index = 0;

> +	ring_info->ring_buffer->write_index = 0;

> 

>  	/* Set the feature bit for enabling flow control. */

>  	ring_info->ring_buffer->feature_bits.value = 1;

> 

> +	return 0;

> +}

> +

> +/* Initialize the ring buffer. */

> +int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,

> +		       struct page *pages, u32 page_cnt, u32 max_pkt_size)

> +{

> +	int i;

> +	struct page **pages_wraparound;

> +

> +	BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));

> +

> +	if (!hv_isolation_type_snp()) {

> +		/*

> +		 * First page holds struct hv_ring_buffer, do wraparound mapping for

> +		 * the rest.

> +		 */

> +		pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),

> +					   GFP_KERNEL);

> +		if (!pages_wraparound)

> +			return -ENOMEM;

> +

> +		pages_wraparound[0] = pages;

> +		for (i = 0; i < 2 * (page_cnt - 1); i++)

> +			pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];

> +

> +		ring_info->ring_buffer = (struct hv_ring_buffer *)

> +			vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);

> +

> +		kfree(pages_wraparound);

> +

> +		if (!ring_info->ring_buffer)

> +			return -ENOMEM;

> +

> +		ring_info->ring_buffer->read_index =

> +			ring_info->ring_buffer->write_index = 0;

> +

> +		/* Set the feature bit for enabling flow control. */

> +		ring_info->ring_buffer->feature_bits.value = 1;

> +	}

> +

>  	ring_info->ring_size = page_cnt << PAGE_SHIFT;

>  	ring_info->ring_size_div10_reciprocal =

>  		reciprocal_value(ring_info->ring_size / 10);

> --

> 2.25.1


This patch does the following:

1) The existing ring buffer wrap-around mapping functionality is still
executed in hv_ringbuffer_init() when not doing SNP isolation.
This mapping is based on an array of struct page's that describe the
contiguous physical memory.

2) New ring buffer wrap-around mapping functionality is added in
hv_ringbuffer_post_init() for the SNP isolation case.  The case is
handled in hv_ringbuffer_post_init() because it must be done after
the GPADL is established, since that's where the host visibility
is set.  What's interesting is that this case is exactly the same
as #1 above, except that the mapping is based on physical
memory addresses instead of struct page's.  We have to use physical
addresses because of applying the GPA boundary, and there are no
struct page's for those physical addresses.

Unfortunately, this duplicates a lot of logic in #1 and #2, except
for the struct page vs. physical address difference.

Proposal:  Couldn't we always do #2, even for the normal case
where SNP isolation is not being used?   The difference would
only be in whether the GPA boundary is added.  And it looks like
the normal case could be done after the GPADL is established,
as setting up the GPADL doesn't have any dependencies on
having the ring buffer mapped.  This approach would remove
a lot of duplication.  Just move the calls to hv_ringbuffer_init()
to after the GPADL is established, and do all the work there for
both cases.
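
A rough sketch of what the combined function might look like (untested;
hv_map_ring_pfns() and hv_map_ring_pages() are hypothetical helpers standing
in for the vmap_pfn()- and vmap()-based mapping paths):

int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
		       struct page *pages, u32 page_cnt, u32 max_pkt_size)
{
	struct hv_ring_buffer *rb;

	BUILD_BUG_ON(sizeof(struct hv_ring_buffer) != PAGE_SIZE);

	/*
	 * Wrap-around mapping: by PFN (GPA + shared_gpa_boundary) via
	 * vmap_pfn() for SNP, by struct page via vmap() otherwise, since
	 * vmap_pfn() rejects PFNs that are backed by struct page.
	 */
	if (hv_isolation_type_snp())
		rb = hv_map_ring_pfns(pages, page_cnt);		/* hypothetical helper */
	else
		rb = hv_map_ring_pages(pages, page_cnt);	/* hypothetical helper */
	if (!rb)
		return -ENOMEM;

	/* The host visibility hvcall smears the SNP ring buffer, so clear it */
	if (hv_isolation_type_snp())
		memset(rb, 0, page_cnt * PAGE_SIZE);

	ring_info->ring_buffer = rb;
	ring_info->ring_buffer->read_index = 0;
	ring_info->ring_buffer->write_index = 0;

	/* Set the feature bit for enabling flow control. */
	ring_info->ring_buffer->feature_bits.value = 1;

	/* ... remaining ring_size/reciprocal/pkt_buffer setup as today ... */
	return 0;
}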

Michael
Tianyu Lan Aug. 17, 2021, 3:36 p.m. UTC | #13
On 8/17/2021 1:28 AM, Michael Kelley wrote:
> This patch does the following:
> 
> 1) The existing ring buffer wrap-around mapping functionality is still
> executed in hv_ringbuffer_init() when not doing SNP isolation.
> This mapping is based on an array of struct page's that describe the
> contiguous physical memory.
> 
> 2) New ring buffer wrap-around mapping functionality is added in
> hv_ringbuffer_post_init() for the SNP isolation case.  The case is
> handled in hv_ringbuffer_post_init() because it must be done after
> the GPADL is established, since that's where the host visibility
> is set.  What's interesting is that this case is exactly the same
> as #1 above, except that the mapping is based on physical
> memory addresses instead of struct page's.  We have to use physical
> addresses because of applying the GPA boundary, and there are no
> struct page's for those physical addresses.
> 
> Unfortunately, this duplicates a lot of logic in #1 and #2, except
> for the struct page vs. physical address difference.
> 
> Proposal:  Couldn't we always do #2, even for the normal case
> where SNP isolation is not being used?   The difference would
> only be in whether the GPA boundary is added.  And it looks like
> the normal case could be done after the GPADL is established,
> as setting up the GPADL doesn't have any dependencies on
> having the ring buffer mapped.  This approach would remove
> a lot of duplication.  Just move the calls to hv_ringbuffer_init()
> to after the GPADL is established, and do all the work there for
> both cases.
> 

Hi Michael:
     Thanks for the suggestion. I just kept the original logic in the current
code. I will try combining these two functions and report back.

Thanks.
Michael Kelley Aug. 19, 2021, 6:11 p.m. UTC | #14
From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

> 

> Hyper-V Isolation VM requires bounce buffer support to copy

> data from/to encrypted memory and so enable swiotlb force

> mode to use swiotlb bounce buffer for DMA transaction.

> 

> In Isolation VM with AMD SEV, the bounce buffer needs to be

> accessed via extra address space which is above shared_gpa_boundary

> (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG.

> The access physical address will be original physical address +

> shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP

> spec is called virtual top of memory(vTOM). Memory addresses below

> vTOM are automatically treated as private while memory above

> vTOM is treated as shared.

> 

> Swiotlb bounce buffer code calls dma_map_decrypted()

> to mark bounce buffer visible to host and map it in extra

> address space. Populate dma memory decrypted ops with hv

> map/unmap function.

> 

> Hyper-V initalizes swiotlb bounce buffer and default swiotlb

> needs to be disabled. pci_swiotlb_detect_override() and

> pci_swiotlb_detect_4gb() enable the default one. To override

> the setting, hyperv_swiotlb_detect() needs to run before

> these detect functions which depends on the pci_xen_swiotlb_

> init(). Make pci_xen_swiotlb_init() depends on the hyperv_swiotlb

> _detect() to keep the order.

> 

> The map function vmap_pfn() can't work in the early place

> hyperv_iommu_swiotlb_init() and so initialize swiotlb bounce

> buffer in the hyperv_iommu_swiotlb_later_init().

> 

> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>

> ---

>  arch/x86/hyperv/ivm.c           | 28 ++++++++++++++

>  arch/x86/include/asm/mshyperv.h |  2 +

>  arch/x86/xen/pci-swiotlb-xen.c  |  3 +-

>  drivers/hv/vmbus_drv.c          |  3 ++

>  drivers/iommu/hyperv-iommu.c    | 65 +++++++++++++++++++++++++++++++++

>  include/linux/hyperv.h          |  1 +

>  6 files changed, 101 insertions(+), 1 deletion(-)

> 

> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c

> index c13ec5560d73..0f05e4d6fc62 100644

> --- a/arch/x86/hyperv/ivm.c

> +++ b/arch/x86/hyperv/ivm.c

> @@ -265,3 +265,31 @@ int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible)

> 

>  	return __hv_set_mem_host_visibility((void *)addr, numpages, visibility);

>  }

> +

> +/*

> + * hv_map_memory - map memory to extra space in the AMD SEV-SNP Isolation VM.

> + */

> +void *hv_map_memory(void *addr, unsigned long size)

> +{

> +	unsigned long *pfns = kcalloc(size / HV_HYP_PAGE_SIZE,

> +				      sizeof(unsigned long), GFP_KERNEL);

> +	void *vaddr;

> +	int i;

> +

> +	if (!pfns)

> +		return NULL;

> +

> +	for (i = 0; i < size / HV_HYP_PAGE_SIZE; i++)

> +		pfns[i] = virt_to_hvpfn(addr + i * HV_HYP_PAGE_SIZE) +

> +			(ms_hyperv.shared_gpa_boundary >> HV_HYP_PAGE_SHIFT);

> +

> +	vaddr = vmap_pfn(pfns, size / HV_HYP_PAGE_SIZE,	PAGE_KERNEL_IO);

> +	kfree(pfns);

> +

> +	return vaddr;

> +}


This function is manipulating page tables in the guest VM.  It is not involved
in communicating with Hyper-V, or passing PFNs to Hyper-V.  The pfn array
contains guest PFNs, not Hyper-V PFNs.  So it should use PAGE_SIZE
instead of HV_HYP_PAGE_SIZE, and similarly PAGE_SHIFT and virt_to_pfn().
If this code were ever to run on ARM64 in the future with PAGE_SIZE other
than 4 Kbytes, the use of PAGE_SIZE is the correct choice.
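
A sketch of hv_map_memory() reworked along those lines (untested; assumes the
buffer is in the direct map so virt_to_phys() is valid here):

void *hv_map_memory(void *addr, unsigned long size)
{
	unsigned long *pfns = kcalloc(size / PAGE_SIZE,
				      sizeof(unsigned long), GFP_KERNEL);
	void *vaddr;
	int i;

	if (!pfns)
		return NULL;

	/* Guest PFNs plus the vTOM offset, using guest PAGE_SIZE throughout */
	for (i = 0; i < size / PAGE_SIZE; i++)
		pfns[i] = (virt_to_phys(addr + i * PAGE_SIZE) >> PAGE_SHIFT) +
			(ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);

	vaddr = vmap_pfn(pfns, size / PAGE_SIZE, PAGE_KERNEL_IO);
	kfree(pfns);

	return vaddr;
}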

> +

> +void hv_unmap_memory(void *addr)

> +{

> +	vunmap(addr);

> +}

> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h

> index a30c60f189a3..b247739f57ac 100644

> --- a/arch/x86/include/asm/mshyperv.h

> +++ b/arch/x86/include/asm/mshyperv.h

> @@ -250,6 +250,8 @@ int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);

>  int hv_mark_gpa_visibility(u16 count, const u64 pfn[],

>  			   enum hv_mem_host_visibility visibility);

>  int hv_set_mem_host_visibility(unsigned long addr, int numpages, bool visible);

> +void *hv_map_memory(void *addr, unsigned long size);

> +void hv_unmap_memory(void *addr);

>  void hv_sint_wrmsrl_ghcb(u64 msr, u64 value);

>  void hv_sint_rdmsrl_ghcb(u64 msr, u64 *value);

>  void hv_signal_eom_ghcb(void);

> diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c

> index 54f9aa7e8457..43bd031aa332 100644

> --- a/arch/x86/xen/pci-swiotlb-xen.c

> +++ b/arch/x86/xen/pci-swiotlb-xen.c

> @@ -4,6 +4,7 @@

> 

>  #include <linux/dma-map-ops.h>

>  #include <linux/pci.h>

> +#include <linux/hyperv.h>

>  #include <xen/swiotlb-xen.h>

> 

>  #include <asm/xen/hypervisor.h>

> @@ -91,6 +92,6 @@ int pci_xen_swiotlb_init_late(void)

>  EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);

> 

>  IOMMU_INIT_FINISH(pci_xen_swiotlb_detect,

> -		  NULL,

> +		  hyperv_swiotlb_detect,

>  		  pci_xen_swiotlb_init,

>  		  NULL);

> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c

> index 57bbbaa4e8f7..f068e22a5636 100644

> --- a/drivers/hv/vmbus_drv.c

> +++ b/drivers/hv/vmbus_drv.c

> @@ -23,6 +23,7 @@

>  #include <linux/cpu.h>

>  #include <linux/sched/task_stack.h>

> 

> +#include <linux/dma-map-ops.h>

>  #include <linux/delay.h>

>  #include <linux/notifier.h>

>  #include <linux/panic_notifier.h>

> @@ -2081,6 +2082,7 @@ struct hv_device *vmbus_device_create(const guid_t *type,

>  	return child_device_obj;

>  }

> 

> +static u64 vmbus_dma_mask = DMA_BIT_MASK(64);

>  /*

>   * vmbus_device_register - Register the child device

>   */

> @@ -2121,6 +2123,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)

>  	}

>  	hv_debug_add_dev_dir(child_device_obj);

> 

> +	child_device_obj->device.dma_mask = &vmbus_dma_mask;

>  	return 0;

> 

>  err_kset_unregister:

> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c

> index e285a220c913..01e874b3b43a 100644

> --- a/drivers/iommu/hyperv-iommu.c

> +++ b/drivers/iommu/hyperv-iommu.c

> @@ -13,14 +13,22 @@

>  #include <linux/irq.h>

>  #include <linux/iommu.h>

>  #include <linux/module.h>

> +#include <linux/hyperv.h>

> +#include <linux/io.h>

> 

>  #include <asm/apic.h>

>  #include <asm/cpu.h>

>  #include <asm/hw_irq.h>

>  #include <asm/io_apic.h>

> +#include <asm/iommu.h>

> +#include <asm/iommu_table.h>

>  #include <asm/irq_remapping.h>

>  #include <asm/hypervisor.h>

>  #include <asm/mshyperv.h>

> +#include <asm/swiotlb.h>

> +#include <linux/dma-map-ops.h>

> +#include <linux/dma-direct.h>

> +#include <linux/set_memory.h>

> 

>  #include "irq_remapping.h"

> 

> @@ -36,6 +44,9 @@

>  static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };

>  static struct irq_domain *ioapic_ir_domain;

> 

> +static unsigned long hyperv_io_tlb_size;

> +static void *hyperv_io_tlb_start;

> +

>  static int hyperv_ir_set_affinity(struct irq_data *data,

>  		const struct cpumask *mask, bool force)

>  {

> @@ -337,4 +348,58 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {

>  	.free = hyperv_root_irq_remapping_free,

>  };

> 

> +void __init hyperv_iommu_swiotlb_init(void)

> +{

> +	unsigned long bytes;

> +

> +	/*

> +	 * Allocate Hyper-V swiotlb bounce buffer at early place

> +	 * to reserve large contiguous memory.

> +	 */

> +	hyperv_io_tlb_size = 256 * 1024 * 1024;


A hard coded size here seems problematic.   The memory size of
Isolated VMs can vary by orders of magnitude.  I see that
xen_swiotlb_init() uses swiotlb_size_or_default(), which at least
pays attention to the value specified on the kernel boot line.

Another example is sev_setup_arch(), which in the native case sets
the size to 6% of main memory, with a max of 1 Gbyte.  This is
the case that's closer to Isolated VMs, so doing something
similar could be a good approach.
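
For instance, mirroring sev_setup_arch() would be something like this sketch
(not part of the posted patch):

	u64 total_mem = memblock_phys_mem_size();
	unsigned long size;

	/* 6% of guest memory, clamped between the swiotlb default and 1 GByte */
	size = total_mem * 6 / 100;
	size = clamp_val(size, IO_TLB_DEFAULT_SIZE, SZ_1G);
	hyperv_io_tlb_size = size;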

> +	hyperv_io_tlb_start =

> +			memblock_alloc_low(

> +				  PAGE_ALIGN(hyperv_io_tlb_size),

> +				  HV_HYP_PAGE_SIZE);

> +

> +	if (!hyperv_io_tlb_start) {

> +		pr_warn("Fail to allocate Hyper-V swiotlb buffer.\n");

> +		return;

> +	}

> +}

> +

> +int __init hyperv_swiotlb_detect(void)

> +{

> +	if (hypervisor_is_type(X86_HYPER_MS_HYPERV)

> +	    && hv_is_isolation_supported()) {

> +		/*

> +		 * Enable swiotlb force mode in Isolation VM to

> +		 * use swiotlb bounce buffer for dma transaction.

> +		 */

> +		swiotlb_force = SWIOTLB_FORCE;

> +

> +		dma_memory_generic_decrypted_ops.map = hv_map_memory;

> +		dma_memory_generic_decrypted_ops.unmap = hv_unmap_memory;

> +		return 1;

> +	}

> +

> +	return 0;

> +}

> +

> +void __init hyperv_iommu_swiotlb_later_init(void)

> +{

> +	/*

> +	 * Swiotlb bounce buffer needs to be mapped in extra address

> +	 * space. Map function doesn't work in the early place and so

> +	 * call swiotlb_late_init_with_tbl() here.

> +	 */

> +	if (swiotlb_late_init_with_tbl(hyperv_io_tlb_start,

> +				       hyperv_io_tlb_size >> IO_TLB_SHIFT))

> +		panic("Fail to initialize hyperv swiotlb.\n");

> +}

> +

> +IOMMU_INIT_FINISH(hyperv_swiotlb_detect,

> +		  NULL, hyperv_iommu_swiotlb_init,

> +		  hyperv_iommu_swiotlb_later_init);

> +

>  #endif

> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h

> index 90b542597143..83fa567ad594 100644

> --- a/include/linux/hyperv.h

> +++ b/include/linux/hyperv.h

> @@ -1744,6 +1744,7 @@ int hyperv_write_cfg_blk(struct pci_dev *dev, void *buf, unsigned int len,

>  int hyperv_reg_block_invalidate(struct pci_dev *dev, void *context,

>  				void (*block_invalidate)(void *context,

>  							 u64 block_mask));

> +int __init hyperv_swiotlb_detect(void);

> 

>  struct hyperv_pci_block_ops {

>  	int (*read_block)(struct pci_dev *dev, void *buf, unsigned int buf_len,

> --

> 2.25.1
Michael Kelley Aug. 19, 2021, 6:17 p.m. UTC | #15
From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

> 


Subject line tag should be "scsi: storvsc:"

> In Isolation VM, all shared memory with host needs to mark visible

> to host via hvcall. vmbus_establish_gpadl() has already done it for

> storvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_

> mpb_desc() still need to handle. Use DMA API to map/umap these


s/need to handle/needs to be handled/

> memory during sending/receiving packet and Hyper-V DMA ops callback

> will use swiotlb function to allocate bounce buffer and copy data

> from/to bounce buffer.

> 

> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>

> ---

>  drivers/scsi/storvsc_drv.c | 68 +++++++++++++++++++++++++++++++++++---

>  1 file changed, 63 insertions(+), 5 deletions(-)

> 

> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c

> index 328bb961c281..78320719bdd8 100644

> --- a/drivers/scsi/storvsc_drv.c

> +++ b/drivers/scsi/storvsc_drv.c

> @@ -21,6 +21,8 @@

>  #include <linux/device.h>

>  #include <linux/hyperv.h>

>  #include <linux/blkdev.h>

> +#include <linux/io.h>

> +#include <linux/dma-mapping.h>

>  #include <scsi/scsi.h>

>  #include <scsi/scsi_cmnd.h>

>  #include <scsi/scsi_host.h>

> @@ -427,6 +429,8 @@ struct storvsc_cmd_request {

>  	u32 payload_sz;

> 

>  	struct vstor_packet vstor_packet;

> +	u32 hvpg_count;


This count is really the number of entries in the dma_range
array, right?  If so, perhaps "dma_range_count" would be
a better name so that it is more tightly associated.

> +	struct hv_dma_range *dma_range;

>  };

> 

> 

> @@ -509,6 +513,14 @@ struct storvsc_scan_work {

>  	u8 tgt_id;

>  };

> 

> +#define storvsc_dma_map(dev, page, offset, size, dir) \

> +	dma_map_page(dev, page, offset, size, dir)

> +

> +#define storvsc_dma_unmap(dev, dma_range, dir)		\

> +		dma_unmap_page(dev, dma_range.dma,	\

> +			       dma_range.mapping_size,	\

> +			       dir ? DMA_FROM_DEVICE : DMA_TO_DEVICE)

> +


Each of these macros is used only once.  IMHO, they don't
add a lot of value.  Just coding dma_map/unmap_page()
inline would be fine and eliminate these lines of code.

>  static void storvsc_device_scan(struct work_struct *work)

>  {

>  	struct storvsc_scan_work *wrk;

> @@ -1260,6 +1272,7 @@ static void storvsc_on_channel_callback(void *context)

>  	struct hv_device *device;

>  	struct storvsc_device *stor_device;

>  	struct Scsi_Host *shost;

> +	int i;

> 

>  	if (channel->primary_channel != NULL)

>  		device = channel->primary_channel->device_obj;

> @@ -1314,6 +1327,15 @@ static void storvsc_on_channel_callback(void *context)

>  				request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);

>  			}

> 

> +			if (request->dma_range) {

> +				for (i = 0; i < request->hvpg_count; i++)

> +					storvsc_dma_unmap(&device->device,

> +						request->dma_range[i],

> +						request->vstor_packet.vm_srb.data_in == READ_TYPE);


I think you can directly get the DMA direction as request->cmd->sc_data_direction.
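
Combined with the earlier point about dropping the one-use macros, the unmap
loop could then read (untested sketch):

			if (request->dma_range) {
				for (i = 0; i < request->hvpg_count; i++)
					dma_unmap_page(&device->device,
						       request->dma_range[i].dma,
						       request->dma_range[i].mapping_size,
						       request->cmd->sc_data_direction);

				kfree(request->dma_range);
			}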

> +

> +				kfree(request->dma_range);

> +			}

> +

>  			storvsc_on_receive(stor_device, packet, request);

>  			continue;

>  		}

> @@ -1810,7 +1832,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

>  		unsigned int hvpgoff, hvpfns_to_add;

>  		unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);

>  		unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);

> +		dma_addr_t dma;

>  		u64 hvpfn;

> +		u32 size;

> 

>  		if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {

> 

> @@ -1824,6 +1848,13 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

>  		payload->range.len = length;

>  		payload->range.offset = offset_in_hvpg;

> 

> +		cmd_request->dma_range = kcalloc(hvpg_count,

> +				 sizeof(*cmd_request->dma_range),

> +				 GFP_ATOMIC);


With this patch, it appears that storvsc_queuecommand() is always
doing bounce buffering, even when running in a non-isolated VM.
The dma_range is always allocated, and the inner loop below does
the dma mapping for every I/O page.  The corresponding code in
storvsc_on_channel_callback() that does the dma unmap allows for
the dma_range to be NULL, but that never happens.

> +		if (!cmd_request->dma_range) {

> +			ret = -ENOMEM;


The other memory allocation failure in this function returns
SCSI_MLQUEUE_DEVICE_BUSY.   It may be debatable as to whether
that's the best approach, but that's a topic for a different patch.  I
would suggest being consistent and using the same return code
here.

> +			goto free_payload;

> +		}

> 

>  		for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {

>  			/*

> @@ -1847,9 +1878,29 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

>  			 * last sgl should be reached at the same time that

>  			 * the PFN array is filled.

>  			 */

> -			while (hvpfns_to_add--)

> -				payload->range.pfn_array[i++] =	hvpfn++;

> +			while (hvpfns_to_add--) {

> +				size = min(HV_HYP_PAGE_SIZE - offset_in_hvpg,

> +					   (unsigned long)length);

> +				dma = storvsc_dma_map(&dev->device, pfn_to_page(hvpfn++),

> +						      offset_in_hvpg, size,

> +						      scmnd->sc_data_direction);

> +				if (dma_mapping_error(&dev->device, dma)) {

> +					ret = -ENOMEM;


The typical error from dma_map_page() will be running out of
bounce buffer memory.   This is a transient condition that should be
retried at the higher levels.  So make sure to return an error code
that indicates the I/O should be resubmitted.

> +					goto free_dma_range;

> +				}

> +

> +				if (offset_in_hvpg) {

> +					payload->range.offset = dma & ~HV_HYP_PAGE_MASK;

> +					offset_in_hvpg = 0;

> +				}


I'm not clear on why payload->range.offset needs to be set again.
Even after the dma mapping is done, doesn't the offset in the first
page have to be the same?  If it wasn't the same, Hyper-V wouldn't
be able to process the PFN list correctly.  In fact, couldn't the above
code just always set offset_in_hvpg = 0?

> +

> +				cmd_request->dma_range[i].dma = dma;

> +				cmd_request->dma_range[i].mapping_size = size;

> +				payload->range.pfn_array[i++] = dma >> HV_HYP_PAGE_SHIFT;

> +				length -= size;

> +			}

>  		}

> +		cmd_request->hvpg_count = hvpg_count;


This line just saves the size of the dma_range array.  Could
it be moved up with the code that allocates the dma_range
array?  To me, it would make more sense to have all that
code together in one place.

>  	}


The whole approach here is to do dma remapping on each individual page
of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map
each scatterlist entry as a unit?  Each scatterlist entry describes a range of
physically contiguous memory.  After dma_map_sg(), the resulting dma
address must also refer to a physically contiguous range in the swiotlb
bounce buffer memory.   So at the top of the "for" loop over the scatterlist
entries, do dma_map_sg() if we're in an isolated VM.  Then compute the
hvpfn value based on the dma address instead of sg_page().  But everything
else is the same, and the inner loop for populating the pfn_array is unmodified.
Furthermore, the dma_range array that you've added is not needed, since
scatterlist entries already have a dma_address field for saving the mapped
address, and dma_unmap_sg() uses that field.

One thing:  There's a maximum swiotlb mapping size, which I think works
out to be 256 Kbytes.  See swiotlb_max_mapping_size().  We need to make
sure that we don't get a scatterlist entry bigger than this size.  But I think
this already happens because you set the device->dma_mask field in
Patch 11 of this series.  __scsi_init_queue checks for this setting and
sets max_sectors to limit transfers to the max mapping size.
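
Roughly, the check in __scsi_init_queue() amounts to this (paraphrased, not
the exact upstream code):

	if (dev->dma_mask)
		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
					   dma_max_mapping_size(dev) >> SECTOR_SHIFT);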

> 

>  	cmd_request->payload = payload;

> @@ -1860,13 +1911,20 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

>  	put_cpu();

> 

>  	if (ret == -EAGAIN) {

> -		if (payload_sz > sizeof(cmd_request->mpb))

> -			kfree(payload);

>  		/* no more space */

> -		return SCSI_MLQUEUE_DEVICE_BUSY;

> +		ret = SCSI_MLQUEUE_DEVICE_BUSY;

> +		goto free_dma_range;

>  	}

> 

>  	return 0;

> +

> +free_dma_range:

> +	kfree(cmd_request->dma_range);

> +

> +free_payload:

> +	if (payload_sz > sizeof(cmd_request->mpb))

> +		kfree(payload);

> +	return ret;

>  }

> 

>  static struct scsi_host_template scsi_driver = {

> --

> 2.25.1
Christoph Hellwig Aug. 20, 2021, 4:13 a.m. UTC | #16
On Thu, Aug 19, 2021 at 06:11:30PM +0000, Michael Kelley wrote:
> This function is manipulating page tables in the guest VM.  It is not involved

> in communicating with Hyper-V, or passing PFNs to Hyper-V.  The pfn array

> contains guest PFNs, not Hyper-V PFNs.  So it should use PAGE_SIZE

> instead of HV_HYP_PAGE_SIZE, and similarly PAGE_SHIFT and virt_to_pfn().

> If this code were ever to run on ARM64 in the future with PAGE_SIZE other

> than 4 Kbytes, the use of PAGE_SIZE is correct choice.


Yes.  I just stumled over this yesterday.  I think we can actually use a
nicer helper ased around vmap_range() to simplify this exactly because it
always uses the host page size.  I hope I can draft up a RFC today.
Christoph Hellwig Aug. 20, 2021, 4:32 a.m. UTC | #17
On Thu, Aug 19, 2021 at 06:17:40PM +0000, Michael Kelley wrote:
> > +#define storvsc_dma_map(dev, page, offset, size, dir) \

> > +	dma_map_page(dev, page, offset, size, dir)

> > +

> > +#define storvsc_dma_unmap(dev, dma_range, dir)		\

> > +		dma_unmap_page(dev, dma_range.dma,	\

> > +			       dma_range.mapping_size,	\

> > +			       dir ? DMA_FROM_DEVICE : DMA_TO_DEVICE)

> > +

> 

> Each of these macros is used only once.  IMHO, they don't

> add a lot of value.  Just coding dma_map/unmap_page()

> inline would be fine and eliminate these lines of code.


Yes, I had the same thought when looking over the code.  Especially
as macros tend to further obfuscate the code (compared to actual helper
functions).

> > +				for (i = 0; i < request->hvpg_count; i++)

> > +					storvsc_dma_unmap(&device->device,

> > +						request->dma_range[i],

> > +						request->vstor_packet.vm_srb.data_in == READ_TYPE);

> 

> I think you can directly get the DMA direction as request->cmd->sc_data_direction.


Yes.

> > 

> > @@ -1824,6 +1848,13 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

> >  		payload->range.len = length;

> >  		payload->range.offset = offset_in_hvpg;

> > 

> > +		cmd_request->dma_range = kcalloc(hvpg_count,

> > +				 sizeof(*cmd_request->dma_range),

> > +				 GFP_ATOMIC);

> 

> With this patch, it appears that storvsc_queuecommand() is always

> doing bounce buffering, even when running in a non-isolated VM.

> The dma_range is always allocated, and the inner loop below does

> the dma mapping for every I/O page.  The corresponding code in

> storvsc_on_channel_callback() that does the dma unmap allows for

> the dma_range to be NULL, but that never happens.


Maybe I'm missing something in the hyperv code, but I don't think
dma_map_page would bounce buffer for the non-isolated case.  It
will just return the physical address.

> > +		if (!cmd_request->dma_range) {

> > +			ret = -ENOMEM;

> 

> The other memory allocation failure in this function returns

> SCSI_MLQUEUE_DEVICE_BUSY.   It may be debatable as to whether

> that's the best approach, but that's a topic for a different patch.  I

> would suggest being consistent and using the same return code

> here.


Independent of whether SCSI_MLQUEUE_DEVICE_BUSY is good (it is a common
pattern in SCSI drivers), ->queuecommand can't return normal
negative errnos.  It must return the SCSI_MLQUEUE_* codes or 0.
We should probably change the return type of the method definition
to a suitable enum to make this more clear.
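
Purely as an illustration of that idea (not an existing kernel definition):

/* Hypothetical stricter return type for ->queuecommand() */
enum scsi_qc_status {
	SCSI_QC_ACCEPTED	= 0,
	SCSI_QC_HOST_BUSY	= SCSI_MLQUEUE_HOST_BUSY,
	SCSI_QC_DEVICE_BUSY	= SCSI_MLQUEUE_DEVICE_BUSY,
	SCSI_QC_TARGET_BUSY	= SCSI_MLQUEUE_TARGET_BUSY,
	SCSI_QC_EH_RETRY	= SCSI_MLQUEUE_EH_RETRY,
};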

> > +				if (offset_in_hvpg) {

> > +					payload->range.offset = dma & ~HV_HYP_PAGE_MASK;

> > +					offset_in_hvpg = 0;

> > +				}

> 

> I'm not clear on why payload->range.offset needs to be set again.

> Even after the dma mapping is done, doesn't the offset in the first

> page have to be the same?  If it wasn't the same, Hyper-V wouldn't

> be able to process the PFN list correctly.  In fact, couldn't the above

> code just always set offset_in_hvpg = 0?


Careful.  DMA mapping is supposed to keep the offset in the page, but
for that the DMA mapping code needs to know what the device considers a
"page".  For that the driver needs to set the min_align_mask field in
struct device_dma_parameters.
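
In storvsc that could be a one-liner at probe time, the same way NVMe does it
(a sketch; dma_set_min_align_mask() is the existing helper):

	/* Tell the DMA/swiotlb layer that Hyper-V pages are the device's "page" */
	dma_set_min_align_mask(&device->device, HV_HYP_PAGE_SIZE - 1);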

> 

> The whole approach here is to do dma remapping on each individual page

> of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map

> each scatterlist entry as a unit?  Each scatterlist entry describes a range of

> physically contiguous memory.  After dma_map_sg(), the resulting dma

> address must also refer to a physically contiguous range in the swiotlb

> bounce buffer memory.   So at the top of the "for" loop over the scatterlist

> entries, do dma_map_sg() if we're in an isolated VM.  Then compute the

> hvpfn value based on the dma address instead of sg_page().  But everything

> else is the same, and the inner loop for populating the pfn_arry is unmodified.

> Furthermore, the dma_range array that you've added is not needed, since

> scatterlist entries already have a dma_address field for saving the mapped

> address, and dma_unmap_sg() uses that field.


Yes, I think dma_map_sg is the right thing to use here, probably even
for the non-isolated case so that we can get the hv drivers out of their
little corner and into being more like a normal kernel driver.  That
is, use the scsi_dma_map/scsi_dma_unmap helpers, and then iterate over
the dma addresses one page at a time using for_each_sg_dma_page.
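
A rough sketch of that shape (untested; assumes PAGE_SIZE == HV_HYP_PAGE_SIZE
as on x86):

	struct sg_dma_page_iter dma_iter;
	int sg_count = scsi_dma_map(scmnd);	/* maps the whole sgl, bouncing if needed */

	if (sg_count < 0)
		return SCSI_MLQUEUE_DEVICE_BUSY;

	for_each_sg_dma_page(scsi_sglist(scmnd), &dma_iter, sg_count, 0)
		payload->range.pfn_array[i++] =
			sg_page_iter_dma_address(&dma_iter) >> HV_HYP_PAGE_SHIFT;

	/* and scsi_dma_unmap(scmnd) on completion in storvsc_on_channel_callback() */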

> 

> One thing:  There's a maximum swiotlb mapping size, which I think works

> out to be 256 Kbytes.  See swiotlb_max_mapping_size().  We need to make

> sure that we don't get a scatterlist entry bigger than this size.  But I think

> this already happens because you set the device->dma_mask field in

> Patch 11 of this series.  __scsi_init_queue checks for this setting and

> sets max_sectors to limits transfers to the max mapping size.


Indeed.
Tianyu Lan Aug. 20, 2021, 3:20 p.m. UTC | #18
On 8/20/2021 2:17 AM, Michael Kelley wrote:
> From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

>>

> 

> Subject line tag should be "scsi: storvsc:"

> 

>> In Isolation VM, all shared memory with host needs to mark visible

>> to host via hvcall. vmbus_establish_gpadl() has already done it for

>> storvsc rx/tx ring buffer. The page buffer used by vmbus_sendpacket_

>> mpb_desc() still need to handle. Use DMA API to map/umap these

> 

> s/need to handle/needs to be handled/

> 

>> memory during sending/receiving packet and Hyper-V DMA ops callback

>> will use swiotlb function to allocate bounce buffer and copy data

>> from/to bounce buffer.

>>

>> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>

>> ---

>>   drivers/scsi/storvsc_drv.c | 68 +++++++++++++++++++++++++++++++++++---

>>   1 file changed, 63 insertions(+), 5 deletions(-)

>>

>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c

>> index 328bb961c281..78320719bdd8 100644

>> --- a/drivers/scsi/storvsc_drv.c

>> +++ b/drivers/scsi/storvsc_drv.c

>> @@ -21,6 +21,8 @@

>>   #include <linux/device.h>

>>   #include <linux/hyperv.h>

>>   #include <linux/blkdev.h>

>> +#include <linux/io.h>

>> +#include <linux/dma-mapping.h>

>>   #include <scsi/scsi.h>

>>   #include <scsi/scsi_cmnd.h>

>>   #include <scsi/scsi_host.h>

>> @@ -427,6 +429,8 @@ struct storvsc_cmd_request {

>>   	u32 payload_sz;

>>

>>   	struct vstor_packet vstor_packet;

>> +	u32 hvpg_count;

> 

> This count is really the number of entries in the dma_range

> array, right?  If so, perhaps "dma_range_count" would be

> a better name so that it is more tightly associated.


Yes, will update.

> 

>> +	struct hv_dma_range *dma_range;

>>   };

>>

>>

>> @@ -509,6 +513,14 @@ struct storvsc_scan_work {

>>   	u8 tgt_id;

>>   };

>>

>> +#define storvsc_dma_map(dev, page, offset, size, dir) \

>> +	dma_map_page(dev, page, offset, size, dir)

>> +

>> +#define storvsc_dma_unmap(dev, dma_range, dir)		\

>> +		dma_unmap_page(dev, dma_range.dma,	\

>> +			       dma_range.mapping_size,	\

>> +			       dir ? DMA_FROM_DEVICE : DMA_TO_DEVICE)

>> +

> 

> Each of these macros is used only once.  IMHO, they don't

> add a lot of value.  Just coding dma_map/unmap_page()

> inline would be fine and eliminate these lines of code.


OK. Will update.

> 

>>   static void storvsc_device_scan(struct work_struct *work)

>>   {

>>   	struct storvsc_scan_work *wrk;

>> @@ -1260,6 +1272,7 @@ static void storvsc_on_channel_callback(void *context)

>>   	struct hv_device *device;

>>   	struct storvsc_device *stor_device;

>>   	struct Scsi_Host *shost;

>> +	int i;

>>

>>   	if (channel->primary_channel != NULL)

>>   		device = channel->primary_channel->device_obj;

>> @@ -1314,6 +1327,15 @@ static void storvsc_on_channel_callback(void *context)

>>   				request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);

>>   			}

>>

>> +			if (request->dma_range) {

>> +				for (i = 0; i < request->hvpg_count; i++)

>> +					storvsc_dma_unmap(&device->device,

>> +						request->dma_range[i],

>> +						request->vstor_packet.vm_srb.data_in == READ_TYPE);

> 

> I think you can directly get the DMA direction as request->cmd->sc_data_direction.

> 

>> +

>> +				kfree(request->dma_range);

>> +			}

>> +

>>   			storvsc_on_receive(stor_device, packet, request);

>>   			continue;

>>   		}

>> @@ -1810,7 +1832,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

>>   		unsigned int hvpgoff, hvpfns_to_add;

>>   		unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);

>>   		unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);

>> +		dma_addr_t dma;

>>   		u64 hvpfn;

>> +		u32 size;

>>

>>   		if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {

>>

>> @@ -1824,6 +1848,13 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

>>   		payload->range.len = length;

>>   		payload->range.offset = offset_in_hvpg;

>>

>> +		cmd_request->dma_range = kcalloc(hvpg_count,

>> +				 sizeof(*cmd_request->dma_range),

>> +				 GFP_ATOMIC);

> 

> With this patch, it appears that storvsc_queuecommand() is always

> doing bounce buffering, even when running in a non-isolated VM.


In the non-isolated VM, SWIOTLB_FORCE mode isn't enabled and so
the swiotlb bounce buffer will not be used.

> The dma_range is always allocated, and the inner loop below does

> the dma mapping for every I/O page.  The corresponding code in

> storvsc_on_channel_callback() that does the dma unmap allows for

> the dma_range to be NULL, but that never happens.


Yes, the dma mapping function will return the PA directly in a non-isolated VM.

> 

>> +		if (!cmd_request->dma_range) {

>> +			ret = -ENOMEM;

> 

> The other memory allocation failure in this function returns

> SCSI_MLQUEUE_DEVICE_BUSY.   It may be debatable as to whether

> that's the best approach, but that's a topic for a different patch.  I

> would suggest being consistent and using the same return code

> here.


OK. I will keep returning SCSI_MLQUEUE_DEVICE_BUSY here.

> 

>> +			goto free_payload;

>> +		}

>>

>>   		for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {

>>   			/*

>> @@ -1847,9 +1878,29 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

>>   			 * last sgl should be reached at the same time that

>>   			 * the PFN array is filled.

>>   			 */

>> -			while (hvpfns_to_add--)

>> -				payload->range.pfn_array[i++] =	hvpfn++;

>> +			while (hvpfns_to_add--) {

>> +				size = min(HV_HYP_PAGE_SIZE - offset_in_hvpg,

>> +					   (unsigned long)length);

>> +				dma = storvsc_dma_map(&dev->device, pfn_to_page(hvpfn++),

>> +						      offset_in_hvpg, size,

>> +						      scmnd->sc_data_direction);

>> +				if (dma_mapping_error(&dev->device, dma)) {

>> +					ret = -ENOMEM;

> 

> The typical error from dma_map_page() will be running out of

> bounce buffer memory.   This is a transient condition that should be

> retried at the higher levels.  So make sure to return an error code

> that indicates the I/O should be resubmitted.


OK. It looks like the error code should be SCSI_MLQUEUE_DEVICE_BUSY here.

> 

>> +					goto free_dma_range;

>> +				}

>> +

>> +				if (offset_in_hvpg) {

>> +					payload->range.offset = dma & ~HV_HYP_PAGE_MASK;

>> +					offset_in_hvpg = 0;

>> +				}

> 

> I'm not clear on why payload->range.offset needs to be set again.

> Even after the dma mapping is done, doesn't the offset in the first

> page have to be the same?  If it wasn't the same, Hyper-V wouldn't

> be able to process the PFN list correctly.  In fact, couldn't the above

> code just always set offset_in_hvpg = 0?


The offset can change. The swiotlb bounce buffer is allocated in
IO_TLB_SIZE (2K) units, so the offset here may be different after mapping.

> 

>> +

>> +				cmd_request->dma_range[i].dma = dma;

>> +				cmd_request->dma_range[i].mapping_size = size;

>> +				payload->range.pfn_array[i++] = dma >> HV_HYP_PAGE_SHIFT;

>> +				length -= size;

>> +			}

>>   		}

>> +		cmd_request->hvpg_count = hvpg_count;

> 

> This line just saves the size of the dma_range array.  Could

> it be moved up with the code that allocates the dma_range

> array?  To me, it would make more sense to have all that

> code together in one place.


Sure. Will update.

> 

>>   	}

> 

> The whole approach here is to do dma remapping on each individual page

> of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map

> each scatterlist entry as a unit?  Each scatterlist entry describes a range of

> physically contiguous memory.  After dma_map_sg(), the resulting dma

> address must also refer to a physically contiguous range in the swiotlb

> bounce buffer memory.   So at the top of the "for" loop over the scatterlist

> entries, do dma_map_sg() if we're in an isolated VM.  Then compute the

> hvpfn value based on the dma address instead of sg_page().  But everything

> else is the same, and the inner loop for populating the pfn_arry is unmodified.

> Furthermore, the dma_range array that you've added is not needed, since

> scatterlist entries already have a dma_address field for saving the mapped

> address, and dma_unmap_sg() uses that field.


I don't use dma_map_sg() here in order to avoid introducing one more
loop (e.g. dma_map_sg()). We already have a loop to populate
cmd_request->dma_range[] and so the dma map is done in the same loop.

> 

> One thing:  There's a maximum swiotlb mapping size, which I think works

> out to be 256 Kbytes.  See swiotlb_max_mapping_size().  We need to make

> sure that we don't get a scatterlist entry bigger than this size.  But I think

> this already happens because you set the device->dma_mask field in

> Patch 11 of this series.  __scsi_init_queue checks for this setting and

> sets max_sectors to limits transfers to the max mapping size.


I will double check.

> 

>>

>>   	cmd_request->payload = payload;

>> @@ -1860,13 +1911,20 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

>>   	put_cpu();

>>

>>   	if (ret == -EAGAIN) {

>> -		if (payload_sz > sizeof(cmd_request->mpb))

>> -			kfree(payload);

>>   		/* no more space */

>> -		return SCSI_MLQUEUE_DEVICE_BUSY;

>> +		ret = SCSI_MLQUEUE_DEVICE_BUSY;

>> +		goto free_dma_range;

>>   	}

>>

>>   	return 0;

>> +

>> +free_dma_range:

>> +	kfree(cmd_request->dma_range);

>> +

>> +free_payload:

>> +	if (payload_sz > sizeof(cmd_request->mpb))

>> +		kfree(payload);

>> +	return ret;

>>   }

>>

>>   static struct scsi_host_template scsi_driver = {

>> --

>> 2.25.1

>
Tianyu Lan Aug. 20, 2021, 3:37 p.m. UTC | #19
On 8/20/2021 11:20 PM, Tianyu Lan wrote:
>> The whole approach here is to do dma remapping on each individual page

>> of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to 

>> map

>> each scatterlist entry as a unit?  Each scatterlist entry describes a 

>> range of

>> physically contiguous memory.  After dma_map_sg(), the resulting dma

>> address must also refer to a physically contiguous range in the swiotlb

>> bounce buffer memory.   So at the top of the "for" loop over the 

>> scatterlist

>> entries, do dma_map_sg() if we're in an isolated VM.  Then compute the

>> hvpfn value based on the dma address instead of sg_page().  But 

>> everything

>> else is the same, and the inner loop for populating the pfn_arry is 

>> unmodified.

>> Furthermore, the dma_range array that you've added is not needed, since

>> scatterlist entries already have a dma_address field for saving the 

>> mapped

>> address, and dma_unmap_sg() uses that field.

> 

> I don't use dma_map_sg() here in order to avoid introducing one more 

> loop(e,g dma_map_sg()). We already have a loop to populate 

> cmd_request->dma_range[] and so do the dma map in the same loop.


Sorry for a typo. s/cmd_request->dma_range[]/payload->range.pfn_array[]/
Michael Kelley Aug. 20, 2021, 3:40 p.m. UTC | #20
From: hch@lst.de <hch@lst.de> Sent: Thursday, August 19, 2021 9:33 PM

> 

> On Thu, Aug 19, 2021 at 06:17:40PM +0000, Michael Kelley wrote:

> > >

> > > @@ -1824,6 +1848,13 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)

> > >  		payload->range.len = length;

> > >  		payload->range.offset = offset_in_hvpg;

> > >

> > > +		cmd_request->dma_range = kcalloc(hvpg_count,

> > > +				 sizeof(*cmd_request->dma_range),

> > > +				 GFP_ATOMIC);

> >

> > With this patch, it appears that storvsc_queuecommand() is always

> > doing bounce buffering, even when running in a non-isolated VM.

> > The dma_range is always allocated, and the inner loop below does

> > the dma mapping for every I/O page.  The corresponding code in

> > storvsc_on_channel_callback() that does the dma unmap allows for

> > the dma_range to be NULL, but that never happens.

> 

> Maybe I'm missing something in the hyperv code, but I don't think

> dma_map_page would bounce buffer for the non-isolated case.  It

> will just return the physical address.


OK, right.  In the isolated VM case, the swiotlb is in force mode
and will do bounce buffering.  In the non-isolated case,
dma_map_page_attrs() -> dma_direct_map_page() does a lot of
checking but eventually just returns the physical address.  As this
patch is currently coded, it adds a fair amount of overhead
here in storvsc_queuecommand(), plus the overhead of the dma
mapping function deciding to use the identity mapping.  But if
dma_map_sg() is used and the code is simplified a bit, the overhead
will be less in general and will be per sgl entry instead of per page.

> 

> > > +				if (offset_in_hvpg) {

> > > +					payload->range.offset = dma & ~HV_HYP_PAGE_MASK;

> > > +					offset_in_hvpg = 0;

> > > +				}

> >

> > I'm not clear on why payload->range.offset needs to be set again.

> > Even after the dma mapping is done, doesn't the offset in the first

> > page have to be the same?  If it wasn't the same, Hyper-V wouldn't

> > be able to process the PFN list correctly.  In fact, couldn't the above

> > code just always set offset_in_hvpg = 0?

> 

> Careful.  DMA mapping is supposed to keep the offset in the page, but

> for that the DMA mapping code needs to know what the device considers a

> "page".  For that the driver needs to set the min_align_mask field in

> struct device_dma_parameters.

> 


I see that the swiotlb code gets and uses the min_align_mask field.  But
the NVME driver is the only driver that ever sets it, so the value is zero
in all other cases.  Does swiotlb just use PAGE_SIZE in that case?  I
couldn't tell from a quick glance at the swiotlb code.

> >

> > The whole approach here is to do dma remapping on each individual page

> > of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map

> > each scatterlist entry as a unit?  Each scatterlist entry describes a range of

> > physically contiguous memory.  After dma_map_sg(), the resulting dma

> > address must also refer to a physically contiguous range in the swiotlb

> > bounce buffer memory.   So at the top of the "for" loop over the scatterlist

> > entries, do dma_map_sg() if we're in an isolated VM.  Then compute the

> > hvpfn value based on the dma address instead of sg_page().  But everything

> > else is the same, and the inner loop for populating the pfn_arry is unmodified.

> > Furthermore, the dma_range array that you've added is not needed, since

> > scatterlist entries already have a dma_address field for saving the mapped

> > address, and dma_unmap_sg() uses that field.

> 

> Yes, I think dma_map_sg is the right thing to use here, probably even

> for the non-isolated case so that we can get the hv drivers out of their

> little corner and into being more like a normal kernel driver.  That

> is, use the scsi_dma_map/scsi_dma_unmap helpers, and then iterate over

> the dma addresses one page at a time using for_each_sg_dma_page.

> 


Doing some broader revisions to the Hyper-V storvsc driver is up next on
my to-do list.  Rather than significantly modifying the non-isolated case in
this patch set, I'd suggest factoring it into my broader revisions.

Michael
Michael Kelley Aug. 20, 2021, 4:08 p.m. UTC | #21
From: Tianyu Lan <ltykernel@gmail.com> Sent: Friday, August 20, 2021 8:20 AM

> 

> On 8/20/2021 2:17 AM, Michael Kelley wrote:

> > From: Tianyu Lan <ltykernel@gmail.com> Sent: Monday, August 9, 2021 10:56 AM

> >

> > I'm not clear on why payload->range.offset needs to be set again.

> > Even after the dma mapping is done, doesn't the offset in the first

> > page have to be the same?  If it wasn't the same, Hyper-V wouldn't

> > be able to process the PFN list correctly.  In fact, couldn't the above

> > code just always set offset_in_hvpg = 0?

> 

> The offset will be changed. The swiotlb bounce buffer is allocated with

> IO_TLB_SIZE(2K) as unit. So the offset here may be changed.

> 


We need to prevent the offset from changing.  The storvsc driver passes
just a PFN list to Hyper-V, plus an overall starting offset and length.  Unlike
the netvsc driver, each entry in the PFN list does *not* have its own offset
and length.  Hyper-V assumes that the list is "dense" and that there are
no holes (i.e., unused memory areas).

For example, consider an original buffer passed into storvsc_queuecommand()
of 8 Kbytes, aligned so that the first 1 Kbyte sits at the end of the first page, then
4 Kbytes in the second page, and 3 Kbytes at the beginning of the third page.
The offset of that first 1 Kbyte has to remain 3 Kbytes.  If bounce buffering
moves it to a different offset, there's no way to tell Hyper-V to ignore the
remaining bytes in the first page (at least not without using a different
method to communicate with Hyper-V).   In such a case, the wrong
data will get transferred.  Presumably the easier solution is to set the
min_align_mask field as Christoph suggested.

> 

> >

> >>   	}

> >

> > The whole approach here is to do dma remapping on each individual page

> > of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map

> > each scatterlist entry as a unit?  Each scatterlist entry describes a range of

> > physically contiguous memory.  After dma_map_sg(), the resulting dma

> > address must also refer to a physically contiguous range in the swiotlb

> > bounce buffer memory.   So at the top of the "for" loop over the scatterlist

> > entries, do dma_map_sg() if we're in an isolated VM.  Then compute the

> > hvpfn value based on the dma address instead of sg_page().  But everything

> > else is the same, and the inner loop for populating the pfn_arry is unmodified.

> > Furthermore, the dma_range array that you've added is not needed, since

> > scatterlist entries already have a dma_address field for saving the mapped

> > address, and dma_unmap_sg() uses that field.

> 

> I don't use dma_map_sg() here in order to avoid introducing one more

> loop(e,g dma_map_sg()). We already have a loop to populate

> cmd_request->dma_range[] and so do the dma map in the same loop.

> 


I'm not seeing where the additional loop comes from.  Storvsc
already has a loop through the sgl entries.  Retain that loop and call
dma_map_sg() with nents set to 1.  Then the sequence is
dma_map_sg() --> dma_map_sg_attrs() --> dma_direct_map_sg() ->
dma_direct_map_page().  The latter function will call swiotlb_map()
to map all pages of the sgl entry as a single operation.

Michael
Tianyu Lan Aug. 20, 2021, 6:04 p.m. UTC | #22
On 8/21/2021 12:08 AM, Michael Kelley wrote:
>>>>    	}

>>> The whole approach here is to do dma remapping on each individual page

>>> of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map

>>> each scatterlist entry as a unit?  Each scatterlist entry describes a range of

>>> physically contiguous memory.  After dma_map_sg(), the resulting dma

>>> address must also refer to a physically contiguous range in the swiotlb

>>> bounce buffer memory.   So at the top of the "for" loop over the scatterlist

>>> entries, do dma_map_sg() if we're in an isolated VM.  Then compute the

>>> hvpfn value based on the dma address instead of sg_page().  But everything

>>> else is the same, and the inner loop for populating the pfn_arry is unmodified.

>>> Furthermore, the dma_range array that you've added is not needed, since

>>> scatterlist entries already have a dma_address field for saving the mapped

>>> address, and dma_unmap_sg() uses that field.

>> I don't use dma_map_sg() here in order to avoid introducing one more

>> loop(e,g dma_map_sg()). We already have a loop to populate

>> cmd_request->dma_range[] and so do the dma map in the same loop.

>>

> I'm not seeing where the additional loop comes from.  Storvsc

> already has a loop through the sgl entries.  Retain that loop and call

> dma_map_sg() with nents set to 1.  Then the sequence is

> dma_map_sg() --> dma_map_sg_attrs() --> dma_direct_map_sg() ->

> dma_direct_map_page().  The latter function will call swiotlb_map()

> to map all pages of the sgl entry as a single operation.


After dma_map_sg(), we still need to go through the scatter list again to
populate payload->range.pfn_array. We could go through the scatter list
just once if dma_map_sg() accepted a callback and ran it while walking
the scatter list.
Michael Kelley Aug. 20, 2021, 7:22 p.m. UTC | #23
From: Tianyu Lan <ltykernel@gmail.com> Sent: Friday, August 20, 2021 11:04 AM

>
> On 8/21/2021 12:08 AM, Michael Kelley wrote:
> >>>>    	}
> >>> The whole approach here is to do dma remapping on each individual page
> >>> of the I/O buffer.  But wouldn't it be possible to use dma_map_sg() to map
> >>> each scatterlist entry as a unit?  Each scatterlist entry describes a range of
> >>> physically contiguous memory.  After dma_map_sg(), the resulting dma
> >>> address must also refer to a physically contiguous range in the swiotlb
> >>> bounce buffer memory.  So at the top of the "for" loop over the scatterlist
> >>> entries, do dma_map_sg() if we're in an isolated VM.  Then compute the
> >>> hvpfn value based on the dma address instead of sg_page().  But everything
> >>> else is the same, and the inner loop for populating the pfn_array is unmodified.
> >>> Furthermore, the dma_range array that you've added is not needed, since
> >>> scatterlist entries already have a dma_address field for saving the mapped
> >>> address, and dma_unmap_sg() uses that field.
> >> I don't use dma_map_sg() here in order to avoid introducing one more
> >> loop (e.g. dma_map_sg()). We already have a loop to populate
> >> cmd_request->dma_range[] and so do the dma map in the same loop.
> >>
> > I'm not seeing where the additional loop comes from.  Storvsc
> > already has a loop through the sgl entries.  Retain that loop and call
> > dma_map_sg() with nents set to 1.  Then the sequence is
> > dma_map_sg() --> dma_map_sg_attrs() --> dma_direct_map_sg() -->
> > dma_direct_map_page().  The latter function will call swiotlb_map()
> > to map all pages of the sgl entry as a single operation.
>
> After dma_map_sg(), we still need to go through the scatter list again to
> populate payload->range.pfn_array. We could go through the scatter list
> just once if dma_map_sg() accepted a callback and ran it while walking
> the scatter list.

Here's some code for what I'm suggesting (not even compile tested).
The only change is what's in the "if" clause of the SNP test.  dma_map_sg()
is called with the nents parameter set to one so that it only
processes one sgl entry each time it is called, and doesn't walk the
entire sgl.  Arguably, we don't even need the SNP test and the else
clause -- just always do what's in the if clause.

The corresponding code in storvsc_on_channel_callback would also
have to be changed.   And we still have to set the min_align_mask
so swiotlb will preserve any offset.  Storvsc already has things set up
so that higher levels ensure there are no holes between sgl entries,
and that needs to stay true.

	if (sg_count) {
		unsigned int hvpgoff, hvpfns_to_add;
		unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
		unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
		u64 hvpfn;
		int nents;

		if (hvpg_count > MAX_PAGE_BUFFER_COUNT) {

			payload_sz = (hvpg_count * sizeof(u64) +
				      sizeof(struct vmbus_packet_mpb_array));
			payload = kzalloc(payload_sz, GFP_ATOMIC);
			if (!payload)
				return SCSI_MLQUEUE_DEVICE_BUSY;
		}

		payload->range.len = length;
		payload->range.offset = offset_in_hvpg;


		for (i = 0; sgl != NULL; sgl = sg_next(sgl)) {
			/*
			 * Init values for the current sgl entry. hvpgoff
			 * and hvpfns_to_add are in units of Hyper-V size
			 * pages. Handling the PAGE_SIZE != HV_HYP_PAGE_SIZE
			 * case also handles values of sgl->offset that are
			 * larger than PAGE_SIZE. Such offsets are handled
			 * even on other than the first sgl entry, provided
			 * they are a multiple of PAGE_SIZE.
			 */
			hvpgoff = HVPFN_DOWN(sgl->offset);

			if (hv_isolation_type_snp()) {
				nents = dma_map_sg(dev->device, sgl, 1, scmnd->sc_data_direction);
				if (nents != 1)
					return SCSI_MLQUEUE_DEVICE_BUSY; /* so higher levels will retry */
				hvpfn = HVPFN_DOWN(sg_dma_address(sgl)) + hvpgoff;
			} else {
				hvpfn = page_to_hvpfn(sg_page(sgl)) + hvpgoff;
			}

			hvpfns_to_add = HVPFN_UP(sgl->offset + sgl->length) -
						hvpgoff;

			/*
			 * Fill the next portion of the PFN array with
			 * sequential Hyper-V PFNs for the contiguous physical
			 * memory described by the sgl entry. The end of the
			 * last sgl should be reached at the same time that
			 * the PFN array is filled.
			 */
			while (hvpfns_to_add--)
				payload->range.pfn_array[i++] = hvpfn++;
		}
	}
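
The completion-side change would presumably be the mirror image of the
mapping above. A sketch only, under the same assumptions (dev and scmnd
stand in for whatever storvsc actually uses, and it matches mapping each
entry with nents set to 1 at submit time):

	if (hv_isolation_type_snp() && scsi_sg_count(scmnd)) {
		struct scatterlist *sg;
		int j;

		/* Undo the per-entry dma_map_sg(..., 1, ...) done when queuing */
		for_each_sg(scsi_sglist(scmnd), sg, scsi_sg_count(scmnd), j)
			dma_unmap_sg(dev, sg, 1, scmnd->sc_data_direction);
	}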
Christoph Hellwig Aug. 24, 2021, 8:45 a.m. UTC | #24
This patch fails to compile when CONFIG_AMD_MEM_ENCRYPT is not enabled,
in which case sev_es_ghcb_hv_call_simple() is not defined.
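
One common way to handle that kind of build failure, sketched here only as
an assumption (the signature is taken from this series and the actual fix
may look different), is a static inline stub for the !CONFIG_AMD_MEM_ENCRYPT
case:

	#ifdef CONFIG_AMD_MEM_ENCRYPT
	enum es_result sev_es_ghcb_hv_call_simple(struct ghcb *ghcb, u64 exit_code,
						  u64 exit_info_1, u64 exit_info_2);
	#else
	/* No SEV support built in: report the GHCB call as unsupported. */
	static inline enum es_result
	sev_es_ghcb_hv_call_simple(struct ghcb *ghcb, u64 exit_code,
				   u64 exit_info_1, u64 exit_info_2)
	{
		return ES_UNSUPPORTED;
	}
	#endif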
Christoph Hellwig Aug. 24, 2021, 8:46 a.m. UTC | #25
On Sat, Aug 21, 2021 at 02:04:11AM +0800, Tianyu Lan wrote:
> After dma_map_sg(), we still need to go through the scatter list again to
> populate payload->range.pfn_array. We could go through the scatter list
> just once if dma_map_sg() accepted a callback and ran it while walking
> the scatter list.

Iterating a cache hot array is way faster than doing lots of indirect
calls.
Christoph Hellwig Aug. 24, 2021, 8:49 a.m. UTC | #26
On Fri, Aug 20, 2021 at 03:40:08PM +0000, Michael Kelley wrote:
> I see that the swiotlb code gets and uses the min_align_mask field.  But
> the NVME driver is the only driver that ever sets it, so the value is zero
> in all other cases.  Does swiotlb just use PAGE_SIZE in that case?  I
> couldn't tell from a quick glance at the swiotlb code.

The encoding isn't all that common.  I only know it for the RDMA memory
registration format, and RDMA in general does not interact very well
with swiotlb (although the in-kernel drivers should work fine, it is
userspace RDMA that is the problem).  It seems recently a new driver
using the format (mpi3mr) also showed up.  All these should probably set
the min_align_mask.
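
For reference, setting min_align_mask is a one-line call against the struct
device used for DMA. NVMe does roughly the following in its PCI probe path,
and a storvsc change would presumably use the Hyper-V page size instead
(sketch only; the exact placement in storvsc is not shown in this thread):

	/* NVMe: make swiotlb preserve the offset within a controller page */
	dma_set_min_align_mask(&pdev->dev, NVME_CTRL_PAGE_SIZE - 1);

	/* A hypothetical storvsc equivalent, keyed to Hyper-V page size */
	dma_set_min_align_mask(&device->device, HV_HYP_PAGE_SIZE - 1);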