mbox series

[V4,00/13] x86/Hyper-V: Add Hyper-V Isolation VM support

Message ID 20210827172114.414281-1-ltykernel@gmail.com
Headers show
Series x86/Hyper-V: Add Hyper-V Isolation VM support | expand

Message

Tianyu Lan Aug. 27, 2021, 5:20 p.m. UTC
From: Tianyu Lan <Tianyu.Lan@microsoft.com>

Hyper-V provides two kinds of Isolation VMs. VBS(Virtualization-based
security) and AMD SEV-SNP unenlightened Isolation VMs. This patchset
is to add support for these Isolation VM support in Linux.

The memory of these vms are encrypted and host can't access guest
memory directly. Hyper-V provides new host visibility hvcall and
the guest needs to call new hvcall to mark memory visible to host
before sharing memory with host. For security, all network/storage
stack memory should not be shared with host and so there is bounce
buffer requests.

Vmbus channel ring buffer already plays bounce buffer role because
all data from/to host needs to copy from/to between the ring buffer
and IO stack memory. So mark vmbus channel ring buffer visible.

There are two exceptions - packets sent by vmbus_sendpacket_
pagebuffer() and vmbus_sendpacket_mpb_desc(). These packets
contains IO stack memory address and host will access these memory.
So add allocation bounce buffer support in vmbus for these packets.

For SNP isolation VM, guest needs to access the shared memory via
extra address space which is specified by Hyper-V CPUID HYPERV_CPUID_
ISOLATION_CONFIG. The access physical address of the shared memory
should be bounce buffer memory GPA plus with shared_gpa_boundary
reported by CPUID.

This patchset is based on the Hyper-V next branch.

Change since V3:
	- Initalize GHCB page in the cpu init callbac.
	- Change vmbus_teardown_gpadl() parameter in order to
	  mask the memory back to non-visible to host.
	- Merge hv_ringbuffer_post_init() into hv_ringbuffer_init().
	- Keep Hyper-V bounce buffer size as same as AMD SEV VM
	- Use dma_map_sg() instead of dm_map_page() in the storvsc driver.

Change since V2:
       - Drop x86_set_memory_enc static call and use platform check
         in the __set_memory_enc_dec() to run platform callback of
	 set memory encrypted or decrypted.

Change since V1:
       - Introduce x86_set_memory_enc static call and so platforms can
         override __set_memory_enc_dec() with their implementation
       - Introduce sev_es_ghcb_hv_call_simple() and share code
         between SEV and Hyper-V code.
       - Not remap monitor pages in the non-SNP isolation VM
       - Make swiotlb_init_io_tlb_mem() return error code and return
         error when dma_map_decrypted() fails.

Change since RFC V4:
       - Introduce dma map decrypted function to remap bounce buffer
          and provide dma map decrypted ops for platform to hook callback.        
       - Split swiotlb and dma map decrypted change into two patches
       - Replace vstart with vaddr in swiotlb changes.

Change since RFC v3:
       - Add interface set_memory_decrypted_map() to decrypt memory and
         map bounce buffer in extra address space
       - Remove swiotlb remap function and store the remap address
         returned by set_memory_decrypted_map() in swiotlb mem data structure.
       - Introduce hv_set_mem_enc() to make code more readable in the __set_memory_enc_dec().

Change since RFC v2:
       - Remove not UIO driver in Isolation VM patch
       - Use vmap_pfn() to replace ioremap_page_range function in
       order to avoid exposing symbol ioremap_page_range() and
       ioremap_page_range()
       - Call hv set mem host visibility hvcall in set_memory_encrypted/decrypted()
       - Enable swiotlb force mode instead of adding Hyper-V dma map/unmap hook
       - Fix code style


Tianyu Lan (13):
  x86/hyperv: Initialize GHCB page in Isolation VM
  x86/hyperv: Initialize shared memory boundary in the Isolation VM.
  x86/hyperv: Add new hvcall guest address host visibility support
  hyperv: Mark vmbus ring buffer visible to host in Isolation VM
  hyperv: Add Write/Read MSR registers via ghcb page
  hyperv: Add ghcb hvcall support for SNP VM
  hyperv/Vmbus: Add SNP support for VMbus channel initiate  message
  hyperv/vmbus: Initialize VMbus ring buffer for Isolation VM
  DMA: Add dma_map_decrypted/dma_unmap_encrypted() function
  x86/Swiotlb: Add Swiotlb bounce buffer remap function for HV IVM
  hyperv/IOMMU: Enable swiotlb bounce buffer for Isolation VM
  hv_netvsc: Add Isolation VM support for netvsc driver
  hv_storvsc: Add Isolation VM support for storvsc driver

 arch/arm64/include/asm/mshyperv.h  |  23 ++
 arch/x86/hyperv/Makefile           |   2 +-
 arch/x86/hyperv/hv_init.c          |  78 +++++--
 arch/x86/hyperv/ivm.c              | 325 +++++++++++++++++++++++++++++
 arch/x86/include/asm/hyperv-tlfs.h |  17 ++
 arch/x86/include/asm/mshyperv.h    |  88 +++++++-
 arch/x86/include/asm/sev.h         |   3 +
 arch/x86/kernel/cpu/mshyperv.c     |   5 +
 arch/x86/kernel/sev-shared.c       |  63 +++---
 arch/x86/mm/mem_encrypt.c          |   3 +-
 arch/x86/mm/pat/set_memory.c       |  19 +-
 arch/x86/xen/pci-swiotlb-xen.c     |   3 +-
 drivers/hv/Kconfig                 |   1 +
 drivers/hv/channel.c               |  55 +++--
 drivers/hv/connection.c            |  81 ++++++-
 drivers/hv/hv.c                    | 120 +++++++----
 drivers/hv/hv_common.c             |  12 ++
 drivers/hv/hyperv_vmbus.h          |   1 +
 drivers/hv/ring_buffer.c           |  56 +++--
 drivers/hv/vmbus_drv.c             |   4 +
 drivers/iommu/hyperv-iommu.c       |  61 ++++++
 drivers/net/hyperv/hyperv_net.h    |   6 +
 drivers/net/hyperv/netvsc.c        | 151 +++++++++++++-
 drivers/net/hyperv/rndis_filter.c  |   2 +
 drivers/scsi/storvsc_drv.c         |  41 ++--
 drivers/uio/uio_hv_generic.c       |  14 +-
 include/asm-generic/hyperv-tlfs.h  |   1 +
 include/asm-generic/mshyperv.h     |  19 +-
 include/linux/dma-map-ops.h        |   9 +
 include/linux/hyperv.h             |  15 +-
 include/linux/swiotlb.h            |   4 +
 kernel/dma/mapping.c               |  22 ++
 kernel/dma/swiotlb.c               |  32 ++-
 33 files changed, 1166 insertions(+), 170 deletions(-)
 create mode 100644 arch/x86/hyperv/ivm.c

Comments

Greg KH Aug. 27, 2021, 5:41 p.m. UTC | #1
On Fri, Aug 27, 2021 at 01:21:03PM -0400, Tianyu Lan wrote:
> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
> 
> Hyperv provides GHCB protocol to write Synthetic Interrupt
> Controller MSR registers in Isolation VM with AMD SEV SNP
> and these registers are emulated by hypervisor directly.
> Hyperv requires to write SINTx MSR registers twice. First
> writes MSR via GHCB page to communicate with hypervisor
> and then writes wrmsr instruction to talk with paravisor
> which runs in VMPL0. Guest OS ID MSR also needs to be set
> via GHCB page.
> 
> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> ---
> Change since v1:
>          * Introduce sev_es_ghcb_hv_call_simple() and share code
>            between SEV and Hyper-V code.
> Change since v3:
>          * Pass old_msg_type to hv_signal_eom() as parameter.
> 	 * Use HV_REGISTER_* marcro instead of HV_X64_MSR_*
> 	 * Add hv_isolation_type_snp() weak function.
> 	 * Add maros to set syinc register in ARM code.
> ---
>  arch/arm64/include/asm/mshyperv.h |  23 ++++++
>  arch/x86/hyperv/hv_init.c         |  36 ++--------
>  arch/x86/hyperv/ivm.c             | 112 ++++++++++++++++++++++++++++++
>  arch/x86/include/asm/mshyperv.h   |  80 ++++++++++++++++++++-
>  arch/x86/include/asm/sev.h        |   3 +
>  arch/x86/kernel/sev-shared.c      |  63 ++++++++++-------
>  drivers/hv/hv.c                   | 112 ++++++++++++++++++++----------
>  drivers/hv/hv_common.c            |   6 ++
>  include/asm-generic/mshyperv.h    |   4 +-
>  9 files changed, 345 insertions(+), 94 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
> index 20070a847304..ced83297e009 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -41,6 +41,29 @@ static inline u64 hv_get_register(unsigned int reg)
>  	return hv_get_vpreg(reg);
>  }
>  
> +#define hv_get_simp(val)	{ val = hv_get_register(HV_REGISTER_SIMP); }
> +#define hv_set_simp(val)	hv_set_register(HV_REGISTER_SIMP, val)
> +
> +#define hv_get_siefp(val)	{ val = hv_get_register(HV_REGISTER_SIEFP); }
> +#define hv_set_siefp(val)	hv_set_register(HV_REGISTER_SIEFP, val)
> +
> +#define hv_get_synint_state(int_num, val) {			\
> +	val = hv_get_register(HV_REGISTER_SINT0 + int_num);	\
> +	}
> +
> +#define hv_set_synint_state(int_num, val)			\
> +	hv_set_register(HV_REGISTER_SINT0 + int_num, val)
> +
> +#define hv_get_synic_state(val) {			\
> +	val = hv_get_register(HV_REGISTER_SCONTROL);	\
> +	}
> +
> +#define hv_set_synic_state(val)			\
> +	hv_set_register(HV_REGISTER_SCONTROL, val)
> +
> +#define hv_signal_eom(old_msg_type)		 \
> +	hv_set_register(HV_REGISTER_EOM, 0)

Please just use real inline functions and not #defines if you really
need it.

thanks,

greg k-h
Tianyu Lan Aug. 27, 2021, 5:46 p.m. UTC | #2
On 8/28/2021 1:41 AM, Greg KH wrote:
> On Fri, Aug 27, 2021 at 01:21:03PM -0400, Tianyu Lan wrote:
>> From: Tianyu Lan <Tianyu.Lan@microsoft.com>
>>
>> Hyperv provides GHCB protocol to write Synthetic Interrupt
>> Controller MSR registers in Isolation VM with AMD SEV SNP
>> and these registers are emulated by hypervisor directly.
>> Hyperv requires to write SINTx MSR registers twice. First
>> writes MSR via GHCB page to communicate with hypervisor
>> and then writes wrmsr instruction to talk with paravisor
>> which runs in VMPL0. Guest OS ID MSR also needs to be set
>> via GHCB page.
>>
>> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
>> ---
>> Change since v1:
>>           * Introduce sev_es_ghcb_hv_call_simple() and share code
>>             between SEV and Hyper-V code.
>> Change since v3:
>>           * Pass old_msg_type to hv_signal_eom() as parameter.
>> 	 * Use HV_REGISTER_* marcro instead of HV_X64_MSR_*
>> 	 * Add hv_isolation_type_snp() weak function.
>> 	 * Add maros to set syinc register in ARM code.
>> ---
>>   arch/arm64/include/asm/mshyperv.h |  23 ++++++
>>   arch/x86/hyperv/hv_init.c         |  36 ++--------
>>   arch/x86/hyperv/ivm.c             | 112 ++++++++++++++++++++++++++++++
>>   arch/x86/include/asm/mshyperv.h   |  80 ++++++++++++++++++++-
>>   arch/x86/include/asm/sev.h        |   3 +
>>   arch/x86/kernel/sev-shared.c      |  63 ++++++++++-------
>>   drivers/hv/hv.c                   | 112 ++++++++++++++++++++----------
>>   drivers/hv/hv_common.c            |   6 ++
>>   include/asm-generic/mshyperv.h    |   4 +-
>>   9 files changed, 345 insertions(+), 94 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
>> index 20070a847304..ced83297e009 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -41,6 +41,29 @@ static inline u64 hv_get_register(unsigned int reg)
>>   	return hv_get_vpreg(reg);
>>   }
>>   
>> +#define hv_get_simp(val)	{ val = hv_get_register(HV_REGISTER_SIMP); }
>> +#define hv_set_simp(val)	hv_set_register(HV_REGISTER_SIMP, val)
>> +
>> +#define hv_get_siefp(val)	{ val = hv_get_register(HV_REGISTER_SIEFP); }
>> +#define hv_set_siefp(val)	hv_set_register(HV_REGISTER_SIEFP, val)
>> +
>> +#define hv_get_synint_state(int_num, val) {			\
>> +	val = hv_get_register(HV_REGISTER_SINT0 + int_num);	\
>> +	}
>> +
>> +#define hv_set_synint_state(int_num, val)			\
>> +	hv_set_register(HV_REGISTER_SINT0 + int_num, val)
>> +
>> +#define hv_get_synic_state(val) {			\
>> +	val = hv_get_register(HV_REGISTER_SCONTROL);	\
>> +	}
>> +
>> +#define hv_set_synic_state(val)			\
>> +	hv_set_register(HV_REGISTER_SCONTROL, val)
>> +
>> +#define hv_signal_eom(old_msg_type)		 \
>> +	hv_set_register(HV_REGISTER_EOM, 0)
> 
> Please just use real inline functions and not #defines if you really
> need it.
> 

OK. Will update. Thanks.
Christoph Hellwig Aug. 30, 2021, noon UTC | #3
Sorry for the delayed answer, but I look at the vmap_pfn usage in the
previous version and tried to come up with a better version.  This
mostly untested branch:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hyperv-vmap

get us there for swiotlb and the channel infrastructure  I've started
looking at the network driver and didn't get anywhere due to other work.

As far as I can tell the network driver does gigantic multi-megabyte
vmalloc allocation for the send and receive buffers, which are then
passed to the hardware, but always copied to/from when interacting
with the networking stack.  Did I see that right?  Are these big
buffers actually required unlike the normal buffer management schemes
in other Linux network drivers?

If so I suspect the best way to allocate them is by not using vmalloc
but just discontiguous pages, and then use kmap_local_pfn where the
PFN includes the share_gpa offset when actually copying from/to the
skbs.
Tianyu Lan Aug. 31, 2021, 3:20 p.m. UTC | #4
Hi Christoph:

On 8/30/2021 8:00 PM, Christoph Hellwig wrote:
> Sorry for the delayed answer, but I look at the vmap_pfn usage in the

> previous version and tried to come up with a better version.  This

> mostly untested branch:

> 

> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hyperv-vmap


No problem. Thank you very much for your suggestion patches and they are 
very helpful.


> 

> get us there for swiotlb and the channel infrastructure  I've started

> looking at the network driver and didn't get anywhere due to other work.

> 

> As far as I can tell the network driver does gigantic multi-megabyte

> vmalloc allocation for the send and receive buffers, which are then

> passed to the hardware, but always copied to/from when interacting

> with the networking stack.  Did I see that right?  Are these big

> buffers actually required unlike the normal buffer management schemes

> in other Linux network drivers?



For send packet, netvsc tries batching packet in send buffer if 
possible. It passes the original skb pages directly to
hypervisor when send buffer is not enough or packet length is larger 
than section size. These packets are sent via 
vmbus_sendpacket_pagebuffer() finally. Please see netvsc_send() for 
detail. The following code is to check whether the packet could be 
copied into send buffer. If not, the packet will be sent with original 
skb pages.

1239        /* batch packets in send buffer if possible */
1240        msdp = &nvchan->msd;
1241        if (msdp->pkt)
1242                msd_len = msdp->pkt->total_data_buflen;
1243
1244        try_batch =  msd_len > 0 && msdp->count < net_device->max_pkt;
1245        if (try_batch && msd_len + pktlen + net_device->pkt_align <
1246            net_device->send_section_size) {
1247                section_index = msdp->pkt->send_buf_index;
1248
1249        } else if (try_batch && msd_len + packet->rmsg_size <
1250                   net_device->send_section_size) {
1251                section_index = msdp->pkt->send_buf_index;
1252                packet->cp_partial = true;
1253
1254        } else if (pktlen + net_device->pkt_align <
1255                   net_device->send_section_size) {
1256                section_index = 
netvsc_get_next_send_section(net_device);
1257                if (unlikely(section_index == NETVSC_INVALID_INDEX)) {
1258                        ++ndev_ctx->eth_stats.tx_send_full;
1259                } else {
1260                        move_pkt_msd(&msd_send, &msd_skb, msdp);
1261                        msd_len = 0;
1262                }
1263        }
1264



For receive packet, the data is always copied from recv buffer.

> 

> If so I suspect the best way to allocate them is by not using vmalloc

> but just discontiguous pages, and then use kmap_local_pfn where the

> PFN includes the share_gpa offset when actually copying from/to the

> skbs.

> 

When netvsc needs to copy packet data to send buffer, it needs to 
caculate position with section_index and send_section_size.
Please seee netvsc_copy_to_send_buf() detail. So the contiguous virtual 
address of send buffer is necessary to copy data and batch packets.
Michael Kelley Aug. 31, 2021, 5:16 p.m. UTC | #5
From: Christoph Hellwig <hch@lst.de> Sent: Monday, August 30, 2021 5:01 AM
> 
> Sorry for the delayed answer, but I look at the vmap_pfn usage in the
> previous version and tried to come up with a better version.  This
> mostly untested branch:
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hyperv-vmap
> 
> get us there for swiotlb and the channel infrastructure  I've started
> looking at the network driver and didn't get anywhere due to other work.
> 
> As far as I can tell the network driver does gigantic multi-megabyte
> vmalloc allocation for the send and receive buffers, which are then
> passed to the hardware, but always copied to/from when interacting
> with the networking stack.  Did I see that right?  Are these big
> buffers actually required unlike the normal buffer management schemes
> in other Linux network drivers?
> 
> If so I suspect the best way to allocate them is by not using vmalloc
> but just discontiguous pages, and then use kmap_local_pfn where the
> PFN includes the share_gpa offset when actually copying from/to the
> skbs.

As a quick overview, I think there are four places where the
shared_gpa_boundary must be applied to adjust the guest physical
address that is used.  Each requires mapping a corresponding
virtual address range.  Here are the four places:

1)  The so-called "monitor pages" that are a core communication
mechanism between the guest and Hyper-V.  These are two single
pages, and the mapping is handled by calling memremap() for
each of the two pages.  See Patch 7 of Tianyu's series.

2)  The VMbus channel ring buffers.  You have proposed using
your new  vmap_phys_range() helper, but I don't think that works
here.  More details below.

3)  The network driver send and receive buffers.  vmap_phys_range()
should work here.

4) The swiotlb memory used for bounce buffers.  vmap_phys_range()
should work here as well.

Case #2 above does unusual mapping.  The ring buffer consists of a ring
buffer header page, followed by one or more pages that are the actual
ring buffer.  The pages making up the actual ring buffer are mapped
twice in succession.  For example, if the ring buffer has 4 pages
(one header page and three ring buffer pages), the contiguous
virtual mapping must cover these seven pages:  0, 1, 2, 3, 1, 2, 3.
The duplicate contiguous mapping allows the code that is reading
or writing the actual ring buffer to not be concerned about wrap-around
because writing off the end of the ring buffer is automatically
wrapped-around by the mapping.  The amount of data read or
written in one batch never exceeds the size of the ring buffer, and
after a batch is read or written, the read or write indices are adjusted
to put them back into the range of the first mapping of the actual
ring buffer pages.  So there's method to the madness, and the
technique works pretty well.  But this kind of mapping is not
amenable to using vmap_phys_range().

Michael
Christoph Hellwig Sept. 2, 2021, 7:51 a.m. UTC | #6
On Tue, Aug 31, 2021 at 11:20:06PM +0800, Tianyu Lan wrote:
>> If so I suspect the best way to allocate them is by not using vmalloc
>> but just discontiguous pages, and then use kmap_local_pfn where the
>> PFN includes the share_gpa offset when actually copying from/to the
>> skbs.
>>
> When netvsc needs to copy packet data to send buffer, it needs to caculate 
> position with section_index and send_section_size.
> Please seee netvsc_copy_to_send_buf() detail. So the contiguous virtual 
> address of send buffer is necessary to copy data and batch packets.

Actually that makes the kmap approach much easier.  The phys_to_virt
can just be replaced with a kmap_local_pfn and the unmap needs to
be added.  I've been mostly focussing on the receive path, which
would need a similar treatment.
Christoph Hellwig Sept. 2, 2021, 7:59 a.m. UTC | #7
On Tue, Aug 31, 2021 at 05:16:19PM +0000, Michael Kelley wrote:
> As a quick overview, I think there are four places where the
> shared_gpa_boundary must be applied to adjust the guest physical
> address that is used.  Each requires mapping a corresponding
> virtual address range.  Here are the four places:
> 
> 1)  The so-called "monitor pages" that are a core communication
> mechanism between the guest and Hyper-V.  These are two single
> pages, and the mapping is handled by calling memremap() for
> each of the two pages.  See Patch 7 of Tianyu's series.

Ah, interesting.

> 3)  The network driver send and receive buffers.  vmap_phys_range()
> should work here.

Actually it won't.  The problem with these buffers is that they are
physically non-contiguous allocations.  We really have two sensible
options:

 1) use vmap_pfn as in the current series.  But in that case I think
    we should get rid of the other mapping created by vmalloc.  I
    though a bit about finding a way to apply the offset in vmalloc
    itself, but I think it would be too invasive to the normal fast
    path.  So the other sub-option would be to allocate the pages
    manually (maybe even using high order allocations to reduce TLB
    pressure) and then remap them
 2) do away with the contiguous kernel mapping entirely.  This means
    the simple memcpy calls become loops over kmap_local_pfn.  As
    I just found out for the send side that would be pretty easy,
    but the receive side would be more work.  We'd also need to check
    the performance implications.

> 4) The swiotlb memory used for bounce buffers.  vmap_phys_range()
> should work here as well.

Or memremap if it works for 1.

> Case #2 above does unusual mapping.  The ring buffer consists of a ring
> buffer header page, followed by one or more pages that are the actual
> ring buffer.  The pages making up the actual ring buffer are mapped
> twice in succession.  For example, if the ring buffer has 4 pages
> (one header page and three ring buffer pages), the contiguous
> virtual mapping must cover these seven pages:  0, 1, 2, 3, 1, 2, 3.
> The duplicate contiguous mapping allows the code that is reading
> or writing the actual ring buffer to not be concerned about wrap-around
> because writing off the end of the ring buffer is automatically
> wrapped-around by the mapping.  The amount of data read or
> written in one batch never exceeds the size of the ring buffer, and
> after a batch is read or written, the read or write indices are adjusted
> to put them back into the range of the first mapping of the actual
> ring buffer pages.  So there's method to the madness, and the
> technique works pretty well.  But this kind of mapping is not
> amenable to using vmap_phys_range().

Hmm.  Can you point me to where this is mapped?  Especially for the
classic non-isolated case where no vmap/vmalloc mapping is involved
at all?
Tianyu Lan Sept. 2, 2021, 11:21 a.m. UTC | #8
On 9/2/2021 3:59 PM, Christoph Hellwig wrote:
> On Tue, Aug 31, 2021 at 05:16:19PM +0000, Michael Kelley wrote:

>> As a quick overview, I think there are four places where the

>> shared_gpa_boundary must be applied to adjust the guest physical

>> address that is used.  Each requires mapping a corresponding

>> virtual address range.  Here are the four places:

>>

>> 1)  The so-called "monitor pages" that are a core communication

>> mechanism between the guest and Hyper-V.  These are two single

>> pages, and the mapping is handled by calling memremap() for

>> each of the two pages.  See Patch 7 of Tianyu's series.

> 

> Ah, interesting.

> 

>> 3)  The network driver send and receive buffers.  vmap_phys_range()

>> should work here.

> 

> Actually it won't.  The problem with these buffers is that they are

> physically non-contiguous allocations.  We really have two sensible

> options:

> 

>   1) use vmap_pfn as in the current series.  But in that case I think

>      we should get rid of the other mapping created by vmalloc.  I

>      though a bit about finding a way to apply the offset in vmalloc

>      itself, but I think it would be too invasive to the normal fast

>      path.  So the other sub-option would be to allocate the pages

>      manually (maybe even using high order allocations to reduce TLB

>      pressure) and then remap them


Agree. In such case, the map for memory below shared_gpa_boundary is not 
necessary. allocate_pages() is limited by MAX_ORDER and needs to be 
called repeatedly to get enough memory.

>   2) do away with the contiguous kernel mapping entirely.  This means

>      the simple memcpy calls become loops over kmap_local_pfn.  As

>      I just found out for the send side that would be pretty easy,

>      but the receive side would be more work.  We'd also need to check

>      the performance implications.


kmap_local_pfn() requires pfn with backing struct page and this doesn't 
work pfn above shared_gpa_boundary.
> 

>> 4) The swiotlb memory used for bounce buffers.  vmap_phys_range()

>> should work here as well.

> 

> Or memremap if it works for 1.


Now use vmap_pfn() and the hv map function is reused in the netvsc driver.

> 

>> Case #2 above does unusual mapping.  The ring buffer consists of a ring

>> buffer header page, followed by one or more pages that are the actual

>> ring buffer.  The pages making up the actual ring buffer are mapped

>> twice in succession.  For example, if the ring buffer has 4 pages

>> (one header page and three ring buffer pages), the contiguous

>> virtual mapping must cover these seven pages:  0, 1, 2, 3, 1, 2, 3.

>> The duplicate contiguous mapping allows the code that is reading

>> or writing the actual ring buffer to not be concerned about wrap-around

>> because writing off the end of the ring buffer is automatically

>> wrapped-around by the mapping.  The amount of data read or

>> written in one batch never exceeds the size of the ring buffer, and

>> after a batch is read or written, the read or write indices are adjusted

>> to put them back into the range of the first mapping of the actual

>> ring buffer pages.  So there's method to the madness, and the

>> technique works pretty well.  But this kind of mapping is not

>> amenable to using vmap_phys_range().

> 

> Hmm.  Can you point me to where this is mapped?  Especially for the

> classic non-isolated case where no vmap/vmalloc mapping is involved

> at all?

> 


This is done via vmap() in the hv_ringbuffer_init()

182/* Initialize the ring buffer. */
183int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
184                       struct page *pages, u32 page_cnt, u32 
max_pkt_size)
185{
186        int i;
187        struct page **pages_wraparound;
188
189        BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));
190
191        /*
192         * First page holds struct hv_ring_buffer, do wraparound 
mapping for
193         * the rest.
194         */
195        pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct 
page *),
196                                   GFP_KERNEL);
197        if (!pages_wraparound)
198                return -ENOMEM;
199
/* prepare to wrap page array */
200        pages_wraparound[0] = pages;
201        for (i = 0; i < 2 * (page_cnt - 1); i++)
202                pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];
203
/* map */
204        ring_info->ring_buffer = (struct hv_ring_buffer *)
205                vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, 
PAGE_KERNEL);
206
207        kfree(pages_wraparound);
208
209
210        if (!ring_info->ring_buffer)
211                return -ENOMEM;
212
213        ring_info->ring_buffer->read_index =
214                ring_info->ring_buffer->write_index = 0;
Michael Kelley Sept. 2, 2021, 3:57 p.m. UTC | #9
From: Christoph Hellwig <hch@lst.de> Sent: Thursday, September 2, 2021 1:00 AM
> 
> On Tue, Aug 31, 2021 at 05:16:19PM +0000, Michael Kelley wrote:
> > As a quick overview, I think there are four places where the
> > shared_gpa_boundary must be applied to adjust the guest physical
> > address that is used.  Each requires mapping a corresponding
> > virtual address range.  Here are the four places:
> >
> > 1)  The so-called "monitor pages" that are a core communication
> > mechanism between the guest and Hyper-V.  These are two single
> > pages, and the mapping is handled by calling memremap() for
> > each of the two pages.  See Patch 7 of Tianyu's series.
> 
> Ah, interesting.
> 
> > 3)  The network driver send and receive buffers.  vmap_phys_range()
> > should work here.
> 
> Actually it won't.  The problem with these buffers is that they are
> physically non-contiguous allocations.  

Indeed you are right.  These buffers are allocated with vzalloc().

> We really have two sensible options:
> 
>  1) use vmap_pfn as in the current series.  But in that case I think
>     we should get rid of the other mapping created by vmalloc.  I
>     though a bit about finding a way to apply the offset in vmalloc
>     itself, but I think it would be too invasive to the normal fast
>     path.  So the other sub-option would be to allocate the pages
>     manually (maybe even using high order allocations to reduce TLB
>     pressure) and then remap them

What's the benefit of getting rid of the other mapping created by
vmalloc if it isn't referenced?  Just page table space?  The default sizes
are a 16 Meg receive buffer and a 1 Meg send buffer for each VMbus
channel used by netvsc, and usually the max number of channels
is 8.  So there's 128 Meg of virtual space to be saved on the receive
buffers,  which could be worth it.

Allocating the pages manually is also an option, but we have to
be careful about high order allocations.  While typically these buffers
are allocated during system boot, these synthetic NICs can be hot
added and removed while the VM is running.   The channel count
can also be changed while the VM is running.  So multiple 16 Meg
receive buffer allocations may need to be done after the system has
been running a long time.

>  2) do away with the contiguous kernel mapping entirely.  This means
>     the simple memcpy calls become loops over kmap_local_pfn.  As
>     I just found out for the send side that would be pretty easy,
>     but the receive side would be more work.  We'd also need to check
>     the performance implications.

Doing away with the contiguous kernel mapping entirely seems like
it would result in fairly messy code to access the buffer.  What's the
benefit of doing away with the mapping?  I'm not an expert on the
netvsc driver, but decoding the incoming packets is already fraught
with complexities because of the nature of the protocol with Hyper-V.
The contiguous kernel mapping at least keeps the basics sane.

> 
> > 4) The swiotlb memory used for bounce buffers.  vmap_phys_range()
> > should work here as well.
> 
> Or memremap if it works for 1.
> 
> > Case #2 above does unusual mapping.  The ring buffer consists of a ring
> > buffer header page, followed by one or more pages that are the actual
> > ring buffer.  The pages making up the actual ring buffer are mapped
> > twice in succession.  For example, if the ring buffer has 4 pages
> > (one header page and three ring buffer pages), the contiguous
> > virtual mapping must cover these seven pages:  0, 1, 2, 3, 1, 2, 3.
> > The duplicate contiguous mapping allows the code that is reading
> > or writing the actual ring buffer to not be concerned about wrap-around
> > because writing off the end of the ring buffer is automatically
> > wrapped-around by the mapping.  The amount of data read or
> > written in one batch never exceeds the size of the ring buffer, and
> > after a batch is read or written, the read or write indices are adjusted
> > to put them back into the range of the first mapping of the actual
> > ring buffer pages.  So there's method to the madness, and the
> > technique works pretty well.  But this kind of mapping is not
> > amenable to using vmap_phys_range().
> 
> Hmm.  Can you point me to where this is mapped?  Especially for the
> classic non-isolated case where no vmap/vmalloc mapping is involved
> at all?

The existing code is in hv_ringbuffer_init() in drivers/hv/ring_buffer.c.
The code hasn't changed in a while, so any recent upstream code tree
is valid to look at.  The memory pages are typically allocated
in vmbus_alloc_ring() in drivers/hv/channel.c.

Michael