
[PROTOTYPE,3/6] vfio: Implement support for sparse RAM memory regions

Message ID 20200924160423.106747-4-david@redhat.com
State New
Series virtio-mem: vfio support

Commit Message

David Hildenbrand Sept. 24, 2020, 4:04 p.m. UTC
Implement support for sparse RAM, to be used by virtio-mem. Handling
is somewhat similar to memory_region_is_iommu() handling, which also
notifies on changes.

Instead of mapping the whole region, we only map selected pieces (and
unmap previously selected pieces) when notified by the SparseRAMHandler.
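
For reference, the notifier side of that interface, as consumed below (the
actual definitions live in an earlier patch of this series; this is only a
rough summary based on the call sites in hw/vfio/common.c):

typedef struct SparseRAMNotifier SparseRAMNotifier;

typedef int (*SparseRAMNotifyMap)(SparseRAMNotifier *n, const MemoryRegion *mr,
                                  uint64_t mr_offset, uint64_t size);
typedef void (*SparseRAMNotifyUnmap)(SparseRAMNotifier *n,
                                     const MemoryRegion *mr,
                                     uint64_t mr_offset, uint64_t size);

void sparse_ram_notifier_init(SparseRAMNotifier *n, SparseRAMNotifyMap map_fn,
                              SparseRAMNotifyUnmap unmap_fn);
bool memory_region_is_sparse_ram(MemoryRegion *mr);
uint64_t memory_region_sparse_ram_get_granularity(MemoryRegion *mr);
void memory_region_register_sparse_ram_notifier(MemoryRegion *mr,
                                                SparseRAMNotifier *n);
void memory_region_unregister_sparse_ram_notifier(MemoryRegion *mr,
                                                  SparseRAMNotifier *n);
int memory_region_sparse_ram_replay_mapped(MemoryRegion *mr,
                                           SparseRAMNotifier *n);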

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/vfio/common.c              | 155 ++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  12 +++
 2 files changed, 167 insertions(+)

Comments

Peter Xu Oct. 20, 2020, 7:44 p.m. UTC | #1
On Thu, Sep 24, 2020 at 06:04:20PM +0200, David Hildenbrand wrote:
> Implement support for sparse RAM, to be used by virtio-mem. Handling

> is somewhat-similar to memory_region_is_iommu() handling, which also

> notifies on changes.

> 

> Instead of mapping the whole region, we only map selected pieces (and

> unmap previously selected pieces) when notified by the SparseRAMHandler.


It works with vIOMMU too, right? :) As long as vfio_get_vaddr() works as
expected with virtio-mem plugging new memory, then I think the answer should
be yes.

If that's true, it may be worth mentioning somewhere, either in the commit
message or in a code comment, because it seems not that obvious.

So if you plan to do some real I/O in your future tests, it may also be worth
trying "-device intel-iommu" plus intel_iommu=on in the guest against the
same test.

Thanks,

-- 
Peter Xu
David Hildenbrand Oct. 20, 2020, 8:01 p.m. UTC | #2
On 20.10.20 21:44, Peter Xu wrote:
> On Thu, Sep 24, 2020 at 06:04:20PM +0200, David Hildenbrand wrote:

>> Implement support for sparse RAM, to be used by virtio-mem. Handling

>> is somewhat-similar to memory_region_is_iommu() handling, which also

>> notifies on changes.

>>

>> Instead of mapping the whole region, we only map selected pieces (and

>> unmap previously selected pieces) when notified by the SparseRAMHandler.

> 

> It works with vIOMMU too, right? :) As long as vfio_get_vaddr() works as

> expected with virtio-mem plugging new memories, then I think the answer should

> be yes.

> 


I haven't tried, but I guess as it's simply mapped into
&address_space_memory, it should work just fine.

> If it's true, maybe worth mention it somewhere either in the commit message or

> in the code comment, because it seems not that obvious.


I will test and add, thanks for the hint!

> 

> So if you have plan to do some real IOs in your future tests, may also worth

> try with the "-device intel-iommu" and intel_iommu=on in the guest against the

> same test.


Thanks ... but I have an AMD system. Will try to find out how to get
that running with AMD :)

-- 
Thanks,

David / dhildenb
Peter Xu Oct. 20, 2020, 8:44 p.m. UTC | #3
On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
> Thanks ... but I have an AMD system. Will try to find out how to get

> that running with AMD :)


You may still want to start by trying intel-iommu first. :) I think it should
work for AMD hosts too.

Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still
under review.

[1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/

-- 
Peter Xu
David Hildenbrand Nov. 12, 2020, 10:11 a.m. UTC | #4
On 20.10.20 22:44, Peter Xu wrote:
> On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:

>> Thanks ... but I have an AMD system. Will try to find out how to get

>> that running with AMD :)


I just did some more testing with the oldish GPU I have for that
purpose. Seems to work, at least I get video output that keeps
on working - did not try advanced things yet.

I use
-device vfio-pci,host=05:00.0,x-vga=on
-device vfio-pci,host=05:00.1

when adding "-device intel-iommu", I got

"qemu-system-x86_64: -device vfio-pci,host=05:00.1: vfio 0000:05:00.1: group 23 used in multiple address spaces"

... so I poked around the internet a bit and got it running with

-device intel-iommu,caching-mode=on \
-device pcie-pci-bridge,addr=1e.0,id=pci.1 \
-device vfio-pci,host=05:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
-device vfio-pci,host=05:00.1,bus=pci.1,addr=1.1 \

Things still seem to be working, so I assume it works
(I guess ?!).

-- 
Thanks,

David / dhildenb
David Hildenbrand Nov. 18, 2020, 1:04 p.m. UTC | #5
On 20.10.20 22:44, Peter Xu wrote:
> On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:

>> Thanks ... but I have an AMD system. Will try to find out how to get

>> that running with AMD :)

> 

> May still start with trying intel-iommu first. :) I think it should work for

> amd hosts too.

> 

> Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still

> during review.

> 

> [1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/

> 


I'm trying to get an iommu setup running (without virtio-mem!),
but it's a big mess.

Essential parts of my QEMU cmdline are:

sudo build/qemu-system-x86_64 \
    -accel kvm,kernel-irqchip=split \
    ...
    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
    -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
    -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \
    -device intel-iommu,caching-mode=on,intremap=on \

I am running upstream QEMU + a Linux -next kernel inside the
guest on an AMD Ryzen 9 3900X 12-core processor.
I am using SeaBIOS.

I tried faking an Intel CPU without luck.
("-cpu Skylake-Client,kvm=off,vendor=GenuineIntel")

As soon as I enable "intel_iommu=on" in my guest kernel, graphics
stop working (random mess on graphics output) and I get
  vfio-pci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0023 address=0xff924000 flags=0x0000]
in the hypervisor, along with other nice messages.

I can spot no vfio DMA mappings coming from an IOMMU, just as if the
guest didn't even try to set up the IOMMU.

I tried with
1. AMD Radeon RX Vega 56
2. Nvidia GT220
resulting in similar issues.

I also tried with "-device amd-iommu", with other issues
(the guest won't even boot up). Are my graphics cards missing some support, or
is there a fundamental flaw in my setup?

Any clues appreciated.


-- 
Thanks,

David / dhildenb
Peter Xu Nov. 18, 2020, 3:23 p.m. UTC | #6
David,

On Wed, Nov 18, 2020 at 02:04:00PM +0100, David Hildenbrand wrote:
> On 20.10.20 22:44, Peter Xu wrote:

> > On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:

> >> Thanks ... but I have an AMD system. Will try to find out how to get

> >> that running with AMD :)

> > 

> > May still start with trying intel-iommu first. :) I think it should work for

> > amd hosts too.

> > 

> > Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still

> > during review.

> > 

> > [1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/

> > 

> 

> I'm trying to get an iommu setup running (without virtio-mem!),

> but it's a big mess.

> 

> Essential parts of my QEMU cmdline are:

> 

> sudo build/qemu-system-x86_64 \

>     -accel kvm,kernel-irqchip=split \

>     ...

>      device pcie-pci-bridge,addr=1e.0,id=pci.1 \

>     -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \

>     -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \

>     -device intel-iommu,caching-mode=on,intremap=on \


The intel-iommu device needs to be created before the rest of the devices.  I
forgot the exact reason; it should be related to how the device address spaces
are created.  This rule should apply to all the rest of the vIOMMUs, afaiu.

Libvirt guarantees that ordering when VT-d is enabled, though when using the
qemu cmdline that's indeed hard to spot at first glance... iirc we tried
to fix this, but I forgot the details; it's just not trivial.

I noticed that this ordering constraint was also missing from the qemu wiki
page on VT-d, so I updated it there too, hopefully:

https://wiki.qemu.org/Features/VT-d#Command_Line_Example
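
For example, the command line from above would then look roughly like this
(same options, just with the vIOMMU created before the other devices):

sudo build/qemu-system-x86_64 \
    -accel kvm,kernel-irqchip=split \
    -device intel-iommu,caching-mode=on,intremap=on \
    ... \
    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
    -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
    -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1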

> 

> I am running upstream QEMU + Linux -next kernel inside the

> guest on an AMD Ryzen 9 3900X 12-Core Processor.

> I am using SeaBios.

> 

> I tried faking an Intel CPU without luck.

> ("-cpu Skylake-Client,kvm=off,vendor=GenuineIntel")

> 

> As soon as I enable "intel_iommu=on" in my guest kernel, graphics

> stop working (random mess on graphics output) and I get

>   vfio-pci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0023 address=0xff924000 flags=0x0000]

> in the hypervisor, along with other nice messages.

> 

> I can spot no vfio DMA mappings coming from an iommu, just as if the

> guest wouldn't even try to setup the iommu.

> 

> I tried with

> 1. AMD Radeon RX Vega 56

> 2. Nvidia GT220

> resulting in similar issues.

> 

> I also tried with "-device amd-iommu" with other issues

> (guest won't even boot up). Are my graphics card missing some support or

> is there a fundamental flaw in my setup?


I guess amd-iommu won't work without Wei Huang's series applied.

> 

> Any clues appreciated.


Please try the above and see whether it works.  Thanks,

-- 
Peter Xu
David Hildenbrand Nov. 18, 2020, 4:14 p.m. UTC | #7
On 18.11.20 16:23, Peter Xu wrote:
> David,

> 

> On Wed, Nov 18, 2020 at 02:04:00PM +0100, David Hildenbrand wrote:

>> On 20.10.20 22:44, Peter Xu wrote:

>>> On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:

>>>> Thanks ... but I have an AMD system. Will try to find out how to get

>>>> that running with AMD :)

>>>

>>> May still start with trying intel-iommu first. :) I think it should work for

>>> amd hosts too.

>>>

>>> Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still

>>> during review.

>>>

>>> [1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/

>>>

>>

>> I'm trying to get an iommu setup running (without virtio-mem!),

>> but it's a big mess.

>>

>> Essential parts of my QEMU cmdline are:

>>

>> sudo build/qemu-system-x86_64 \

>>      -accel kvm,kernel-irqchip=split \

>>      ...

>>       device pcie-pci-bridge,addr=1e.0,id=pci.1 \

>>      -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \

>>      -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \

>>      -device intel-iommu,caching-mode=on,intremap=on \

> 

> The intel-iommu device needs to be created before the rest of devices.  I

> forgot the reason behind, should be related to how the device address spaces

> are created.  This rule should apply to all the rest of vIOMMUs, afaiu.

> 

> Libvirt guarantees that ordering when VT-d enabled, though when using qemu

> cmdline indeed that's hard to identify from the first glance... iirc we tried

> to fix this, but I forgot the details, it's just not trivial.

> 

> I noticed that this ordering constraint is also missing in the qemu wiki page

> of vt-d, so I updated there too, hopefully..

> 

> https://wiki.qemu.org/Features/VT-d#Command_Line_Example

> 


That did the trick! Thanks!!!

virtio-mem + vfio + iommu seems to work. More testing to be done.

However, malicious guests can play nasty tricks like

a) Unplugging plugged virtio-mem blocks while they are mapped via an
    IOMMU

1. Guest: map memory location X located on a virtio-mem device inside a
    plugged block into the IOMMU
    -> QEMU IOMMU notifier: create vfio DMA mapping
    -> VFIO pins memory of unplugged blocks (populating memory)
2. Guest: Request to unplug memory location X via virtio-mem device
    -> QEMU virtio-mem: discards the memory.
    -> VFIO still has the memory pinned

We consume more memory than intended. If that virtio-mem memory were to get
replugged and used, we would have an inconsistency. IOMMU device resets fix
it (whereby all VFIO mappings are removed via the IOMMU notifier).


b) Mapping unplugged virtio-mem blocks via an IOMMU

1. Guest: map memory location X located on a virtio-mem device inside an
    unplugged block
    -> QEMU IOMMU notifier: create vfio DMA mapping
    -> VFIO pins memory of unplugged blocks (populating memory)

Memory that's supposed to be discarded now consumes memory. This is 
similar to a malicious guest simply writing to unplugged memory blocks 
(to be tackled with "protection of unplugged memory" in the future) - 
however memory will also get pinned.


To prohibit b) from happening, we would have to disallow creating the 
VFIO mapping (fairly easy).

To prohibit a), there would have to be some notification to IOMMU 
implementations to unmap/refresh whenever an IOMMU entry still points at 
memory that is getting discarded (and the VM is doing something it's not 
supposed to do).
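
For b), such a check could look roughly like the following
(memory_region_sparse_ram_is_mapped() is a made-up helper here, purely to
sketch the idea; it is not part of this prototype). vfio_get_xlat_addr()
could call it right after translating the IOVA, before handing the range to
vfio_dma_map():

/*
 * Sketch only: refuse to create a VFIO DMA mapping if the translation
 * target is a sparse RAM region and (part of) the range is currently
 * unplugged.
 */
static bool vfio_sparse_ram_xlat_allowed(MemoryRegion *mr, hwaddr xlat,
                                         hwaddr len)
{
    if (!memory_region_is_sparse_ram(mr)) {
        return true;
    }
    /* Only allow the mapping if the whole range is currently plugged. */
    return memory_region_sparse_ram_is_mapped(mr, xlat, len);
}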


>> As soon as I enable "intel_iommu=on" in my guest kernel, graphics

>> stop working (random mess on graphics output) and I get

>>    vfio-pci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0023 address=0xff924000 flags=0x0000]

>> in the hypervisor, along with other nice messages.

>>

>> I can spot no vfio DMA mappings coming from an iommu, just as if the

>> guest wouldn't even try to setup the iommu.

>>

>> I tried with

>> 1. AMD Radeon RX Vega 56

>> 2. Nvidia GT220

>> resulting in similar issues.

>>

>> I also tried with "-device amd-iommu" with other issues

>> (guest won't even boot up). Are my graphics card missing some support or

>> is there a fundamental flaw in my setup?

> 

> I guess amd-iommu won't work if without Wei Huang's series applied.


Oh, okay - I spotted it in QEMU and thought this was already working :)

-- 
Thanks,

David / dhildenb
Peter Xu Nov. 18, 2020, 5:01 p.m. UTC | #8
On Wed, Nov 18, 2020 at 05:14:22PM +0100, David Hildenbrand wrote:
> That did the trick! Thanks!!!


Great!  In the meantime, I've a few questions below, mainly about memory
unplugging, which could be naive - I know little about that, so please bear
with me. :)

> 

> virtio-mem + vfio + iommu seems to work. More testing to be done.

> 

> However, malicious guests can play nasty tricks like

> 

> a) Unplugging plugged virtio-mem blocks while they are mapped via an

>    IOMMU

> 

> 1. Guest: map memory location X located on a virtio-mem device inside a

>    plugged block into the IOMMU

>    -> QEMU IOMMU notifier: create vfio DMA mapping

>    -> VFIO pins memory of unplugged blocks (populating memory)

> 2. Guest: Request to unplug memory location X via virtio-mem device

>    -> QEMU virtio-mem: discards the memory.

>    -> VFIO still has the memory pinned


When unplugging some memory, does the user need to first do something to notify
the guest kernel that "this memory is going to be unplugged soon" (say, echo
"offline" to some dev file)?  Then the kernel should be responsible for
preparing for that before it really happens, e.g., migrating anonymous pages
out of this memory block.  I don't know what would happen if some pages on the
memblock were used for DMA like this and we wanted to unplug it.  Ideally I'd
think it should fail the "echo offline" operation with something like EBUSY if
it can't notify the device driver about this, or if that's hard to do.

IMHO this question is not really related to vIOMMU, but is a general question
about unplugging. Say, what would happen if we unplugged some memory with DMA
buffers without vIOMMU at all?  The buffer would be invalid right after
unplugging, so the guest kernel should either fail the unplug operation, or at
least tell the device drivers about this somehow?

> 

> We consume more memory than intended. In case virtio-memory would get

> replugged and used, we would have an inconsistency. IOMMU device resets/ fix

> it (whereby all VFIO mappings are removed via the IOMMU notifier).

> 

> 

> b) Mapping unplugged virtio-mem blocks via an IOMMU

> 

> 1. Guest: map memory location X located on a virtio-mem device inside an

>    unplugged block

>    -> QEMU IOMMU notifier: create vfio DMA mapping

>    -> VFIO pins memory of unplugged blocks (populating memory)


For this case, I would expect vfio_get_xlat_addr() to fail directly if the
guest driver forces mapping some IOVA onto an invalid range of the virtio-mem
device.  Even before that: since the guest should know that this region of
virtio-mem is not valid because it is unplugged, shouldn't the guest kernel
directly fail the dma_map() on such a region even before the mapping message
reaches QEMU?

Thanks,

> 

> Memory that's supposed to be discarded now consumes memory. This is similar

> to a malicious guest simply writing to unplugged memory blocks (to be

> tackled with "protection of unplugged memory" in the future) - however

> memory will also get pinned.

> 

> 

> To prohibit b) from happening, we would have to disallow creating the VFIO

> mapping (fairly easy).

> 

> To prohibit a), there would have to be some notification to IOMMU

> implementations to unmap/refresh whenever an IOMMU entry still points at

> memory that is getting discarded (and the VM is doing something it's not

> supposed to do).


-- 
Peter Xu
David Hildenbrand Nov. 18, 2020, 5:37 p.m. UTC | #9
>> virtio-mem + vfio + iommu seems to work. More testing to be done.

>>

>> However, malicious guests can play nasty tricks like

>>

>> a) Unplugging plugged virtio-mem blocks while they are mapped via an

>>     IOMMU

>>

>> 1. Guest: map memory location X located on a virtio-mem device inside a

>>     plugged block into the IOMMU

>>     -> QEMU IOMMU notifier: create vfio DMA mapping

>>     -> VFIO pins memory of unplugged blocks (populating memory)

>> 2. Guest: Request to unplug memory location X via virtio-mem device

>>     -> QEMU virtio-mem: discards the memory.

>>     -> VFIO still has the memory pinned

> 

> When unplug some memory, does the user need to first do something to notify the

> guest kernel that "this memory is going to be unplugged soon" (assuming echo

> "offline" to some dev file)?  Then the kernel should be responsible to prepare

> for that before it really happens, e.g., migrate anonymous pages out from this

> memory block.  I don't know what would happen if some pages on the memblock

> were used for DMA like this and we want to unplug it.  Ideally I thought it

> should fail the "echo offline" operation with something like EBUSY if it can't

> notify the device driver about this, or it's hard to.


In the very simple case (without resizable RAMBlocks/allocations/memory
regions) as implemented right now, a virtio-mem device really just
consists of a static RAM memory region that's mapped into guest physical
address space. The size of that region corresponds to the "maximum" size
a virtio-mem device can provide.

How much memory the VM should consume via such a device is expressed via
a "requested size". So when the hypervisor wants the VM to consume
less/more memory, it adjusts the "requested size".

It is up to the device driver in the guest to plug/unplug memory blocks 
(e.g., 4 MiB granularity), in order to reach the requested size. The 
device driver selects memory blocks within the device-assigned memory 
region and requests the hypervisor to (un)plug them - think of it as 
something similar (but different) to memory ballooning.

When requested to unplug memory by the hypervisor, the device driver in 
Linux will try to find memory blocks (e.g., 4 MiB) within the 
device-assigned memory region it can free up. This involves migrating 
pages away etc. Once that succeeds, nobody in the guest is using that
memory anymore; the guest requests the hypervisor to unplug that block,
resulting in QEMU discarding the memory. The guest has agreed not to touch
that memory anymore before officially requesting to "plug" it again via the
virtio-mem device.

There is no further action inside the guest required. A sane guest will 
never request to unplug memory that is still in use (similar to memory 
ballooning, where we don't inflate memory that is still in use).

But of course, a malicious guest could try doing that just to cause 
trouble.
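
As a toy model (this is not the Linux virtio_mem driver; all names below are
made up for illustration), the guest-side logic boils down to converging on
the requested size:

#include <stdbool.h>
#include <stdint.h>

struct toy_virtio_mem {
    uint64_t block_size;      /* e.g., 4 MiB */
    uint64_t plugged_size;    /* what the guest currently has plugged */
    uint64_t requested_size;  /* what the hypervisor asked for */
};

/* Stand-ins for "ask the hypervisor to (un)plug a block" and for
 * "free up a block inside the guest (migrate pages away etc.)". */
static bool toy_request_plug(struct toy_virtio_mem *d) { (void)d; return true; }
static void toy_request_unplug(struct toy_virtio_mem *d) { (void)d; }
static bool toy_evacuate_block(struct toy_virtio_mem *d) { (void)d; return true; }

void toy_resize_towards_request(struct toy_virtio_mem *d)
{
    while (d->plugged_size < d->requested_size) {
        if (!toy_request_plug(d)) {
            break;                        /* hypervisor could not plug more */
        }
        d->plugged_size += d->block_size;
    }
    while (d->plugged_size > d->requested_size) {
        if (!toy_evacuate_block(d)) {
            break;                        /* could not free a block right now */
        }
        toy_request_unplug(d);            /* QEMU discards that block's memory */
        d->plugged_size -= d->block_size;
    }
}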

> 

> IMHO this question not really related to vIOMMU, but a general question for

> unplugging. Say, what would happen if we unplug some memory with DMA buffers

> without vIOMMU at all?  The buffer will be invalid right after unplugging, so

> the guest kernel should either fail the operation trying to unplug, or at least

> tell the device drivers about this somehow?


A sane guest will never do that. The way memory is removed from Linux 
makes sure that there are no remaining users, otherwise it would be a BUG.

> 

>>

>> We consume more memory than intended. In case virtio-memory would get

>> replugged and used, we would have an inconsistency. IOMMU device resets/ fix

>> it (whereby all VFIO mappings are removed via the IOMMU notifier).

>>

>>

>> b) Mapping unplugged virtio-mem blocks via an IOMMU

>>

>> 1. Guest: map memory location X located on a virtio-mem device inside an

>>     unplugged block

>>     -> QEMU IOMMU notifier: create vfio DMA mapping

>>     -> VFIO pins memory of unplugged blocks (populating memory)

> 

> For this case, I would expect vfio_get_xlat_addr() to fail directly if the

> guest driver force to map some IOVA onto an invalid range of the virtio-mem

> device.  Even before that, since the guest should know that this region of

> virtio-mem is not valid since unplugged, so shouldn't the guest kernel directly

> fail the dma_map() upon such a region even before the mapping message reaching

> QEMU?


Again, sane guests will never do that, for the very reason you mentioned 
"the guest should know that this region of virtio-mem is not valid since 
unplugged,". But a malicious guest could try doing that to cause trouble :)

The memory region managed by a virtio-mem device is always fully mapped
into the system address space: one reason being that fragmenting it at
2 MiB granularity or similar would not be feasible (e.g., the KVM memory
slot limit, migration, ...), but there are other reasons. (Again,
similar to how memory ballooning works.)

vfio_get_xlat_addr() only checks whether that mapping exists. It would be
easy to ask the virtio-mem device (similar to what is done in this prototype)
whether that part of the identified memory region may be mapped by VFIO right
now.
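
Such a device-side query could be as simple as a lookup in a plugged-block
bitmap. A rough sketch (function and parameter names are made up, assuming
QEMU's bitops helpers):

#include "qemu/bitops.h"

/*
 * Hypothetical: decide whether a given range of the device-managed memory
 * region may currently be mapped by VFIO, based on a bitmap of plugged
 * blocks at 'block_size' granularity.
 */
static bool sparse_ram_range_is_plugged(const unsigned long *plugged_bitmap,
                                        uint64_t block_size,
                                        uint64_t offset, uint64_t size)
{
    const uint64_t first_bit = offset / block_size;
    const uint64_t last_bit = (offset + size - 1) / block_size;
    uint64_t bit;

    for (bit = first_bit; bit <= last_bit; bit++) {
        if (!test_bit(bit, plugged_bitmap)) {
            return false; /* part of the range is currently unplugged */
        }
    }
    return true;
}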

-- 
Thanks,

David / dhildenb
Peter Xu Nov. 18, 2020, 7:05 p.m. UTC | #10
On Wed, Nov 18, 2020 at 06:37:42PM +0100, David Hildenbrand wrote:
> > > a) Unplugging plugged virtio-mem blocks while they are mapped via an

> > >     IOMMU

> > > 

> > > 1. Guest: map memory location X located on a virtio-mem device inside a

> > >     plugged block into the IOMMU

> > >     -> QEMU IOMMU notifier: create vfio DMA mapping

> > >     -> VFIO pins memory of unplugged blocks (populating memory)

> > > 2. Guest: Request to unplug memory location X via virtio-mem device

> > >     -> QEMU virtio-mem: discards the memory.

> > >     -> VFIO still has the memory pinned


[...]

> > > b) Mapping unplugged virtio-mem blocks via an IOMMU

> > > 

> > > 1. Guest: map memory location X located on a virtio-mem device inside an

> > >     unplugged block

> > >     -> QEMU IOMMU notifier: create vfio DMA mapping

> > >     -> VFIO pins memory of unplugged blocks (populating memory)


[...]

> Again, sane guests will never do that, for the very reason you mentioned

> "the guest should know that this region of virtio-mem is not valid since

> unplugged,". But a malicious guest could try doing that to cause trouble :)


Oh I think I see your point now. :) And thanks for the write-up about how
virtio-mem works.

So it's about the malicious guests.

I agree with you that we can try to prevent the above from happening, e.g. by
teaching vfio_get_xlat_addr() to fail when it's going to map some unplugged
range of a virtio-mem device.

Or... imho, we may not even need to worry too much about those misuses of
virtio-mem, as long as the issue is self-contained within the buggy VM/process.
E.g., the worst such a malicious guest can do is fiddle around within the
maximum allowed memory size granted to the virtio-mem device, either getting
pages incorrectly pinned or getting some strange IOVA mapped to unplugged
pages within that range.  As long as it won't affect other VMs and the host,
and qemu won't crash because of that, then it seems ok.

-- 
Peter Xu
David Hildenbrand Nov. 18, 2020, 7:20 p.m. UTC | #11
> So it's about the malicious guests.

> 

> I agree with you that we can try to limit above from happening, e.g. by

> teaching vfio_get_xlat_addr() to fail when it's going to map some unplugged

> range of virtio-mem device.


Exactly.

> 

> Or... imho, we may not even need to worry too much on those misuses of

> virtio-mem? As long as the issue is self-contained within the buggy VM/process.

> E.g., the worst case of such a malicious guest is to fiddle around the maximum

> allowed memory size granted to the virtio-mem device to either have pages

> incorrectly pinned, or some strange IOVA mapped to unplugged pages within that

> range.  As long as it won't affect other VMs and the host, and qemu won't crash

> with that, then it seems ok.


Indeed, I have the same thoughts.

The only "ugly" thing is that our VM might not only consume more memory 
than intended, but also pin that memory (unmovable, unswappable ...). So 
I'm thinking about at least doing a warn_report_once() until we have 
some approach to handle that - for environments that might care about this.

Thanks for looking into this!

-- 
Thanks,

David / dhildenb

Patch

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 13471ae294..a3aaf70dd8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -37,6 +37,7 @@ 
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "qemu/units.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -498,6 +499,143 @@  out:
     rcu_read_unlock();
 }
 
+static int vfio_sparse_ram_notify(SparseRAMNotifier *n, const MemoryRegion *mr,
+                                  uint64_t mr_offset, uint64_t size,
+                                  bool map)
+{
+    VFIOSparseRAM *sram = container_of(n, VFIOSparseRAM, notifier);
+    const hwaddr mr_start = MAX(mr_offset, sram->offset_within_region);
+    const hwaddr mr_end = MIN(mr_offset + size,
+                              sram->offset_within_region + sram->size);
+    const hwaddr iova_start = mr_start + sram->offset_within_address_space;
+    const hwaddr iova_end = mr_end + sram->offset_within_address_space;
+    hwaddr mr_cur, iova_cur, mr_next;
+    void *vaddr;
+    int ret, ret2;
+
+    g_assert(mr == sram->mr);
+
+    /* We get notified about everything, ignore range we don't care about. */
+    if (mr_start >= mr_end) {
+        return 0;
+    }
+
+    /* Unmap everything with a single call. */
+    if (!map) {
+        ret = vfio_dma_unmap(sram->container, iova_start,
+                             iova_end - iova_start);
+        if (ret) {
+            error_report("%s: vfio_dma_unmap() failed: %s", __func__,
+                         strerror(-ret));
+        }
+        return 0;
+    }
+
+    /* TODO: fail early if we would exceed a specified number of mappings. */
+
+    /* Map in (aligned within MR) granularity, so we can unmap later. */
+    for (mr_cur = mr_start; mr_cur < mr_end; mr_cur = mr_next) {
+        iova_cur = mr_cur + sram->offset_within_address_space;
+        mr_next = QEMU_ALIGN_UP(mr_cur + 1, sram->granularity);
+        mr_next = MIN(mr_next, mr_end);
+
+        vaddr = memory_region_get_ram_ptr(sram->mr) + mr_cur;
+        ret = vfio_dma_map(sram->container, iova_cur, mr_next - mr_cur,
+                           vaddr, mr->readonly);
+        if (ret) {
+            /* Rollback in case of error. */
+            if (mr_cur != mr_start) {
+                ret2 = vfio_dma_unmap(sram->container, iova_start,
+                                      iova_end - iova_start);
+                if (ret2) {
+                    error_report("%s: vfio_dma_unmap() failed: %s", __func__,
+                                  strerror(-ret2));
+                }
+            }
+            return ret;
+        }
+    }
+    return 0;
+}
+
+static int vfio_sparse_ram_notify_map(SparseRAMNotifier *n,
+                                      const MemoryRegion *mr,
+                                      uint64_t mr_offset, uint64_t size)
+{
+    return vfio_sparse_ram_notify(n, mr, mr_offset, size, true);
+}
+
+static void vfio_sparse_ram_notify_unmap(SparseRAMNotifier *n,
+                                         const MemoryRegion *mr,
+                                         uint64_t mr_offset, uint64_t size)
+{
+    vfio_sparse_ram_notify(n, mr, mr_offset, size, false);
+}
+
+static void vfio_register_sparse_ram(VFIOContainer *container,
+                                     MemoryRegionSection *section)
+{
+    VFIOSparseRAM *sram;
+    int ret;
+
+    sram = g_new0(VFIOSparseRAM, 1);
+    sram->container = container;
+    sram->mr = section->mr;
+    sram->offset_within_region = section->offset_within_region;
+    sram->offset_within_address_space = section->offset_within_address_space;
+    sram->size = int128_get64(section->size);
+    sram->granularity = memory_region_sparse_ram_get_granularity(section->mr);
+
+    /*
+     * TODO: We usually want a bigger granularity (for a lot of added memory,
+     * as we need quite a lot of mappings) - however, this has to be configured
+     * by the user.
+     */
+    g_assert(sram->granularity >= 1 * MiB &&
+             is_power_of_2(sram->granularity));
+
+    /* Register the notifier */
+    sparse_ram_notifier_init(&sram->notifier, vfio_sparse_ram_notify_map,
+                             vfio_sparse_ram_notify_unmap);
+    memory_region_register_sparse_ram_notifier(section->mr, &sram->notifier);
+    QLIST_INSERT_HEAD(&container->sram_list, sram, next);
+    /*
+     * Replay mapped blocks - if anything goes wrong (only when hotplugging
+     * vfio devices), report the error for now.
+     *
+     * TODO: Can we catch this earlier?
+     */
+    ret = memory_region_sparse_ram_replay_mapped(section->mr, &sram->notifier);
+    if (ret) {
+        error_report("%s: failed to replay mappings: %s", __func__,
+                     strerror(-ret));
+    }
+}
+
+static void vfio_unregister_sparse_ram(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    VFIOSparseRAM *sram = NULL;
+
+    QLIST_FOREACH(sram, &container->sram_list, next) {
+        if (sram->mr == section->mr &&
+            sram->offset_within_region == section->offset_within_region &&
+            sram->offset_within_address_space ==
+            section->offset_within_address_space) {
+            break;
+        }
+    }
+
+    if (!sram) {
+        hw_error("vfio: Trying to unregister non-existent sparse RAM");
+    }
+
+    memory_region_unregister_sparse_ram_notifier(section->mr, &sram->notifier);
+    QLIST_REMOVE(sram, next);
+    g_free(sram);
+    /* The caller is expected to vfio_dma_unmap(). */
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -650,6 +788,15 @@  static void vfio_listener_region_add(MemoryListener *listener,
 
     /* Here we assume that memory_region_is_ram(section->mr)==true */
 
+    /*
+     * For sparse RAM, we only want to register the actually mapped
+     * pieces - and update the mapping whenever we're notified about changes.
+     */
+    if (memory_region_is_sparse_ram(section->mr)) {
+        vfio_register_sparse_ram(container, section);
+        return;
+    }
+
     vaddr = memory_region_get_ram_ptr(section->mr) +
             section->offset_within_region +
             (iova - section->offset_within_address_space);
@@ -786,6 +933,13 @@  static void vfio_listener_region_del(MemoryListener *listener,
 
         pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
         try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
+    } else if (memory_region_is_sparse_ram(section->mr)) {
+        vfio_unregister_sparse_ram(container, section);
+        /*
+         * We rely on a single vfio_dma_unmap() call below to clean the whole
+         * region.
+         */
+        try_unmap = true;
     }
 
     if (try_unmap) {
@@ -1275,6 +1429,7 @@  static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container->error = NULL;
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->sram_list);
 
     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78f3ff559..dfa18dbd8e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -77,6 +77,7 @@  typedef struct VFIOContainer {
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
+    QLIST_HEAD(, VFIOSparseRAM) sram_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
 
@@ -88,6 +89,17 @@  typedef struct VFIOGuestIOMMU {
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
+typedef struct VFIOSparseRAM {
+    VFIOContainer *container;
+    MemoryRegion *mr;
+    hwaddr offset_within_region;
+    hwaddr offset_within_address_space;
+    hwaddr size;
+    uint64_t granularity;
+    SparseRAMNotifier notifier;
+    QLIST_ENTRY(VFIOSparseRAM) next;
+} VFIOSparseRAM;
+
 typedef struct VFIOHostDMAWindow {
     hwaddr min_iova;
     hwaddr max_iova;