Message ID | 20200924160423.106747-4-david@redhat.com
---|---
State | New
Series | virtio-mem: vfio support
On Thu, Sep 24, 2020 at 06:04:20PM +0200, David Hildenbrand wrote:
> Implement support for sparse RAM, to be used by virtio-mem. Handling
> is somewhat-similar to memory_region_is_iommu() handling, which also
> notifies on changes.
>
> Instead of mapping the whole region, we only map selected pieces (and
> unmap previously selected pieces) when notified by the SparseRAMHandler.

It works with vIOMMU too, right? :)  As long as vfio_get_vaddr() works as
expected with virtio-mem plugging new memories, then I think the answer
should be yes.

If it's true, maybe worth mention it somewhere either in the commit message
or in the code comment, because it seems not that obvious.

So if you have plan to do some real IOs in your future tests, may also worth
try with the "-device intel-iommu" and intel_iommu=on in the guest against
the same test.

Thanks,

--
Peter Xu
On 20.10.20 21:44, Peter Xu wrote:
> On Thu, Sep 24, 2020 at 06:04:20PM +0200, David Hildenbrand wrote:
>> Implement support for sparse RAM, to be used by virtio-mem. Handling
>> is somewhat-similar to memory_region_is_iommu() handling, which also
>> notifies on changes.
>>
>> Instead of mapping the whole region, we only map selected pieces (and
>> unmap previously selected pieces) when notified by the SparseRAMHandler.
>
> It works with vIOMMU too, right? :)  As long as vfio_get_vaddr() works as
> expected with virtio-mem plugging new memories, then I think the answer
> should be yes.
>

I haven't tried, but I guess as it's simply mapped into
&address_space_memory, it should work just fine.

> If it's true, maybe worth mention it somewhere either in the commit message
> or in the code comment, because it seems not that obvious.

I will test and add, thanks for the hint!

>
> So if you have plan to do some real IOs in your future tests, may also worth
> try with the "-device intel-iommu" and intel_iommu=on in the guest against
> the same test.

Thanks ... but I have an AMD system. Will try to find out how to get
that running with AMD :)

--
Thanks,

David / dhildenb
On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
> Thanks ... but I have an AMD system. Will try to find out how to get
> that running with AMD :)

May still start with trying intel-iommu first. :)  I think it should work
for amd hosts too.

Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still
during review.

[1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/

--
Peter Xu
On 20.10.20 22:44, Peter Xu wrote:
> On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
>> Thanks ... but I have an AMD system. Will try to find out how to get
>> that running with AMD :)

I just did some more testing with the oldish GPU I have for that purpose.
Seems to work, at least I get video output that keeps on working - did not
try advanced things yet.

I use

-device vfio-pci,host=05:00.0,x-vga=on -device vfio-pci,host=05:00.1

When adding "-device intel-iommu", I got

"qemu-system-x86_64: -device vfio-pci,host=05:00.1: vfio 0000:05:00.1: group 23 used in multiple address spaces"

... so I poked around the internet a bit and got it running with

-device intel-iommu,caching-mode=on \
-device pcie-pci-bridge,addr=1e.0,id=pci.1 \
-device vfio-pci,host=05:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
-device vfio-pci,host=05:00.1,bus=pci.1,addr=1.1 \

Things still seem to be working, so I assume it works (I guess ?!).

--
Thanks,

David / dhildenb
On 20.10.20 22:44, Peter Xu wrote:
> On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
>> Thanks ... but I have an AMD system. Will try to find out how to get
>> that running with AMD :)
>
> May still start with trying intel-iommu first. :)  I think it should work
> for amd hosts too.
>
> Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still
> during review.
>
> [1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/
>

I'm trying to get an iommu setup running (without virtio-mem!),
but it's a big mess.

Essential parts of my QEMU cmdline are:

sudo build/qemu-system-x86_64 \
-accel kvm,kernel-irqchip=split \
...
-device pcie-pci-bridge,addr=1e.0,id=pci.1 \
-device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
-device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \
-device intel-iommu,caching-mode=on,intremap=on \

I am running upstream QEMU + Linux -next kernel inside the
guest on an AMD Ryzen 9 3900X 12-Core Processor.
I am using SeaBIOS.

I tried faking an Intel CPU without luck.
("-cpu Skylake-Client,kvm=off,vendor=GenuineIntel")

As soon as I enable "intel_iommu=on" in my guest kernel, graphics
stop working (random mess on graphics output) and I get
vfio-pci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0023 address=0xff924000 flags=0x0000]
in the hypervisor, along with other nice messages.

I can spot no vfio DMA mappings coming from an iommu, just as if the
guest wouldn't even try to set up the iommu.

I tried with
1. AMD Radeon RX Vega 56
2. Nvidia GT220
resulting in similar issues.

I also tried with "-device amd-iommu", with other issues
(guest won't even boot up). Are my graphics cards missing some support, or
is there a fundamental flaw in my setup?

Any clues appreciated.

--
Thanks,

David / dhildenb
David,

On Wed, Nov 18, 2020 at 02:04:00PM +0100, David Hildenbrand wrote:
> On 20.10.20 22:44, Peter Xu wrote:
> > On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
> >> Thanks ... but I have an AMD system. Will try to find out how to get
> >> that running with AMD :)
> >
> > May still start with trying intel-iommu first. :)  I think it should work
> > for amd hosts too.
> >
> > Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still
> > during review.
> >
> > [1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/
> >
>
> I'm trying to get an iommu setup running (without virtio-mem!),
> but it's a big mess.
>
> Essential parts of my QEMU cmdline are:
>
> sudo build/qemu-system-x86_64 \
> -accel kvm,kernel-irqchip=split \
> ...
> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
> -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \
> -device intel-iommu,caching-mode=on,intremap=on \

The intel-iommu device needs to be created before the rest of the devices.
I forgot the reason behind it; it should be related to how the device
address spaces are created. This rule should apply to all the rest of the
vIOMMUs, afaiu.

Libvirt guarantees that ordering when VT-d is enabled, though when using a
qemu cmdline it is indeed hard to notice at first glance... IIRC we tried to
fix this, but I forgot the details - it's just not trivial.

I noticed that this ordering constraint is also missing in the qemu wiki
page of VT-d, so I updated it there too, hopefully..

https://wiki.qemu.org/Features/VT-d#Command_Line_Example

>
> I am running upstream QEMU + Linux -next kernel inside the
> guest on an AMD Ryzen 9 3900X 12-Core Processor.
> I am using SeaBIOS.
>
> I tried faking an Intel CPU without luck.
> ("-cpu Skylake-Client,kvm=off,vendor=GenuineIntel")
>
> As soon as I enable "intel_iommu=on" in my guest kernel, graphics
> stop working (random mess on graphics output) and I get
> vfio-pci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0023 address=0xff924000 flags=0x0000]
> in the hypervisor, along with other nice messages.
>
> I can spot no vfio DMA mappings coming from an iommu, just as if the
> guest wouldn't even try to set up the iommu.
>
> I tried with
> 1. AMD Radeon RX Vega 56
> 2. Nvidia GT220
> resulting in similar issues.
>
> I also tried with "-device amd-iommu", with other issues
> (guest won't even boot up). Are my graphics cards missing some support, or
> is there a fundamental flaw in my setup?

I guess amd-iommu won't work without Wei Huang's series applied.

>
> Any clues appreciated.

Please try with the above and see whether it works.

Thanks,

--
Peter Xu
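To illustrate the ordering constraint described above, a minimal, untested
reordering of the quoted command line - same devices, with the vIOMMU simply
listed before the vfio-pci devices - would be:

sudo build/qemu-system-x86_64 \
-accel kvm,kernel-irqchip=split \
-device intel-iommu,caching-mode=on,intremap=on \
...
-device pcie-pci-bridge,addr=1e.0,id=pci.1 \
-device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
-device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \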
On 18.11.20 16:23, Peter Xu wrote:
> David,
>
> On Wed, Nov 18, 2020 at 02:04:00PM +0100, David Hildenbrand wrote:
>> On 20.10.20 22:44, Peter Xu wrote:
>>> On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
>>>> Thanks ... but I have an AMD system. Will try to find out how to get
>>>> that running with AMD :)
>>>
>>> May still start with trying intel-iommu first. :)  I think it should work
>>> for amd hosts too.
>>>
>>> Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still
>>> during review.
>>>
>>> [1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/
>>>
>>
>> I'm trying to get an iommu setup running (without virtio-mem!),
>> but it's a big mess.
>>
>> Essential parts of my QEMU cmdline are:
>>
>> sudo build/qemu-system-x86_64 \
>> -accel kvm,kernel-irqchip=split \
>> ...
>> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>> -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
>> -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \
>> -device intel-iommu,caching-mode=on,intremap=on \
>
> The intel-iommu device needs to be created before the rest of the devices.
> I forgot the reason behind it; it should be related to how the device
> address spaces are created. This rule should apply to all the rest of the
> vIOMMUs, afaiu.
>
> Libvirt guarantees that ordering when VT-d is enabled, though when using a
> qemu cmdline it is indeed hard to notice at first glance... IIRC we tried to
> fix this, but I forgot the details - it's just not trivial.
>
> I noticed that this ordering constraint is also missing in the qemu wiki
> page of VT-d, so I updated it there too, hopefully..
>
> https://wiki.qemu.org/Features/VT-d#Command_Line_Example
>

That did the trick! Thanks!!!

virtio-mem + vfio + iommu seems to work. More testing to be done.

However, malicious guests can play nasty tricks like

a) Unplugging plugged virtio-mem blocks while they are mapped via an
   IOMMU

   1. Guest: map memory location X located on a virtio-mem device inside a
      plugged block into the IOMMU
      -> QEMU IOMMU notifier: create vfio DMA mapping
      -> VFIO pins memory of unplugged blocks (populating memory)
   2. Guest: Request to unplug memory location X via virtio-mem device
      -> QEMU virtio-mem: discards the memory.
      -> VFIO still has the memory pinned

We consume more memory than intended. In case virtio-memory would get
replugged and used, we would have an inconsistency. IOMMU device resets
fix it (whereby all VFIO mappings are removed via the IOMMU notifier).


b) Mapping unplugged virtio-mem blocks via an IOMMU

   1. Guest: map memory location X located on a virtio-mem device inside an
      unplugged block
      -> QEMU IOMMU notifier: create vfio DMA mapping
      -> VFIO pins memory of unplugged blocks (populating memory)

Memory that's supposed to be discarded now consumes memory. This is similar
to a malicious guest simply writing to unplugged memory blocks (to be
tackled with "protection of unplugged memory" in the future) - however
memory will also get pinned.


To prohibit b) from happening, we would have to disallow creating the VFIO
mapping (fairly easy).

To prohibit a), there would have to be some notification to IOMMU
implementations to unmap/refresh whenever an IOMMU entry still points at
memory that is getting discarded (and the VM is doing something it's not
supposed to do).

>> As soon as I enable "intel_iommu=on" in my guest kernel, graphics
>> stop working (random mess on graphics output) and I get
>> vfio-pci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0023 address=0xff924000 flags=0x0000]
>> in the hypervisor, along with other nice messages.
>>
>> I can spot no vfio DMA mappings coming from an iommu, just as if the
>> guest wouldn't even try to set up the iommu.
>>
>> I tried with
>> 1. AMD Radeon RX Vega 56
>> 2. Nvidia GT220
>> resulting in similar issues.
>>
>> I also tried with "-device amd-iommu", with other issues
>> (guest won't even boot up). Are my graphics cards missing some support, or
>> is there a fundamental flaw in my setup?
>
> I guess amd-iommu won't work without Wei Huang's series applied.

Oh, okay - I spotted it in QEMU and thought this was already working :)

--
Thanks,

David / dhildenb
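To make the "prohibit a)" idea above a bit more concrete, here is a rough C
sketch of a discard-time notification that would let VFIO drop stale
mappings. It is not part of the posted series; VFIODiscardNotifier,
VFIODMAMapping, the per-container iommu_mappings list and
vfio_ram_discarded() are made-up names, and today's VFIO code does not track
individual IOMMU-created mappings like this - only vfio_dma_unmap() and the
QLIST macros are existing QEMU interfaces.

/* Hypothetical: called by virtio-mem when it discards a range of its MR. */
static void vfio_ram_discarded(VFIODiscardNotifier *n, const MemoryRegion *mr,
                               uint64_t mr_offset, uint64_t size)
{
    VFIOContainer *container = n->container;
    VFIODMAMapping *mapping, *next_mapping;

    /*
     * Walk the (hypothetical) record of IOMMU-created mappings and unmap
     * anything that still targets the discarded range. A sane guest never
     * triggers this path; it only matters for malicious guests.
     */
    QLIST_FOREACH_SAFE(mapping, &container->iommu_mappings, next, next_mapping) {
        if (mapping->mr != mr ||
            mapping->mr_offset + mapping->size <= mr_offset ||
            mapping->mr_offset >= mr_offset + size) {
            continue;
        }
        vfio_dma_unmap(container, mapping->iova, mapping->size);
        QLIST_REMOVE(mapping, next);
        g_free(mapping);
    }
}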
On Wed, Nov 18, 2020 at 05:14:22PM +0100, David Hildenbrand wrote:
> That did the trick! Thanks!!!

Great!

In the meantime, I've a few questions majorly about memory unplugging below,
which could be naive - I know little on that, please bear with me.. :)

>
> virtio-mem + vfio + iommu seems to work. More testing to be done.
>
> However, malicious guests can play nasty tricks like
>
> a) Unplugging plugged virtio-mem blocks while they are mapped via an
>    IOMMU
>
>    1. Guest: map memory location X located on a virtio-mem device inside a
>       plugged block into the IOMMU
>       -> QEMU IOMMU notifier: create vfio DMA mapping
>       -> VFIO pins memory of unplugged blocks (populating memory)
>    2. Guest: Request to unplug memory location X via virtio-mem device
>       -> QEMU virtio-mem: discards the memory.
>       -> VFIO still has the memory pinned

When unplugging some memory, does the user need to first do something to
notify the guest kernel that "this memory is going to be unplugged soon"
(assuming echo "offline" to some dev file)? Then the kernel should be
responsible to prepare for that before it really happens, e.g., migrate
anonymous pages out from this memory block. I don't know what would happen
if some pages on the memblock were used for DMA like this and we want to
unplug it. Ideally I thought it should fail the "echo offline" operation
with something like EBUSY if it can't notify the device driver about this,
or it's hard to.

IMHO this question is not really related to vIOMMU, but a general question
for unplugging. Say, what would happen if we unplug some memory with DMA
buffers without vIOMMU at all? The buffer will be invalid right after
unplugging, so the guest kernel should either fail the operation trying to
unplug, or at least tell the device drivers about this somehow?

Thanks,

>
> We consume more memory than intended. In case virtio-memory would get
> replugged and used, we would have an inconsistency. IOMMU device resets
> fix it (whereby all VFIO mappings are removed via the IOMMU notifier).
>
>
> b) Mapping unplugged virtio-mem blocks via an IOMMU
>
>    1. Guest: map memory location X located on a virtio-mem device inside an
>       unplugged block
>       -> QEMU IOMMU notifier: create vfio DMA mapping
>       -> VFIO pins memory of unplugged blocks (populating memory)

For this case, I would expect vfio_get_xlat_addr() to fail directly if the
guest driver forces mapping of some IOVA onto an invalid range of the
virtio-mem device. Even before that, since the guest should know that this
region of virtio-mem is not valid since unplugged, shouldn't the guest
kernel directly fail the dma_map() upon such a region even before the
mapping message reaches QEMU?

Thanks,

>
> Memory that's supposed to be discarded now consumes memory. This is similar
> to a malicious guest simply writing to unplugged memory blocks (to be
> tackled with "protection of unplugged memory" in the future) - however
> memory will also get pinned.
>
>
> To prohibit b) from happening, we would have to disallow creating the VFIO
> mapping (fairly easy).
>
> To prohibit a), there would have to be some notification to IOMMU
> implementations to unmap/refresh whenever an IOMMU entry still points at
> memory that is getting discarded (and the VM is doing something it's not
> supposed to do).

--
Peter Xu
>> virtio-mem + vfio + iommu seems to work. More testing to be done.
>>
>> However, malicious guests can play nasty tricks like
>>
>> a) Unplugging plugged virtio-mem blocks while they are mapped via an
>>    IOMMU
>>
>>    1. Guest: map memory location X located on a virtio-mem device inside a
>>       plugged block into the IOMMU
>>       -> QEMU IOMMU notifier: create vfio DMA mapping
>>       -> VFIO pins memory of unplugged blocks (populating memory)
>>    2. Guest: Request to unplug memory location X via virtio-mem device
>>       -> QEMU virtio-mem: discards the memory.
>>       -> VFIO still has the memory pinned
>
> When unplugging some memory, does the user need to first do something to
> notify the guest kernel that "this memory is going to be unplugged soon"
> (assuming echo "offline" to some dev file)? Then the kernel should be
> responsible to prepare for that before it really happens, e.g., migrate
> anonymous pages out from this memory block. I don't know what would happen
> if some pages on the memblock were used for DMA like this and we want to
> unplug it. Ideally I thought it should fail the "echo offline" operation
> with something like EBUSY if it can't notify the device driver about this,
> or it's hard to.

In the very simple case (without resizable RAMBlocks/allocations/memory
regions) as implemented right now, a virtio-mem device really just consists
of a static RAM memory region that's mapped into guest physical address
space. The size of that region corresponds to the "maximum" size a
virtio-mem device can provide.

How much memory the VM should consume via such a device is expressed via a
"requested size". So when the hypervisor wants the VM to consume less/more
memory, it adjusts the "requested size". It is up to the device driver in
the guest to plug/unplug memory blocks (e.g., 4 MiB granularity), in order
to reach the requested size. The device driver selects memory blocks within
the device-assigned memory region and requests the hypervisor to (un)plug
them - think of it as something similar (but different) to memory
ballooning.

When requested to unplug memory by the hypervisor, the device driver in
Linux will try to find memory blocks (e.g., 4 MiB) within the
device-assigned memory region it can free up. This involves migrating pages
away etc. Once that succeeded - nobody in the guest is using that memory
anymore - the guest requests the hypervisor to unplug that block, resulting
in QEMU discarding the memory. The guest agreed to not touch that memory
anymore before officially requesting to "plug" it via the virtio-mem device.
There is no further action inside the guest required.

A sane guest will never request to unplug memory that is still in use
(similar to memory ballooning, where we don't inflate memory that is still
in use). But of course, a malicious guest could try doing that just to cause
trouble.

>
> IMHO this question is not really related to vIOMMU, but a general question
> for unplugging. Say, what would happen if we unplug some memory with DMA
> buffers without vIOMMU at all? The buffer will be invalid right after
> unplugging, so the guest kernel should either fail the operation trying to
> unplug, or at least tell the device drivers about this somehow?

A sane guest will never do that. The way memory is removed from Linux makes
sure that there are no remaining users, otherwise it would be a BUG.

>
>>
>> We consume more memory than intended. In case virtio-memory would get
>> replugged and used, we would have an inconsistency. IOMMU device resets
>> fix it (whereby all VFIO mappings are removed via the IOMMU notifier).
>>
>>
>> b) Mapping unplugged virtio-mem blocks via an IOMMU
>>
>>    1. Guest: map memory location X located on a virtio-mem device inside an
>>       unplugged block
>>       -> QEMU IOMMU notifier: create vfio DMA mapping
>>       -> VFIO pins memory of unplugged blocks (populating memory)
>
> For this case, I would expect vfio_get_xlat_addr() to fail directly if the
> guest driver forces mapping of some IOVA onto an invalid range of the
> virtio-mem device. Even before that, since the guest should know that this
> region of virtio-mem is not valid since unplugged, shouldn't the guest
> kernel directly fail the dma_map() upon such a region even before the
> mapping message reaches QEMU?

Again, sane guests will never do that, for the very reason you mentioned:
"the guest should know that this region of virtio-mem is not valid since
unplugged". But a malicious guest could try doing that to cause trouble :)

The memory region managed by a virtio-mem device is always fully mapped into
the system address space: one reason being that fragmenting it in 2 MiB
granularity or similar would not be feasible (e.g., KVM memory slots limit,
migration, ...), but there are other reasons. (Again, similar to how memory
ballooning works.)

vfio_get_xlat_addr() only checks if that mapping exists. It would be easy to
ask the virtio-mem device (similar as done in this prototype) if that part
of the identified memory region may be mapped by VFIO right now.

--
Thanks,

David / dhildenb
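As a rough illustration of "ask the virtio-mem device": a hypothetical
helper (not from the posted series; the helper name and the bitmap /
block_size fields are assumptions about the virtio-mem device state) that
reports whether a range of the device-managed region is currently plugged:

static bool virtio_mem_is_range_plugged(const VirtIOMEM *vmem,
                                        uint64_t mr_offset, uint64_t size)
{
    const uint64_t first_block = mr_offset / vmem->block_size;
    const uint64_t last_block = (mr_offset + size - 1) / vmem->block_size;
    uint64_t block;

    /* Assumes one bitmap bit per (un)pluggable device block. */
    for (block = first_block; block <= last_block; block++) {
        if (!test_bit(block, vmem->bitmap)) {
            return false;
        }
    }
    return true;
}

VFIO (or the vfio_get_xlat_addr() path) could then simply refuse to create a
DMA mapping whenever such a check returns false for the range the guest
tries to map.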
On Wed, Nov 18, 2020 at 06:37:42PM +0100, David Hildenbrand wrote:
> > > a) Unplugging plugged virtio-mem blocks while they are mapped via an
> > >    IOMMU
> > >
> > >    1. Guest: map memory location X located on a virtio-mem device inside a
> > >       plugged block into the IOMMU
> > >       -> QEMU IOMMU notifier: create vfio DMA mapping
> > >       -> VFIO pins memory of unplugged blocks (populating memory)
> > >    2. Guest: Request to unplug memory location X via virtio-mem device
> > >       -> QEMU virtio-mem: discards the memory.
> > >       -> VFIO still has the memory pinned

[...]

> > > b) Mapping unplugged virtio-mem blocks via an IOMMU
> > >
> > >    1. Guest: map memory location X located on a virtio-mem device inside an
> > >       unplugged block
> > >       -> QEMU IOMMU notifier: create vfio DMA mapping
> > >       -> VFIO pins memory of unplugged blocks (populating memory)

[...]

> Again, sane guests will never do that, for the very reason you mentioned:
> "the guest should know that this region of virtio-mem is not valid since
> unplugged". But a malicious guest could try doing that to cause trouble :)

Oh I think I see your point now. :)  And thanks for the write-up about how
virtio-mem works.

So it's about the malicious guests.

I agree with you that we can try to limit the above from happening, e.g. by
teaching vfio_get_xlat_addr() to fail when it's going to map some unplugged
range of a virtio-mem device.

Or... IMHO, we may not even need to worry too much about those misuses of
virtio-mem, as long as the issue is self-contained within the buggy
VM/process. E.g., the worst case of such a malicious guest is to fiddle
around within the maximum allowed memory size granted to the virtio-mem
device to either have pages incorrectly pinned, or some strange IOVA mapped
to unplugged pages within that range. As long as it won't affect other VMs
and the host, and qemu won't crash with that, then it seems ok.

--
Peter Xu
> So it's about the malicious guests.
>
> I agree with you that we can try to limit the above from happening, e.g. by
> teaching vfio_get_xlat_addr() to fail when it's going to map some unplugged
> range of a virtio-mem device.

Exactly.

>
> Or... IMHO, we may not even need to worry too much about those misuses of
> virtio-mem, as long as the issue is self-contained within the buggy
> VM/process. E.g., the worst case of such a malicious guest is to fiddle
> around within the maximum allowed memory size granted to the virtio-mem
> device to either have pages incorrectly pinned, or some strange IOVA mapped
> to unplugged pages within that range. As long as it won't affect other VMs
> and the host, and qemu won't crash with that, then it seems ok.

Indeed, I have the same thoughts. The only "ugly" thing is that our VM might
not only consume more memory than intended, but also pin that memory
(unmovable, unswappable ...). So I'm thinking about at least doing a
warn_report_once() until we have some approach to handle that - for
environments that might care about this.

Thanks for looking into this!

--
Thanks,

David / dhildenb
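A minimal sketch of what such a check plus warn_report_once() could look
like in the translation path (simplified and hypothetical: the surrounding
IOVA lookup is omitted, memory_region_is_sparse_ram() comes from the patch
below, virtio_mem_is_range_plugged() is the made-up helper sketched earlier,
and only warn_report_once() from "qemu/error-report.h" is an existing QEMU
interface):

static bool vfio_check_sparse_ram_mapping(MemoryRegion *mr, VirtIOMEM *vmem,
                                          uint64_t mr_offset, uint64_t len)
{
    if (memory_region_is_sparse_ram(mr) &&
        !virtio_mem_is_range_plugged(vmem, mr_offset, len)) {
        /* Reject the mapping, but leave a trace for environments that care. */
        warn_report_once("guest tried to map an unplugged virtio-mem range "
                         "for DMA; rejecting the VFIO mapping");
        return false;
    }
    return true;
}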
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 13471ae294..a3aaf70dd8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -37,6 +37,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "qemu/units.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
 
@@ -498,6 +499,143 @@ out:
     rcu_read_unlock();
 }
 
+static int vfio_sparse_ram_notify(SparseRAMNotifier *n, const MemoryRegion *mr,
+                                  uint64_t mr_offset, uint64_t size,
+                                  bool map)
+{
+    VFIOSparseRAM *sram = container_of(n, VFIOSparseRAM, notifier);
+    const hwaddr mr_start = MAX(mr_offset, sram->offset_within_region);
+    const hwaddr mr_end = MIN(mr_offset + size,
+                              sram->offset_within_region + sram->size);
+    const hwaddr iova_start = mr_start + sram->offset_within_address_space;
+    const hwaddr iova_end = mr_end + sram->offset_within_address_space;
+    hwaddr mr_cur, iova_cur, mr_next;
+    void *vaddr;
+    int ret, ret2;
+
+    g_assert(mr == sram->mr);
+
+    /* We get notified about everything, ignore range we don't care about. */
+    if (mr_start >= mr_end) {
+        return 0;
+    }
+
+    /* Unmap everything with a single call. */
+    if (!map) {
+        ret = vfio_dma_unmap(sram->container, iova_start,
+                             iova_end - iova_start);
+        if (ret) {
+            error_report("%s: vfio_dma_unmap() failed: %s", __func__,
+                         strerror(-ret));
+        }
+        return 0;
+    }
+
+    /* TODO: fail early if we would exceed a specified number of mappings. */
+
+    /* Map in (aligned within MR) granularity, so we can unmap later. */
+    for (mr_cur = mr_start; mr_cur < mr_end; mr_cur = mr_next) {
+        iova_cur = mr_cur + sram->offset_within_address_space;
+        mr_next = QEMU_ALIGN_UP(mr_cur + 1, sram->granularity);
+        mr_next = MIN(mr_next, mr_end);
+
+        vaddr = memory_region_get_ram_ptr(sram->mr) + mr_cur;
+        ret = vfio_dma_map(sram->container, iova_cur, mr_next - mr_cur,
+                           vaddr, mr->readonly);
+        if (ret) {
+            /* Rollback in case of error. */
+            if (mr_cur != mr_start) {
+                ret2 = vfio_dma_unmap(sram->container, iova_start,
+                                      iova_end - iova_start);
+                if (ret2) {
+                    error_report("%s: vfio_dma_unmap() failed: %s", __func__,
+                                 strerror(-ret));
+                }
+            }
+            return ret;
+        }
+    }
+    return 0;
+}
+
+static int vfio_sparse_ram_notify_map(SparseRAMNotifier *n,
+                                      const MemoryRegion *mr,
+                                      uint64_t mr_offset, uint64_t size)
+{
+    return vfio_sparse_ram_notify(n, mr, mr_offset, size, true);
+}
+
+static void vfio_sparse_ram_notify_unmap(SparseRAMNotifier *n,
+                                         const MemoryRegion *mr,
+                                         uint64_t mr_offset, uint64_t size)
+{
+    vfio_sparse_ram_notify(n, mr, mr_offset, size, false);
+}
+
+static void vfio_register_sparse_ram(VFIOContainer *container,
+                                     MemoryRegionSection *section)
+{
+    VFIOSparseRAM *sram;
+    int ret;
+
+    sram = g_new0(VFIOSparseRAM, 1);
+    sram->container = container;
+    sram->mr = section->mr;
+    sram->offset_within_region = section->offset_within_region;
+    sram->offset_within_address_space = section->offset_within_address_space;
+    sram->size = int128_get64(section->size);
+    sram->granularity = memory_region_sparse_ram_get_granularity(section->mr);
+
+    /*
+     * TODO: We usually want a bigger granularity (for a lot of added memory,
+     * as we need quite a lot of mappings) - however, this has to be configured
+     * by the user.
+     */
+    g_assert(sram->granularity >= 1 * MiB &&
+             is_power_of_2(sram->granularity));
+
+    /* Register the notifier */
+    sparse_ram_notifier_init(&sram->notifier, vfio_sparse_ram_notify_map,
+                             vfio_sparse_ram_notify_unmap);
+    memory_region_register_sparse_ram_notifier(section->mr, &sram->notifier);
+    QLIST_INSERT_HEAD(&container->sram_list, sram, next);
+    /*
+     * Replay mapped blocks - if anything goes wrong (only when hotplugging
+     * vfio devices), report the error for now.
+     *
+     * TODO: Can we catch this earlier?
+     */
+    ret = memory_region_sparse_ram_replay_mapped(section->mr, &sram->notifier);
+    if (ret) {
+        error_report("%s: failed to replay mappings: %s", __func__,
+                     strerror(-ret));
+    }
+}
+
+static void vfio_unregister_sparse_ram(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    VFIOSparseRAM *sram = NULL;
+
+    QLIST_FOREACH(sram, &container->sram_list, next) {
+        if (sram->mr == section->mr &&
+            sram->offset_within_region == section->offset_within_region &&
+            sram->offset_within_address_space ==
+            section->offset_within_address_space) {
+            break;
+        }
+    }
+
+    if (!sram) {
+        hw_error("vfio: Trying to unregister non-existent sparse RAM");
+    }
+
+    memory_region_unregister_sparse_ram_notifier(section->mr, &sram->notifier);
+    QLIST_REMOVE(sram, next);
+    g_free(sram);
+    /* The caller is expected to vfio_dma_unmap(). */
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -650,6 +788,15 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
     /* Here we assume that memory_region_is_ram(section->mr)==true */
 
+    /*
+     * For sparse RAM, we only want to register the actually mapped
+     * pieces - and update the mapping whenever we're notified about changes.
+     */
+    if (memory_region_is_sparse_ram(section->mr)) {
+        vfio_register_sparse_ram(container, section);
+        return;
+    }
+
     vaddr = memory_region_get_ram_ptr(section->mr) +
             section->offset_within_region +
             (iova - section->offset_within_address_space);
@@ -786,6 +933,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
 
         pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
         try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
+    } else if (memory_region_is_sparse_ram(section->mr)) {
+        vfio_unregister_sparse_ram(container, section);
+        /*
+         * We rely on a single vfio_dma_unmap() call below to clean the whole
+         * region.
+         */
+        try_unmap = true;
     }
 
     if (try_unmap) {
@@ -1275,6 +1429,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container->error = NULL;
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->sram_list);
 
     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78f3ff559..dfa18dbd8e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -77,6 +77,7 @@ typedef struct VFIOContainer {
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
+    QLIST_HEAD(, VFIOSparseRAM) sram_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
 
@@ -88,6 +89,17 @@ typedef struct VFIOGuestIOMMU {
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
+typedef struct VFIOSparseRAM {
+    VFIOContainer *container;
+    MemoryRegion *mr;
+    hwaddr offset_within_region;
+    hwaddr offset_within_address_space;
+    hwaddr size;
+    uint64_t granularity;
+    SparseRAMNotifier notifier;
+    QLIST_ENTRY(VFIOSparseRAM) next;
+} VFIOSparseRAM;
+
 typedef struct VFIOHostDMAWindow {
     hwaddr min_iova;
     hwaddr max_iova;
Implement support for sparse RAM, to be used by virtio-mem. Handling
is somewhat-similar to memory_region_is_iommu() handling, which also
notifies on changes.

Instead of mapping the whole region, we only map selected pieces (and
unmap previously selected pieces) when notified by the SparseRAMHandler.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/vfio/common.c              | 155 ++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  12 +++
 2 files changed, 167 insertions(+)