Message ID | 20240812170014.1583783-1-alex.williamson@redhat.com |
---|---|
State | New |
Series | [RFC/RFT] vfio/pci: Create feature to disable MSI virtualization |
On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> vfio-pci has always virtualized the MSI address and data registers as
> MSI programming is performed through the SET_IRQS ioctl. Often this
> virtualization is not used, and in specific cases can be unhelpful.
>
> One such case where the virtualization is a hindrance is when the
> device contains an onboard interrupt controller programmed by the guest
> driver. Userspace VMMs have a chance to quirk this programming,
> injecting the host physical MSI information, but only if the userspace
> driver can get access to the host physical address and data registers.
>
> This introduces a device feature which allows the userspace driver to
> disable virtualization of the MSI capability address and data registers
> in order to provide read-only access to the physical values.

Personally, I very much dislike this. Encouraging such hacky driver use of the interrupt subsystem is not a good direction. Enabling this in VMs will further complicate fixing the IRQ usages in these drivers over the long run.

If the device has its own interrupt sources then the device needs to create an irq_chip and related and hook them up properly. Not hackily read the MSI-X registers and write them someplace else.

Thomas Gleixner has done a lot of great work recently to clean this up.

So if you imagine the driver is fixed, then this is not necessary.

However, it will still not work in a VM. Making IMS and non-MSI interrupt controllers work within VMs is still something that needs to be done.

Jason

On Tue, Aug 13 2024 at 13:30, Jason Gunthorpe wrote:
> On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
>> vfio-pci has always virtualized the MSI address and data registers as
>> MSI programming is performed through the SET_IRQS ioctl. Often this
>> virtualization is not used, and in specific cases can be unhelpful.
>>
>> One such case where the virtualization is a hindrance is when the
>> device contains an onboard interrupt controller programmed by the guest
>> driver. Userspace VMMs have a chance to quirk this programming,
>> injecting the host physical MSI information, but only if the userspace
>> driver can get access to the host physical address and data registers.
>>
>> This introduces a device feature which allows the userspace driver to
>> disable virtualization of the MSI capability address and data registers
>> in order to provide read-only access to the physical values.
>
> Personally, I very much dislike this. Encouraging such hacky driver
> use of the interrupt subsystem is not a good direction. Enabling this
> in VMs will further complicate fixing the IRQ usages in these drivers
> over the long run.
>
> If the device has its own interrupt sources then the device needs to
> create an irq_chip and related and hook them up properly. Not hackily
> read the MSI-X registers and write them someplace else.
>
> Thomas Gleixner has done a lot of great work recently to clean this up.
>
> So if you imagine the driver is fixed, then this is not necessary.

Yes. I looked at the ath11k driver when I was reworking the PCI/MSI subsystem and that's a perfect candidate for a proper device specific interrupt domain to replace the horrible MSI hackery it has.

> However, it will still not work in a VM. Making IMS and non-MSI
> interrupt controllers work within VMs is still something that needs to
> be done.

Sure, but we really want to do that in a generic way and not based on ad hoc workarounds.

Did the debate around this go anywhere?

Thanks,

tglx

On Tue, 13 Aug 2024 13:30:53 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> > vfio-pci has always virtualized the MSI address and data registers as
> > MSI programming is performed through the SET_IRQS ioctl. Often this
> > virtualization is not used, and in specific cases can be unhelpful.
> >
> > One such case where the virtualization is a hindrance is when the
> > device contains an onboard interrupt controller programmed by the guest
> > driver. Userspace VMMs have a chance to quirk this programming,
> > injecting the host physical MSI information, but only if the userspace
> > driver can get access to the host physical address and data registers.
> >
> > This introduces a device feature which allows the userspace driver to
> > disable virtualization of the MSI capability address and data registers
> > in order to provide read-only access to the physical values.
>
> Personally, I very much dislike this. Encouraging such hacky driver
> use of the interrupt subsystem is not a good direction. Enabling this
> in VMs will further complicate fixing the IRQ usages in these drivers
> over the long run.

Clearly these _guest_ drivers are doing this regardless of the interfaces provided by vfio, so I don't see how we're encouraging hacky driver behavior, especially when it comes to Windows guest drivers.

> If the device has its own interrupt sources then the device needs to
> create an irq_chip and related and hook them up properly. Not hackily
> read the MSI-X registers and write them someplace else.

This is how the hardware works, regardless of whether the guest driver represents the hardware using an irq_chip.

> Thomas Gleixner has done a lot of great work recently to clean this up.
>
> So if you imagine the driver is fixed, then this is not necessary.

How so? Regardless of the guest driver structure, something is writing the MSI address and data values elsewhere in the device. AFAICT the only way to avoid needing to fix up those values is to give the guest ownership of the address space as you suggested in the other patch. That also seems to have a pile of issues though.

> However, it will still not work in a VM. Making IMS and non-MSI
> interrupt controllers work within VMs is still something that needs to
> be done.

Making it work in a VM is sort of the point here.

Thanks,
Alex

On Tue, Aug 13, 2024 at 03:14:01PM -0600, Alex Williamson wrote:
> > Personally, I very much dislike this. Encouraging such hacky driver
> > use of the interrupt subsystem is not a good direction. Enabling this
> > in VMs will further complicate fixing the IRQ usages in these drivers
> > over the long run.
>
> Clearly these _guest_ drivers are doing this regardless of the
> interfaces provided by vfio, so I don't see how we're encouraging hacky
> driver behavior, especially when it comes to Windows guest drivers.

Because people will then say the Linux driver can't be fixed to properly use an irq_domain/etc as the only option that works in VMs will be the hacky copy from MSI-X approach :\

> > Thomas Gleixner has done a lot of great work recently to clean this up.
> >
> > So if you imagine the driver is fixed, then this is not necessary.
>
> How so?

Because if the driver is properly using the new irq_domain/etc infrastructure to model its additional interrupt source then this patch won't make it work in the VM anyhow, so it is not necessary.

Your other patch would be the only short term answer.

Jason

On Tue, Aug 13, 2024 at 07:30:41PM +0200, Thomas Gleixner wrote:
> > However, it will still not work in a VM. Making IMS and non-MSI
> > interrupt controllers work within VMs is still something that needs to
> > be done.
>
> Sure, but we really want to do that in a generic way and not based on ad
> hoc workarounds.
>
> Did the debate around this go anywhere?

No, it got stuck on the impossible situation that there is no existing way for the VM to have any idea if IMS will work or is broken.

Recall Intel was planning to "solve" this by sticking a DVSEC in their virtual config space that said to turn off IMS :\

So using IMS in the real world looked impractical and interest faded a bit. But the underlying reasons for IMS haven't gone away and more work is coming that will bring it up again...

Jason

On Tue, 13 Aug 2024 20:16:42 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Aug 13, 2024 at 03:14:01PM -0600, Alex Williamson wrote:
>
> > > Personally, I very much dislike this. Encouraging such hacky driver
> > > use of the interrupt subsystem is not a good direction. Enabling this
> > > in VMs will further complicate fixing the IRQ usages in these drivers
> > > over the long run.
> >
> > Clearly these _guest_ drivers are doing this regardless of the
> > interfaces provided by vfio, so I don't see how we're encouraging hacky
> > driver behavior, especially when it comes to Windows guest drivers.
>
> Because people will then say the Linux driver can't be fixed to
> properly use an irq_domain/etc as the only option that works in VMs
> will be the hacky copy from MSI-X approach :\

Ironically QEMU already has direct access to the MSI-X vector table in MMIO space and could implement this type of quirk with no kernel changes. It's MSI that is now blocked by virtualization of the address and data registers.

Note also that QEMU is still virtualizing these registers, the values seen in the guest are unchanged. It's only the VMM that can bypass that virtualization to see the host values.

Let's imagine the guest driver does change to implement an irq_domain. How does that fundamentally change the problem for the VMM that guest MSI values are being written to other portions of the device? The guest driver can have whatever architecture it wants (we don't know the architecture of the Windows driver) but we still need to trap writes of the guest MSI address/data and replace it with host values.

> > > Thomas Gleixner has done a lot of great work recently to clean this up.
> > >
> > > So if you imagine the driver is fixed, then this is not necessary.
> >
> > How so?
>
> Because if the driver is properly using the new irq_domain/etc
> infrastructure to model its additional interrupt source then this
> patch won't make it work in the VM anyhow, so it is not necessary.
>
> Your other patch would be the only short term answer.

The QEMU patch relies on this kernel patch in order to be able to access the host physical MSI address and data values through the vfio interface. Otherwise QEMU has no host values with which to patch up guest values. As noted above, this does not provide any visible change to a QEMU guest, it only enables QEMU to implement the quirk in the other patch.

Thanks,
Alex

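For readers following along, the "trap the guest write and substitute host values" quirk that Alex describes can be sketched roughly as below. This is only an illustration under stated assumptions, not QEMU's actual quirk: `struct msi_msg` and `quirk_translate_msi()` are hypothetical names, a single MSI vector is assumed, and the device register is assumed to take the address/data pair in one step.

```c
#include <stdbool.h>
#include <stdint.h>

/* One MSI message: the address the device writes to and the data it writes. */
struct msi_msg {
	uint64_t addr;
	uint32_t data;
};

/*
 * Hypothetical VMM-side helper, called when the guest driver writes an MSI
 * address/data pair into a device-specific register.
 *
 * guest_cap: the pair the guest sees in the virtualized MSI capability
 * host_cap:  the pair the host actually programmed (visible to the VMM only
 *            if the MSI address/data fields are not virtualized)
 * written:   the pair the guest is trying to write into the device register
 *
 * If the written pair matches the guest-visible one, *out receives the host
 * pair and true is returned. Otherwise false is returned and the caller
 * would pass the write through unchanged.
 */
bool quirk_translate_msi(const struct msi_msg *guest_cap,
			 const struct msi_msg *host_cap,
			 const struct msi_msg *written,
			 struct msi_msg *out)
{
	if (written->addr != guest_cap->addr ||
	    written->data != guest_cap->data)
		return false;

	*out = *host_cap;
	return true;
}
```

The sketch also shows the assumption Jason objects to: it only works while the guest re-uses a pair that already exists in the MSI capability.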
On Wed, Aug 14, 2024 at 08:55:05AM -0600, Alex Williamson wrote:

> Let's imagine the guest driver does change to implement an irq_domain.
> How does that fundamentally change the problem for the VMM that guest
> MSI values are being written to other portions of the device?

If changed to irq_domain the VM will write addr/data pairs into those special registers that are unique to that interrupt source and will not re-use values already set in the MSI table.

This means the VMM doesn't get any value from inspecting the MSI table because the value it needs won't be there, and also that no interrupt routing will have been set up. The VMM must call VFIO_DEVICE_SET_IRQS to set up the unique routing.

These two patches are avoiding VFIO_DEVICE_SET_IRQS based on the assumption that the VM will re-use an addr/data pair already set up in the MSI table. Invalidating that assumption is the fundamental change irq_domain in the VM will make.

> The guest driver can have whatever architecture it wants (we don't
> know the architecture of the Windows driver) but we still need to
> trap writes of the guest MSI address/data and replace it with host
> values.

Yes you do. But the wrinkle is you can't just assume one of the existing MSI entries is a valid replacement and copy from the MSI table. That works right now only because the Linux/Windows driver is re-using an MSI vector in the IMS registers.

I suggest the general path is something like:

 1) A vfio variant driver sets up an irq_domain for the additional
    interrupt source registers
 2) Somehow wire up VFIO_DEVICE_SET_IRQS so it can target vectors in
    the additional interrupt domain
 3) Have the VMM trap writes to the extra interrupt source registers
    and execute VFIO_DEVICE_SET_IRQS
 4) IRQ layer will set up an appropriate unique IRQ and route it to the
    guest/whatever just like MSI. Callbacks into the variant driver's
    irq_domain will program the HW registers.

Basically exactly the same flow as MSI, except instead of targeting a vector in the PCI core's MSI irq_domain it targets a vector in the variant driver's IMS IRQ domain.

Then we don't make any assumptions about how the VM is using these interrupt vectors, and crucially, SET_IRQS is called for every interrupt source and we rely on the kernel to produce the correct addr/data pair. No need for copying addr/data pairs from MSI tables.

> As noted above, this does not provide any visible change to a QEMU
> guest, it only enables QEMU to implement the quirk in the other
> patch.

I see, I definitely didn't understand that it only reaches qemu from the commit message.

Jason

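As a reference point for "the same flow as MSI", here is a minimal sketch of how userspace asks VFIO to route one existing MSI vector to an eventfd with VFIO_DEVICE_SET_IRQS. The structure, flags and `VFIO_PCI_MSI_IRQ_INDEX` come from the existing `<linux/vfio.h>` uAPI; the helper names `build_msi_trigger()` and `set_msi_trigger()` are made up for this example. An IRQ index for a variant driver's extra interrupt domain (step 2 above) does not exist today, so it is not shown.

```c
#include <linux/vfio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Build a VFIO_DEVICE_SET_IRQS request that attaches one eventfd to one
 * vector of the device's MSI index. Returns NULL on allocation failure;
 * the caller frees the result.
 */
struct vfio_irq_set *build_msi_trigger(uint32_t vector, int32_t eventfd)
{
	size_t argsz = sizeof(struct vfio_irq_set) + sizeof(eventfd);
	struct vfio_irq_set *set = calloc(1, argsz);

	if (!set)
		return NULL;

	set->argsz = argsz;
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = VFIO_PCI_MSI_IRQ_INDEX;
	set->start = vector;
	set->count = 1;
	/* The payload is an array of eventfds, one per vector in the range. */
	memcpy(set->data, &eventfd, sizeof(eventfd));
	return set;
}

/* Submit the request on an open VFIO device fd; returns ioctl()'s result. */
int set_msi_trigger(int device_fd, uint32_t vector, int32_t eventfd)
{
	struct vfio_irq_set *set = build_msi_trigger(vector, eventfd);
	int ret;

	if (!set)
		return -1;

	ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
	free(set);
	return ret;
}
```

In this flow the kernel, not the VMM, composes the physical addr/data pair, which is the property Jason wants to keep for additional interrupt sources.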
On Tue, 2024-08-13 at 19:30 +0200, Thomas Gleixner wrote:
> On Tue, Aug 13 2024 at 13:30, Jason Gunthorpe wrote:
> > On Mon, Aug 12, 2024 at 10:59:12AM -0600, Alex Williamson wrote:
> > > vfio-pci has always virtualized the MSI address and data registers as
> > > MSI programming is performed through the SET_IRQS ioctl. Often this
> > > virtualization is not used, and in specific cases can be unhelpful.
> > >
> > > One such case where the virtualization is a hindrance is when the
> > > device contains an onboard interrupt controller programmed by the guest
> > > driver. Userspace VMMs have a chance to quirk this programming,
> > > injecting the host physical MSI information, but only if the userspace
> > > driver can get access to the host physical address and data registers.
> > >
> > > This introduces a device feature which allows the userspace driver to
> > > disable virtualization of the MSI capability address and data registers
> > > in order to provide read-only access to the physical values.
> >
> > Personally, I very much dislike this. Encouraging such hacky driver
> > use of the interrupt subsystem is not a good direction. Enabling this
> > in VMs will further complicate fixing the IRQ usages in these drivers
> > over the long run.
> >
> > If the device has its own interrupt sources then the device needs to
> > create an irq_chip and related and hook them up properly. Not hackily
> > read the MSI-X registers and write them someplace else.
> >
> > Thomas Gleixner has done a lot of great work recently to clean this up.
> >
> > So if you imagine the driver is fixed, then this is not necessary.
>
> Yes. I looked at the ath11k driver when I was reworking the PCI/MSI
> subsystem and that's a perfect candidate for a proper device specific
> interrupt domain to replace the horrible MSI hackery it has.

The ath11k hacks may be awful, but in their defence, that's because the whole way the hardware works is awful.

Q: With PCI passthrough to a guest, how does the guest OS tell the device where to do DMA?

A: The guest OS just hands the device a guest physical address and the IOMMU does the rest. Nothing 'intercedes' between the guest and the device to mess with that address.

Q: MSIs are just DMA. So with PCI passthrough to a guest, how does the guest OS configure the device's MSIs?

<fantasy>

A: The guest OS just hands the device a standard MSI message encoding the target guest APIC ID and vector (etc.), and the IOMMU does the rest. Nothing 'intercedes' between the guest and the device to mess with that MSI message.

And thus ath11k didn't need to do *any* hacks to work around a stupid hardware design with the VMM snooping on stuff it ideally shouldn't have had any business touching in the first place.

Posted interrupts are almost the *default* because the IOMMU receives a <source-id, vCPU APIC ID, vector> tuple on the bus. If receiving an interrupt for a vCPU which isn't currently running, that's when the IOMMU sets a bit in a table somewhere and notifies the host OS.

All that special case MSI handling and routing code that I had nightmares about because it fell through a wormhole from a parallel universe, doesn't exist.

And look, DPDK drivers which run in polling mode and 'abuse' MSIs by using real memory addresses and asking the device to "write <these> 32 bits to <this> structure if you want attention" just work nicely in virtual machines too, just as they do on real hardware.

</fantasy>

/me wakes up...

Shit.

And we have to enable this Interrupt Remapping crap even to address more than 255 CPUs *without* virtualization? Even a *guest* has to see a virtual IOMMU and enable Interrupt Remapping to be able to use more than 255 vCPUs? Even though there were a metric shitload of spare bits in the MSI message we could have used¹.

Wait, so that means we have to offer an IOMMU with *DMA* remapping to guests, which means 2-stage translations and/or massive overhead, just for that guest to be able to use >255 vCPUs?

Screw you all, I'm going back to bed.

¹ And *should* use, if we ever do something similar like, say, expand the vector# space past 8 bits. Intel and AMD take note.

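For context on the 255-CPU remark, the sketch below composes an x86 MSI message in the legacy "compatibility" format, assuming physical destination mode, fixed delivery mode and edge triggering. The destination APIC ID occupies only address bits 19:12, which is where the 8-bit limit comes from. The function names are illustrative only and do not correspond to kernel helpers.

```c
#include <stdint.h>

/* Fixed upper bits of every x86 MSI address in compatibility format. */
#define MSI_ADDR_BASE		0xFEE00000u
/* The destination APIC ID lives in address bits 19:12 (8 bits wide). */
#define MSI_ADDR_DEST_SHIFT	12

/* Compose the 32-bit MSI address for a given 8-bit destination APIC ID. */
uint32_t msi_compat_addr(uint8_t apic_id)
{
	return MSI_ADDR_BASE | ((uint32_t)apic_id << MSI_ADDR_DEST_SHIFT);
}

/*
 * Compose the MSI data word: the vector sits in bits 7:0. Delivery mode
 * (bits 10:8) and trigger mode (bit 15) are left at zero, meaning fixed
 * delivery and edge triggering.
 */
uint32_t msi_compat_data(uint8_t vector)
{
	return vector;
}
```

With only 8 destination bits in this format, reaching higher APIC IDs requires something else to supply them, which is the role Interrupt Remapping plays in David's complaint.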
On Fri, Dec 13, 2024 at 09:10:30AM +0000, David Woodhouse wrote:

> <fantasy>
> A: The guest OS just hands the device a standard MSI message encoding
> the target guest APIC ID and vector (etc.), and the IOMMU does the
> rest. Nothing 'intercedes' between the guest and the device to mess
> with that MSI message.

> /me wakes up...

Well, I share your dream at least. :\

Have the VMM shadow the virtual interrupt remapping tables and assign it to the physical remapping so that the physical addr/data pair doesn't change. Driving interrupt routing fully via the remapping HW and not via MSI interception.

IIRC Alex had a patch series for qemu to capture and rewrite the ath non-standard MSI locations, so virtualization worries should not block moving ath to use the device-specific MSI.

Jason

On Fri, 2025-01-03 at 10:31 -0400, Jason Gunthorpe wrote:
>
> IIRC Alex had a patch series for qemu to capture and rewrite the ath
> non-standard MSI locations, so virtualization worries should not block
> moving ath to use the device-specific MSI..

In the absence of the IOMMU in our shared fantasy, this seems like the best approach.

Even for passed-through PCI hardware, the config space is largely a fiction. As are the MSI(-X) vectors. So although it's a device-specific quirk, this seems like the best approach.

Probably best for the hypervisor to have some way to *advertise* that it's handling this though, as guests also might want to run on hypervisors which don't.

On Fri, Jan 03, 2025 at 02:47:11PM +0000, David Woodhouse wrote:

> Probably best for the hypervisor to have some way to *advertise* that
> it's handling this though, as guests also might want to run on
> hypervisors which don't.

If the hypervisor doesn't properly virtualize the device it shouldn't assign it to a VM to start with :\

Intel looked at the question of advertising clean interrupt remapping when trying to virtualize IMS and it didn't seem so great.

Bare metal machines need to work, so any test they could think of adding would either fail on bare metal or fail on existing VMs.

VMMs have taken the approach of not telling the guest they are in VMs and then also not implementing the bare metal HW behaviors with full fidelity. So we have no way to discover that the VMM is, in fact, emulating broken "hw".

Thus we get this push that all kernels need to accommodate the worst VMM behaviors :(

Jason

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 97422aafaa7b..5f86e75ea6ca 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1259,6 +1259,32 @@ static int vfio_msi_cap_len(struct vfio_pci_core_device *vdev, u8 pos)
 	return len;
 }
 
+/* Disable virtualization of the MSI address and data fields */
+int vfio_pci_msi_novirt(struct vfio_pci_core_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct perm_bits *perm = vdev->msi_perm;
+	u16 flags;
+	int ret;
+
+	if (!perm)
+		return -EINVAL;
+
+	ret = pci_read_config_word(pdev, pdev->msi_cap + PCI_MSI_FLAGS, &flags);
+	if (ret)
+		return pcibios_err_to_errno(ret);
+
+	p_setd(perm, PCI_MSI_ADDRESS_LO, NO_VIRT, NO_WRITE);
+	if (flags & PCI_MSI_FLAGS_64BIT) {
+		p_setd(perm, PCI_MSI_ADDRESS_HI, NO_VIRT, NO_WRITE);
+		p_setw(perm, PCI_MSI_DATA_64, (u16)NO_VIRT, (u16)NO_WRITE);
+	} else {
+		p_setw(perm, PCI_MSI_DATA_32, (u16)NO_VIRT, (u16)NO_WRITE);
+	}
+
+	return 0;
+}
+
 /* Determine extended capability length for VC (2 & 9) and MFVC */
 static int vfio_vc_cap_len(struct vfio_pci_core_device *vdev, u16 pos)
 {
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ba0ce0075b2f..acdced212be2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1518,6 +1518,24 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 	return 0;
 }
 
+static int vfio_pci_core_feature_msi_novirt(struct vfio_device *device,
+					    u32 flags, void __user *arg,
+					    size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	if (!vdev->msi_perm)
+		return -ENOTTY;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
+	if (ret != 1)
+		return ret;
+
+	return vfio_pci_msi_novirt(vdev);
+}
+
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
@@ -1531,6 +1549,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT:
+		return vfio_pci_core_feature_msi_novirt(device, flags,
+							arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5e4fa69aee16..6e6cc74c6579 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -53,6 +53,7 @@ int vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset,
 
 int vfio_pci_init_perm_bits(void);
 void vfio_pci_uninit_perm_bits(void);
+int vfio_pci_msi_novirt(struct vfio_pci_core_device *vdev);
 
 int vfio_config_init(struct vfio_pci_core_device *vdev);
 void vfio_config_free(struct vfio_pci_core_device *vdev);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf190..ddf5dd9245fb 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1458,6 +1458,20 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * Toggle virtualization of PCI MSI address and data fields off. By default
+ * vfio-pci-core based drivers virtualize the MSI address and data fields of
+ * the MSI capability to emulate direct access to the device, ie. writes are
+ * allowed and buffered where subsequent reads return the buffered data.
+ * VMMs often virtualize these registers anyway and there are cases in user-
+ * space where having access to the host MSI fields can be useful, such as
+ * quirking an embedded interrupt controller on the device to generate physical
+ * MSI interrupts. Upon VFIO_DEVICE_FEATURE_SET of the PCI_MSI_NOVIRT feature
+ * this virtualization is disabled, reads of the MSI address and data fields
+ * will return the physical values and writes are dropped.
+ */
+#define VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT 11
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**

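To make the effect of the `p_setd(perm, ..., NO_VIRT, NO_WRITE)` calls in the patch easier to follow, here is a toy userspace model of a config-space field guarded by a "virt" mask and a "write" mask. It is a simplified reading of how vfio-pci's per-capability permission bits behave, not the kernel code itself: `struct cfg_field`, `cfg_read()` and `cfg_write()` are invented for illustration, and the `ALL_VIRT`/`ALL_WRITE`/`NO_VIRT`/`NO_WRITE` values are redefined locally to mirror the kernel's names.

```c
#include <stdint.h>

#define ALL_VIRT	0xFFFFFFFFu
#define NO_VIRT		0u
#define ALL_WRITE	0xFFFFFFFFu
#define NO_WRITE	0u

/* Toy model of one 32-bit config field and its permission masks. */
struct cfg_field {
	uint32_t virt;		/* bits served from the virtual copy */
	uint32_t write;		/* bits the user is allowed to change */
	uint32_t vconfig;	/* virtual (buffered) value */
	uint32_t phys;		/* value held by the physical device */
};

/* Virtualized bits come from vconfig, all other bits from hardware. */
uint32_t cfg_read(const struct cfg_field *f)
{
	return (f->vconfig & f->virt) | (f->phys & ~f->virt);
}

/*
 * Only writable bits change. Writable virtualized bits land in vconfig,
 * writable non-virtualized bits land in the physical register.
 */
void cfg_write(struct cfg_field *f, uint32_t val)
{
	uint32_t vmask = f->write & f->virt;
	uint32_t pmask = f->write & ~f->virt;

	f->vconfig = (f->vconfig & ~vmask) | (val & vmask);
	f->phys = (f->phys & ~pmask) | (val & pmask);
}
```

In this model, a field set to `ALL_VIRT`/`ALL_WRITE` buffers writes and returns them on read, while a field switched to `NO_VIRT`/`NO_WRITE` returns the physical value and silently drops writes, matching the behavior described in the uAPI comment.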
vfio-pci has always virtualized the MSI address and data registers as MSI programming is performed through the SET_IRQS ioctl. Often this virtualization is not used, and in specific cases can be unhelpful.

One such case where the virtualization is a hindrance is when the device contains an onboard interrupt controller programmed by the guest driver. Userspace VMMs have a chance to quirk this programming, injecting the host physical MSI information, but only if the userspace driver can get access to the host physical address and data registers.

This introduces a device feature which allows the userspace driver to disable virtualization of the MSI capability address and data registers in order to provide read-only access to the physical values.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216055
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 26 ++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c   | 21 +++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  1 +
 include/uapi/linux/vfio.h          | 14 ++++++++++++++
 4 files changed, 62 insertions(+)
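
A userspace driver would presumably reach the proposed feature through the existing VFIO_DEVICE_FEATURE ioctl. The sketch below shows one plausible way to do that, assuming the feature number 11 from this RFC (it is defined locally because installed headers are not expected to carry it, and the number may differ or be taken by something else in any merged kernel). `fill_msi_novirt()` and `disable_msi_virt()` are hypothetical helper names.

```c
#include <linux/vfio.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Feature number as proposed by this RFC. This is an assumption for the
 * sake of the example, not something installed uAPI headers provide.
 */
#ifndef VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT
#define VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT 11
#endif

/*
 * Fill a VFIO_DEVICE_FEATURE request for the proposed feature. The feature
 * carries no payload, so argsz is just the header size. With probe_only
 * set, the kernel only reports whether SET is supported.
 */
void fill_msi_novirt(struct vfio_device_feature *feature, int probe_only)
{
	memset(feature, 0, sizeof(*feature));
	feature->argsz = sizeof(*feature);
	feature->flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_PCI_MSI_NOVIRT;
	if (probe_only)
		feature->flags |= VFIO_DEVICE_FEATURE_PROBE;
}

/*
 * Probe for the feature on an open VFIO device fd, then enable it.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int disable_msi_virt(int device_fd)
{
	struct vfio_device_feature feature;

	fill_msi_novirt(&feature, 1);
	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, &feature))
		return -1;

	fill_msi_novirt(&feature, 0);
	return ioctl(device_fd, VFIO_DEVICE_FEATURE, &feature);
}
```

After a successful call, reads of the MSI address and data fields through the vfio config region would return the host values, which is what the companion QEMU quirk needs.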