Message ID: 20180418114045.7968-1-shameerali.kolothum.thodi@huawei.com
Series: vfio/type1: Add support for valid iova list management
[Cc +Joerg: AMD-Vi observation towards the end]

On Wed, 18 Apr 2018 12:40:38 +0100
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:

> This series introduces an iova list associated with a vfio
> iommu. The list is kept updated taking care of iommu apertures,
> and reserved regions. Also this series adds checks for any conflict
> with existing dma mappings whenever a new device group is attached to
> the domain.
>
> User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> ioctl capability chains. Any dma map request outside the valid iova
> range will be rejected.

Hi Shameer,

I ran into two minor issues in testing this series, both related to
mdev usage of type1. First, in patch 5/7 when we try to validate a dma
map request:

> +static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> +				       dma_addr_t start, dma_addr_t end)
> +{
> +	struct list_head *iova = &iommu->iova_list;
> +	struct vfio_iova *node;
> +
> +	list_for_each_entry(node, iova, list) {
> +		if ((start >= node->start) && (end <= node->end))
> +			return true;
> +	}
> +
> +	return false;
> +}

A container with only an mdev device will have an empty list because it
has no backing iommu to set ranges or reserved regions, so any dma map
will fail. I think this is resolved as follows:

--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1100,7 +1100,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
 		return true;
 	}
 
-	return false;
+	return list_empty(&iommu->iova_list);
 }

ie. return false only if there was anything to test against.

The second issue is similar, patch 6/7 adds to VFIO_IOMMU_GET_INFO:

+	ret = vfio_iommu_iova_build_caps(iommu, &caps);
+	if (ret)
+		return ret;

And build_caps has:

+	list_for_each_entry(iova, &iommu->iova_list, list)
+		iovas++;
+
+	if (!iovas) {
+		ret = -EINVAL;

Therefore if the iova list is empty, as for mdevs, the user can no
longer even call VFIO_IOMMU_GET_INFO on the container, which is a
regression.
Again, I think the fix is simple:

@@ -2090,7 +2090,7 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
 		iovas++;
 
 	if (!iovas) {
-		ret = -EINVAL;
+		ret = 0;
 		goto out_unlock;
 	}

ie. build_caps needs to handle lack of an iova_list as a non-error.

Also, I wrote a small unit test to validate the iova list for my
systems[1]. With the above changes, my Intel test system gives expected
results:

# ./vfio-type1-iova-list /sys/bus/mdev/devices/c08db5ed-05d3-4b39-b150-438a18bc698f /sys/bus/pci/devices/0000:00:1b.0
---- Adding device: c08db5ed-05d3-4b39-b150-438a18bc698f ----
Initial info struct size: 0x18
No caps
---- Adding device: 0000:00:1b.0 ----
Initial info struct size: 0x18
Requested info struct size: 0x48
New info struct size: 0x48
argsz: 0x48, flags: 0x3, cap_offset: 0x18
00: 4800 0000 0300 0000 00f0 ffff ffff ffff
10: 1800 0000 0000 0000 0100 0100 0000 0000
20: 0200 0000 0000 0000 0000 0000 0000 0000
30: ffff dffe 0000 0000 0000 f0fe 0000 0000
40: ffff ffff ffff ff01
[cap id: 1, version: 1, next: 0x0]
Found type1 iova range version: 1
	00: 0000000000000000 - 00000000fedfffff
	01: 00000000fef00000 - 01ffffffffffffff

Adding an mdev device to the container results in no iova list, adding
the physical device updates to the expected set with the MSI range
excluded.
I was a little surprised by an AMD system:

# ./vfio-type1-iova-list /sys/bus/pci/devices/0000:01:00.0
---- Adding device: 0000:01:00.0 ----
Initial info struct size: 0x18
Requested info struct size: 0x58
New info struct size: 0x58
argsz: 0x58, flags: 0x3, cap_offset: 0x18
00: 5800 0000 0300 0000 00f0 ffff 7fff ffff
10: 1800 0000 0000 0000 0100 0100 0000 0000
20: 0300 0000 0000 0000 0000 0000 0000 0000
30: ffff dffe 0000 0000 0000 f0fe 0000 0000
40: ffff ffff fc00 0000 0000 0000 0001 0000
50: ffff ffff ffff ffff
[cap id: 1, version: 1, next: 0x0]
Found type1 iova range version: 1
	00: 0000000000000000 - 00000000fedfffff
	01: 00000000fef00000 - 000000fcffffffff
	02: 0000010000000000 - ffffffffffffffff

The additional excluded range is the HyperTransport area, which (while
not a problem for 99.9+% of use cases) imposes a rather surprising
limit on the size of a VM we can use on AMD: just under 1TB. I fully
expect we'll see a bug report from that and we should be thinking about
how to address it. Otherwise, if the changes above look good to you
I'll incorporate them into their respective patches and push to my next
branch. Thanks,

Alex

[1] https://gist.github.com/awilliam/25e39252ab28101a2d89ace1ff765fdd
Hi Alex,

> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, May 24, 2018 7:21 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: eric.auger@redhat.com; pmorel@linux.vnet.ibm.com;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; iommu@lists.linux-
> foundation.org; Linuxarm <linuxarm@huawei.com>; John Garry
> <john.garry@huawei.com>; xuwei (O) <xuwei5@huawei.com>; Joerg Roedel
> <joro@8bytes.org>
> Subject: Re: [PATCH v6 0/7] vfio/type1: Add support for valid iova list
> management
>
> [Cc +Joerg: AMD-Vi observation towards the end]
>
> On Wed, 18 Apr 2018 12:40:38 +0100
> Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:
>
> > This series introduces an iova list associated with a vfio
> > iommu. The list is kept updated taking care of iommu apertures,
> > and reserved regions. Also this series adds checks for any conflict
> > with existing dma mappings whenever a new device group is attached to
> > the domain.
> >
> > User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> > ioctl capability chains. Any dma map request outside the valid iova
> > range will be rejected.
>
> Hi Shameer,
>
> I ran into two minor issues in testing this series, both related to
> mdev usage of type1. First, in patch 5/7 when we try to validate a dma
> map request:

I must admit I haven't looked into the mdev use case at all and my
impression was that it would be the same as the others. Thanks for
doing these tests.
> > +static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> > +				       dma_addr_t start, dma_addr_t end)
> > +{
> > +	struct list_head *iova = &iommu->iova_list;
> > +	struct vfio_iova *node;
> > +
> > +	list_for_each_entry(node, iova, list) {
> > +		if ((start >= node->start) && (end <= node->end))
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
>
> A container with only an mdev device will have an empty list because it
> has no backing iommu to set ranges or reserved regions, so any dma map
> will fail. I think this is resolved as follows:
>
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1100,7 +1100,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
>  		return true;
>  	}
>
> -	return false;
> +	return list_empty(&iommu->iova_list);
>  }

Ok.

> ie. return false only if there was anything to test against.
>
> The second issue is similar, patch 6/7 adds to VFIO_IOMMU_GET_INFO:
>
> +	ret = vfio_iommu_iova_build_caps(iommu, &caps);
> +	if (ret)
> +		return ret;
>
> And build_caps has:
>
> +	list_for_each_entry(iova, &iommu->iova_list, list)
> +		iovas++;
> +
> +	if (!iovas) {
> +		ret = -EINVAL;
>
> Therefore if the iova list is empty, as for mdevs, the user can no
> longer even call VFIO_IOMMU_GET_INFO on the container, which is a
> regression. Again, I think the fix is simple:
>
> @@ -2090,7 +2090,7 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
>  		iovas++;
>
>  	if (!iovas) {
> -		ret = -EINVAL;
> +		ret = 0;
>  		goto out_unlock;
>  	}
>
> ie. build_caps needs to handle lack of an iova_list as a non-error.

Ok.

> Also, I wrote a small unit test to validate the iova list for my
> systems[1].
> With the above changes, my Intel test system gives expected
> results:
>
> # ./vfio-type1-iova-list /sys/bus/mdev/devices/c08db5ed-05d3-4b39-b150-438a18bc698f /sys/bus/pci/devices/0000:00:1b.0
> ---- Adding device: c08db5ed-05d3-4b39-b150-438a18bc698f ----
> Initial info struct size: 0x18
> No caps
> ---- Adding device: 0000:00:1b.0 ----
> Initial info struct size: 0x18
> Requested info struct size: 0x48
> New info struct size: 0x48
> argsz: 0x48, flags: 0x3, cap_offset: 0x18
> 00: 4800 0000 0300 0000 00f0 ffff ffff ffff
> 10: 1800 0000 0000 0000 0100 0100 0000 0000
> 20: 0200 0000 0000 0000 0000 0000 0000 0000
> 30: ffff dffe 0000 0000 0000 f0fe 0000 0000
> 40: ffff ffff ffff ff01
> [cap id: 1, version: 1, next: 0x0]
> Found type1 iova range version: 1
> 	00: 0000000000000000 - 00000000fedfffff
> 	01: 00000000fef00000 - 01ffffffffffffff
>
> Adding an mdev device to the container results in no iova list, adding
> the physical device updates to the expected set with the MSI range
> excluded.
>
> I was a little surprised by an AMD system:
>
> # ./vfio-type1-iova-list /sys/bus/pci/devices/0000:01:00.0
> ---- Adding device: 0000:01:00.0 ----
> Initial info struct size: 0x18
> Requested info struct size: 0x58
> New info struct size: 0x58
> argsz: 0x58, flags: 0x3, cap_offset: 0x18
> 00: 5800 0000 0300 0000 00f0 ffff 7fff ffff
> 10: 1800 0000 0000 0000 0100 0100 0000 0000
> 20: 0300 0000 0000 0000 0000 0000 0000 0000
> 30: ffff dffe 0000 0000 0000 f0fe 0000 0000
> 40: ffff ffff fc00 0000 0000 0000 0001 0000
> 50: ffff ffff ffff ffff
> [cap id: 1, version: 1, next: 0x0]
> Found type1 iova range version: 1
> 	00: 0000000000000000 - 00000000fedfffff
> 	01: 00000000fef00000 - 000000fcffffffff
> 	02: 0000010000000000 - ffffffffffffffff
>
> The additional excluded range is the HyperTransport area, which (while
> not a problem for 99.9+% of use cases) imposes a rather surprising
> limit on the size of a VM we can use on AMD: just under 1TB.
> I fully
> expect we'll see a bug report from that and we should be thinking about
> how to address it. Otherwise, if the changes above look good to you
> I'll incorporate them into their respective patches and push to my next
> branch. Thanks,

Yes, the above changes related to the list-empty cases look fine to me.

Much appreciated,
Shameer

> Alex
>
> [1] https://gist.github.com/awilliam/25e39252ab28101a2d89ace1ff765fdd
On Fri, 25 May 2018 08:45:36 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> Yes, the above changes related to the list-empty cases look fine to me.
Thanks Shameer, applied to my next branch with the discussed fixes for
v4.18 (modulo patch 4/7 which Joerg already pushed for v4.17). Thanks,
Alex
[Cc +dwmw2]

On Fri, 25 May 2018 14:54:08 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 25 May 2018 08:45:36 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>
> > Yes, the above changes related to the list-empty cases look fine to me.
>
> Thanks Shameer, applied to my next branch with the discussed fixes for
> v4.18 (modulo patch 4/7 which Joerg already pushed for v4.17). Thanks,

Hi Shameer,

We're still hitting issues with this. VT-d marks reserved regions for
any RMRR ranges, which are IOVA ranges that the BIOS requests to be
identity mapped for a device. These are indicated by these sorts of
log entries on boot:

DMAR: Setting RMRR:
DMAR: Setting identity map for device 0000:00:02.0 [0xbf800000 - 0xcf9fffff]
DMAR: Setting identity map for device 0000:00:1a.0 [0xbe8d1000 - 0xbe8dffff]
DMAR: Setting identity map for device 0000:00:1d.0 [0xbe8d1000 - 0xbe8dffff]

So while for an unaffected device, I see very usable ranges for QEMU,
such as:

	00: 0000000000000000 - 00000000fedfffff
	01: 00000000fef00000 - 01ffffffffffffff

If I try to assign the previously assignable 00:02.0 IGD graphics
device, I get:

	00: 0000000000000000 - 00000000bf7fffff
	01: 00000000cfa00000 - 00000000fedfffff
	02: 00000000fef00000 - 01ffffffffffffff

And we get a fault when QEMU tries the following mapping:

vfio_dma_map(0x55f790421a20, 0x100000, 0xbff00000, 0x7ff163f00000)

A mapping of size 0xbff00000 at iova 0x100000 clearly extends into the
gap starting at 0xbf800000. VT-d is rather split-brained about RMRRs:
typically we'd exclude devices that rely on RMRRs from assignment
entirely, and these reserved ranges would be a welcome mechanism to
avoid conflicts with those ranges, but for RMRR ranges covering IGD and
USB devices we've decided that we don't need to honor the RMRR (see
device_is_rmrr_locked()), yet it's still listed as a reserved range and
bites us here.
Unless we can get VT-d to exclude RMRRs from reserved regions where
we've already seen fit to allow the devices to participate in the IOMMU
API, this would introduce a functional regression for assignment of
these devices. David, should we skip GFX and USB devices with RMRRs in
intel_iommu_get_resv_regions() just as we do in
device_is_rmrr_locked()? Thanks,

Alex
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: 05 June 2018 18:03
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: eric.auger@redhat.com; pmorel@linux.vnet.ibm.com;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; iommu@lists.linux-
> foundation.org; Linuxarm <linuxarm@huawei.com>; John Garry
> <john.garry@huawei.com>; xuwei (O) <xuwei5@huawei.com>; Joerg Roedel
> <joro@8bytes.org>; David Woodhouse <dwmw2@infradead.org>
> Subject: Re: [PATCH v6 0/7] vfio/type1: Add support for valid iova list
> management
>
> [Cc +dwmw2]
>
> On Fri, 25 May 2018 14:54:08 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
> > On Fri, 25 May 2018 08:45:36 +0000
> > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> >
> > > Yes, the above changes related to the list-empty cases look fine to me.
> >
> > Thanks Shameer, applied to my next branch with the discussed fixes for
> > v4.18 (modulo patch 4/7 which Joerg already pushed for v4.17). Thanks,
>
> Hi Shameer,
>
> We're still hitting issues with this. VT-d marks reserved regions for
> any RMRR ranges, which are IOVA ranges that the BIOS requests to be
> identity mapped for a device.
> These are indicated by these sorts of
> log entries on boot:
>
> DMAR: Setting RMRR:
> DMAR: Setting identity map for device 0000:00:02.0 [0xbf800000 - 0xcf9fffff]
> DMAR: Setting identity map for device 0000:00:1a.0 [0xbe8d1000 - 0xbe8dffff]
> DMAR: Setting identity map for device 0000:00:1d.0 [0xbe8d1000 - 0xbe8dffff]
>
> So while for an unaffected device, I see very usable ranges for QEMU,
> such as:
>
> 	00: 0000000000000000 - 00000000fedfffff
> 	01: 00000000fef00000 - 01ffffffffffffff
>
> If I try to assign the previously assignable 00:02.0 IGD graphics
> device, I get:
>
> 	00: 0000000000000000 - 00000000bf7fffff
> 	01: 00000000cfa00000 - 00000000fedfffff
> 	02: 00000000fef00000 - 01ffffffffffffff
>
> And we get a fault when QEMU tries the following mapping:
>
> vfio_dma_map(0x55f790421a20, 0x100000, 0xbff00000, 0x7ff163f00000)
>
> bff00000 clearly extends into the gap starting at bf800000. VT-d is
> rather split-brained about RMRRs, typically we'd exclude devices from
> assignment at all for relying on RMRRs and these reserved ranges
> would be a welcome mechanism to avoid conflicts with those ranges, but
> for RMRR ranges covering IGD and USB devices we've decided that we
> don't need to honor the RMRR (see device_is_rmrr_locked()), but it's
> still listed as a reserved range and bites us here.

Ah..:(. This looks similar to the pci window range reporting issue
faced in the arm world. I see the RFC you have sent out to ignore these
"known" RMRRs. Hope we will be able to resolve this soon.

Thanks,
Shameer

> Unless we can get VT-d to exclude RMRRs from reserved regions where
> we've already seen fit to allow the devices to participate in the IOMMU
> API, this would introduce a functional regression for assignment of
> these devices. David, should we skip GFX and USB devices with RMRRs in
> intel_iommu_get_resv_regions() just as we do in
> device_is_rmrr_locked()? Thanks,
>
> Alex
On Wed, 6 Jun 2018 06:52:08 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: 05 June 2018 18:03
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > Cc: eric.auger@redhat.com; pmorel@linux.vnet.ibm.com;
> > kvm@vger.kernel.org; linux-kernel@vger.kernel.org; iommu@lists.linux-
> > foundation.org; Linuxarm <linuxarm@huawei.com>; John Garry
> > <john.garry@huawei.com>; xuwei (O) <xuwei5@huawei.com>; Joerg Roedel
> > <joro@8bytes.org>; David Woodhouse <dwmw2@infradead.org>
> > Subject: Re: [PATCH v6 0/7] vfio/type1: Add support for valid iova list
> > management
> >
> > [Cc +dwmw2]
> >
> > On Fri, 25 May 2018 14:54:08 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >
> > > On Fri, 25 May 2018 08:45:36 +0000
> > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> > >
> > > > Yes, the above changes related to the list-empty cases look fine to me.
> > >
> > > Thanks Shameer, applied to my next branch with the discussed fixes for
> > > v4.18 (modulo patch 4/7 which Joerg already pushed for v4.17). Thanks,
> >
> > Hi Shameer,
> >
> > We're still hitting issues with this. VT-d marks reserved regions for
> > any RMRR ranges, which are IOVA ranges that the BIOS requests to be
> > identity mapped for a device.
> > These are indicated by these sorts of
> > log entries on boot:
> >
> > DMAR: Setting RMRR:
> > DMAR: Setting identity map for device 0000:00:02.0 [0xbf800000 - 0xcf9fffff]
> > DMAR: Setting identity map for device 0000:00:1a.0 [0xbe8d1000 - 0xbe8dffff]
> > DMAR: Setting identity map for device 0000:00:1d.0 [0xbe8d1000 - 0xbe8dffff]
> >
> > So while for an unaffected device, I see very usable ranges for QEMU,
> > such as:
> >
> > 	00: 0000000000000000 - 00000000fedfffff
> > 	01: 00000000fef00000 - 01ffffffffffffff
> >
> > If I try to assign the previously assignable 00:02.0 IGD graphics
> > device, I get:
> >
> > 	00: 0000000000000000 - 00000000bf7fffff
> > 	01: 00000000cfa00000 - 00000000fedfffff
> > 	02: 00000000fef00000 - 01ffffffffffffff
> >
> > And we get a fault when QEMU tries the following mapping:
> >
> > vfio_dma_map(0x55f790421a20, 0x100000, 0xbff00000, 0x7ff163f00000)
> >
> > bff00000 clearly extends into the gap starting at bf800000. VT-d is
> > rather split-brained about RMRRs, typically we'd exclude devices from
> > assignment at all for relying on RMRRs and these reserved ranges
> > would be a welcome mechanism to avoid conflicts with those ranges, but
> > for RMRR ranges covering IGD and USB devices we've decided that we
> > don't need to honor the RMRR (see device_is_rmrr_locked()), but it's
> > still listed as a reserved range and bites us here.
>
> Ah..:(. This looks similar to the pci window range reporting issue
> faced in the arm world. I see the RFC you have sent out to ignore these
> "known" RMRRs. Hope we will be able to resolve this soon.

In the meantime, it seems that I need to drop the iova list from my
branch for v4.18. I don't think it makes much sense to expose it to
userspace if we cannot also enforce it, and it seems like it presents
API issues if we were to expose the iova list without enforcement and
later try to enforce it. Sorry I didn't catch this earlier. Thanks,

Alex