[v5,0/7] vfio/type1: Add support for valid iova list management

Message ID 20180315163509.17740-1-shameerali.kolothum.thodi@huawei.com

Message

Shameerali Kolothum Thodi March 15, 2018, 4:35 p.m. UTC
This series introduces an iova list associated with a vfio
iommu. The list is kept up to date, taking into account the iommu
apertures and reserved regions. The series also adds checks for any
conflict with existing dma mappings whenever a new device group is
attached to the domain.

User-space can retrieve the valid iova ranges using VFIO_IOMMU_GET_INFO
ioctl capability chains. Any dma map request outside the valid iova
ranges will be rejected.
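
For illustration, reading the ranges back from user space amounts to the
usual two-call GET_INFO pattern plus a walk of the capability chain. The
sketch below assumes the capability id and structures added by the uapi
patch in this series (VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE and
struct vfio_iommu_type1_info_cap_iova_range); error handling is trimmed:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Print the valid iova ranges advertised by a type1 container fd. */
static void dump_iova_ranges(int container)
{
	size_t argsz = sizeof(struct vfio_iommu_type1_info);
	struct vfio_iommu_type1_info *info = calloc(1, argsz);
	struct vfio_info_cap_header *hdr;

	info->argsz = argsz;
	/* First call reports how much space the capability chain needs. */
	ioctl(container, VFIO_IOMMU_GET_INFO, info);
	if (info->argsz > argsz) {
		argsz = info->argsz;
		info = realloc(info, argsz);
		memset(info, 0, argsz);
		info->argsz = argsz;
		ioctl(container, VFIO_IOMMU_GET_INFO, info);
	}

	if (!(info->flags & VFIO_IOMMU_INFO_CAPS) || !info->cap_offset)
		goto out;

	/* Walk the capability chain; offsets are from the start of info. */
	hdr = (struct vfio_info_cap_header *)((char *)info + info->cap_offset);
	for (;;) {
		if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
			struct vfio_iommu_type1_info_cap_iova_range *cap =
				(struct vfio_iommu_type1_info_cap_iova_range *)hdr;
			unsigned int i;

			for (i = 0; i < cap->nr_iovas; i++)
				printf("iova range %u: 0x%llx - 0x%llx\n", i,
				       (unsigned long long)cap->iova_ranges[i].start,
				       (unsigned long long)cap->iova_ranges[i].end);
		}
		if (!hdr->next)
			break;
		hdr = (struct vfio_info_cap_header *)((char *)info + hdr->next);
	}
out:
	free(info);
}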


v4 --> v5
 -Rebased to next-20180315.
 -Incorporated the corner-case bug fix suggested by Alex into patch #5.
 -Based on suggestions from Alex and Robin, added patch #7. This
  moves the PCI window reservation back into the DMA-specific path,
  to fix the issue reported by Eric [1].

Note:
Patch #7 has a dependency on [2] and [3].

1. https://patchwork.kernel.org/patch/10232043/
2. https://patchwork.kernel.org/patch/10216553/
3. https://patchwork.kernel.org/patch/10216555/

v3 --> v4
 Addressed comments received for v3.
 -dma_addr_t instead of phys_addr_t
 -LIST_HEAD() usage.
 -Free up iova_copy list in case of error.
 -Updated the logic for filling the iova caps info (patch #5).

RFCv2 --> v3
 Removed RFC tag.
 Addressed comments from Alex and Eric:
 - Added comments to make the iova list management logic clearer.
 - Use an iova list copy so that the original is not altered in
   case of failure.

RFCv1 --> RFCv2
 Addressed comments from Alex:
-Introduced IOVA list management and added checks for conflicts with 
 existing dma map entries during attach/detach.

Shameer Kolothum (2):
  vfio/type1: Add IOVA range capability support
  iommu/dma: Move PCI window region reservation back into dma specific
    path.

Shameerali Kolothum Thodi (5):
  vfio/type1: Introduce iova list and add iommu aperture validity check
  vfio/type1: Check reserve region conflict and update iova list
  vfio/type1: Update iova list on detach
  vfio/type1: check dma map request is within a valid iova range
  vfio/type1: remove duplicate retrieval of reserved regions

 drivers/iommu/dma-iommu.c       |  54 ++---
 drivers/vfio/vfio_iommu_type1.c | 497 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h       |  23 ++
 3 files changed, 533 insertions(+), 41 deletions(-)

-- 
2.7.4

Comments

Tian, Kevin March 19, 2018, 8:28 a.m. UTC | #1
> From: Shameer Kolothum
> Sent: Friday, March 16, 2018 12:35 AM
>
> This series introduces an iova list associated with a vfio
> iommu. The list is kept updated taking care of iommu apertures,
> and reserved regions. Also this series adds checks for any conflict
> with existing dma mappings whenever a new device group is attached to
> the domain.
>
> User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> ioctl capability chains. Any dma map request outside the valid iova
> range will be rejected.


GET_INFO is done at initialization time, which is good for cold-attached
devices. If a hotplugged device may cause the valid iova ranges to change
at run-time, then there could be a potential problem (which, however, is
difficult for user space or the orchestration stack to figure out in
advance). Can we do some extension like below to make the hotplug case
cleaner?

- An interface allowing user space to request that VFIO reject further
attach_group calls if doing so may cause an iova range change, e.g. QEMU
can make such a request once the initial GET_INFO completes;

- or an event notification to user space upon a change of the valid iova
ranges when attaching a new device at run-time. This goes one step
further: even if an attach may cause an iova range change, it may still
succeed as long as QEMU hasn't allocated any iova in the impacted
range.

Thanks
Kevin

Shameerali Kolothum Thodi March 19, 2018, 10:54 a.m. UTC | #2
Hi Kevin,

Thanks for taking a look at this series.

> -----Original Message-----
> From: Tian, Kevin [mailto:kevin.tian@intel.com]
> Sent: Monday, March 19, 2018 8:29 AM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> alex.williamson@redhat.com; eric.auger@redhat.com;
> pmorel@linux.vnet.ibm.com
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org; xuwei (O)
> <xuwei5@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> iommu@lists.linux-foundation.org
> Subject: RE: [PATCH v5 0/7] vfio/type1: Add support for valid iova list
> management
>
> > From: Shameer Kolothum
> > Sent: Friday, March 16, 2018 12:35 AM
> >
> > This series introduces an iova list associated with a vfio
> > iommu. The list is kept updated taking care of iommu apertures,
> > and reserved regions. Also this series adds checks for any conflict
> > with existing dma mappings whenever a new device group is attached to
> > the domain.
> >
> > User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> > ioctl capability chains. Any dma map request outside the valid iova
> > range will be rejected.
>
> GET_INFO is done at initialization time which is good for cold attached
> devices. If a hotplugged device may cause change of valid iova ranges
> at run-time, then there could be potential problem (which however is
> difficult for user space or orchestration stack to figure out in advance)
> Can we do some extension like below to make hotplug case cleaner?

As I understand it, if a hotplugged device results in an update to the
valid iova ranges, then in QEMU the vfio_connect_container() -->
memory_listener_register() path will fail if there is a conflict, since
patch #4 checks for invalid dma map requests.
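
For reference, the check added by patch #4 boils down to something like
the simplified sketch below; the vfio_iova node and per-iommu iova_list
names are assumptions based on the cover letter rather than the exact
code in the patch:

#include <linux/list.h>
#include <linux/types.h>

/* One valid iova range tracked on the per-iommu list (assumed layout). */
struct vfio_iova {
	struct list_head	list;
	dma_addr_t		start;
	dma_addr_t		end;
};

/*
 * A map request [start, end] is accepted only if it fits entirely
 * within one of the ranges on the valid iova list.
 */
static bool iova_range_is_valid(struct list_head *iova_list,
				dma_addr_t start, dma_addr_t end)
{
	struct vfio_iova *node;

	list_for_each_entry(node, iova_list, list) {
		if (start >= node->start && end <= node->end)
			return true;
	}

	return false;	/* reject: not contained in any valid range */
}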

I am not sure whether your concern is about preventing hotplug well before
this point.

Thanks,
Shameer

> - An interface allowing user space to request VFIO rejecting further
> attach_group if doing so may cause iova range change. e.g. Qemu can
> do such request once completing initial GET_INFO;
>
> - or an event notification to user space upon change of valid iova
> ranges when attaching a new device at run-time. It goes one step
> further - even attach may cause iova range change, it may still
> succeed as long as Qemu hasn't allocated any iova in impacted
> range
>
> Thanks
> Kevin

Tian, Kevin March 19, 2018, 12:12 p.m. UTC | #3
> From: Shameerali Kolothum Thodi
> [mailto:shameerali.kolothum.thodi@huawei.com]
>
> Hi Kevin,
>
> Thanks for taking a look at this series.
>
> > -----Original Message-----
> > From: Tian, Kevin [mailto:kevin.tian@intel.com]
> > Sent: Monday, March 19, 2018 8:29 AM
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> > alex.williamson@redhat.com; eric.auger@redhat.com;
> > pmorel@linux.vnet.ibm.com
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org; xuwei (O)
> > <xuwei5@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> > iommu@lists.linux-foundation.org
> > Subject: RE: [PATCH v5 0/7] vfio/type1: Add support for valid iova list
> > management
> >
> > > From: Shameer Kolothum
> > > Sent: Friday, March 16, 2018 12:35 AM
> > >
> > > This series introduces an iova list associated with a vfio
> > > iommu. The list is kept updated taking care of iommu apertures,
> > > and reserved regions. Also this series adds checks for any conflict
> > > with existing dma mappings whenever a new device group is attached
> > > to the domain.
> > >
> > > User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> > > ioctl capability chains. Any dma map request outside the valid iova
> > > range will be rejected.
> >
> > GET_INFO is done at initialization time which is good for cold attached
> > devices. If a hotplugged device may cause change of valid iova ranges
> > at run-time, then there could be potential problem (which however is
> > difficult for user space or orchestration stack to figure out in advance)
> > Can we do some extension like below to make hotplug case cleaner?
>
> As I understand, in case a hotplugged device results in an update to the
> valid Iova ranges then the Qemu, vfio_connect_container() -->
> memory_listner_register() will fail if there is a conflict as patch #4
> checks for invalid dma map requests.

OK, possibly QEMU can do another GET_INFO upon any dma map
error to get the latest ranges and then allocate a new valid iova to
redo the map. This should work if the valid ranges shrink due to a
newly hotplugged device. But hot-removing a device may result in
more valid ranges, and so far there is no way for QEMU to pick those up.
I'm not sure whether we want to go that far though...
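
A rough user-space sketch of that re-query-and-retry fallback could look
like the following; the helper that picks a fresh iova from the
re-queried ranges is hypothetical:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical helper: re-runs VFIO_IOMMU_GET_INFO and returns a free
 * iova of 'size' bytes taken from the refreshed valid range list. */
extern uint64_t pick_iova_from_refreshed_ranges(int container, uint64_t size);

static int map_with_retry(int container, void *buf, uint64_t iova,
			  uint64_t size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,
		.iova  = iova,
		.size  = size,
	};

	if (!ioctl(container, VFIO_IOMMU_MAP_DMA, &map))
		return 0;

	/* Map rejected: re-read the valid iova ranges and retry elsewhere. */
	map.iova = pick_iova_from_refreshed_ranges(container, size);
	return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}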

> Not sure, your concern is preventing hotplug much before this happens or
> not.

Yes, my earlier thought was more about catching the problem in the attach
phase.

Thanks
Kevin
Alex Williamson March 20, 2018, 10:55 p.m. UTC | #4
On Mon, 19 Mar 2018 08:28:32 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Shameer Kolothum
> > Sent: Friday, March 16, 2018 12:35 AM
> >
> > This series introduces an iova list associated with a vfio
> > iommu. The list is kept updated taking care of iommu apertures,
> > and reserved regions. Also this series adds checks for any conflict
> > with existing dma mappings whenever a new device group is attached to
> > the domain.
> >
> > User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> > ioctl capability chains. Any dma map request outside the valid iova
> > range will be rejected.
>
> GET_INFO is done at initialization time which is good for cold attached
> devices. If a hotplugged device may cause change of valid iova ranges
> at run-time, then there could be potential problem (which however is
> difficult for user space or orchestration stack to figure out in advance)
> Can we do some extension like below to make hotplug case cleaner?

Let's be clear what we mean by hotplug here, as I see it, the only
relevant hotplug would be a physical device, hot added to the host,
which becomes a member of an existing, in-use IOMMU group.  If, on the
other hand, we're talking about hotplug related to the user process,
there's nothing asynchronous about that.  For instance in the QEMU
case, QEMU must add the group to the container, at which point it can
evaluate the new iova list and remove the group from the container if
it doesn't like the result.  So what would be a case of the available
iova list for a group changing as a result of adding a device?

> - An interface allowing user space to request VFIO rejecting further
> attach_group if doing so may cause iova range change. e.g. Qemu can
> do such request once completing initial GET_INFO;

For the latter case above, it seems unnecessary, QEMU can revert the
attach, we're only promising that the attach won't interfere with
existing mappings.  For the host hotplug case, the user has no control,
the new device is a member of the iommu group and therefore necessarily
becomes a part of container.  I imagine there are plenty of other holes
in this scenario already.
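
As a rough illustration of the attach-then-verify flow described above,
using only existing VFIO uAPI ioctls (the iova-list check itself is left
as a stub):

#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Stub: walk the VFIO_IOMMU_GET_INFO capability chain and decide whether
 * the (possibly shrunk) valid iova list is still acceptable. */
extern bool iova_list_acceptable(int container);

/* Attach a group to an existing container; back out if the resulting
 * valid iova list doesn't suit the user. */
static int attach_group_checked(int container, const char *group_path)
{
	int group = open(group_path, O_RDWR);

	if (group < 0)
		return -1;

	/* The kernel already rejects the attach if it would conflict
	 * with existing dma mappings. */
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container)) {
		close(group);
		return -1;
	}

	if (!iova_list_acceptable(container)) {
		/* Revert the attach: existing mappings are unaffected. */
		ioctl(group, VFIO_GROUP_UNSET_CONTAINER);
		close(group);
		return -1;
	}

	return group;
}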
 
> - or an event notification to user space upon change of valid iova
> ranges when attaching a new device at run-time. It goes one step
> further - even attach may cause iova range change, it may still
> succeed as long as Qemu hasn't allocated any iova in impacted
> range

Same as above, in the QEMU hotplug case, the user is orchestrating the
adding of the group to the container; they can then check the iommu
info on their own and determine what, if any, changes are relevant to
them, knowing that the addition would not succeed if any current
mappings were affected.  In the host case, a notification would be
required, but we'd first need to identify exactly how the iova list can
change asynchronously to the user doing anything.  Thanks,

Alex
Tian, Kevin March 21, 2018, 3:28 a.m. UTC | #5
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, March 21, 2018 6:55 AM
>
> On Mon, 19 Mar 2018 08:28:32 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > > From: Shameer Kolothum
> > > Sent: Friday, March 16, 2018 12:35 AM
> > >
> > > This series introduces an iova list associated with a vfio
> > > iommu. The list is kept updated taking care of iommu apertures,
> > > and reserved regions. Also this series adds checks for any conflict
> > > with existing dma mappings whenever a new device group is attached
> > > to the domain.
> > >
> > > User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> > > ioctl capability chains. Any dma map request outside the valid iova
> > > range will be rejected.
> >
> > GET_INFO is done at initialization time which is good for cold attached
> > devices. If a hotplugged device may cause change of valid iova ranges
> > at run-time, then there could be potential problem (which however is
> > difficult for user space or orchestration stack to figure out in advance)
> > Can we do some extension like below to make hotplug case cleaner?
>
> Let's be clear what we mean by hotplug here, as I see it, the only
> relevant hotplug would be a physical device, hot added to the host,
> which becomes a member of an existing, in-use IOMMU group.  If, on the
> other hand, we're talking about hotplug related to the user process,
> there's nothing asynchronous about that.  For instance in the QEMU
> case, QEMU must add the group to the container, at which point it can
> evaluate the new iova list and remove the group from the container if
> it doesn't like the result.  So what would be a case of the available
> iova list for a group changing as a result of adding a device?

My original thought was about the latter case. At that moment
I was not sure whether the window between adding/removing
the group might cause some issue if some IOVA allocations
happened to be in flight in parallel. But it looks like QEMU can
handle it well anyway, as long as such a scenario is considered.

> > - An interface allowing user space to request VFIO rejecting further
> > attach_group if doing so may cause iova range change. e.g. Qemu can
> > do such request once completing initial GET_INFO;
>
> For the latter case above, it seems unnecessary, QEMU can revert the
> attach, we're only promising that the attach won't interfere with
> existing mappings.  For the host hotplug case, the user has no control,
> the new device is a member of the iommu group and therefore necessarily
> becomes a part of container.  I imagine there are plenty of other holes
> in this scenario already.
>
> > - or an event notification to user space upon change of valid iova
> > ranges when attaching a new device at run-time. It goes one step
> > further - even attach may cause iova range change, it may still
> > succeed as long as Qemu hasn't allocated any iova in impacted
> > range
>
> Same as above, in the QEMU hotplug case, the user is orchestrating the
> adding of the group to the container, they can then check the iommu
> info on their own and determine what, if any, changes are relevant to
> them, knowing that the addition would not succeed if any current
> mappings where affected.  In the host case, a notification would be
> required, but we'd first need to identify exactly how the iova list can
> change asynchronous to the user doing anything.  Thanks,

For host hotplug, possibly the notification could be an opt-in model,
meaning that if user space doesn't explicitly request notification of
such an event, the device is just left in an unused state (vfio-pci
still claims the device, assuming it is assigned to the container
owner, but the owner doesn't use it).

Thanks
Kevin
Alex Williamson March 21, 2018, 5:11 p.m. UTC | #6
On Wed, 21 Mar 2018 03:28:16 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, March 21, 2018 6:55 AM
> >
> > On Mon, 19 Mar 2018 08:28:32 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > > From: Shameer Kolothum
> > > > Sent: Friday, March 16, 2018 12:35 AM
> > > >
> > > > This series introduces an iova list associated with a vfio
> > > > iommu. The list is kept updated taking care of iommu apertures,
> > > > and reserved regions. Also this series adds checks for any conflict
> > > > with existing dma mappings whenever a new device group is attached
> > > > to the domain.
> > > >
> > > > User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> > > > ioctl capability chains. Any dma map request outside the valid iova
> > > > range will be rejected.
> > >
> > > GET_INFO is done at initialization time which is good for cold attached
> > > devices. If a hotplugged device may cause change of valid iova ranges
> > > at run-time, then there could be potential problem (which however is
> > > difficult for user space or orchestration stack to figure out in advance)
> > > Can we do some extension like below to make hotplug case cleaner?
> >
> > Let's be clear what we mean by hotplug here, as I see it, the only
> > relevant hotplug would be a physical device, hot added to the host,
> > which becomes a member of an existing, in-use IOMMU group.  If, on the
> > other hand, we're talking about hotplug related to the user process,
> > there's nothing asynchronous about that.  For instance in the QEMU
> > case, QEMU must add the group to the container, at which point it can
> > evaluate the new iova list and remove the group from the container if
> > it doesn't like the result.  So what would be a case of the available
> > iova list for a group changing as a result of adding a device?
>
> My original thought was about the latter case. At that moment
> I was not sure whether the window between adding/removing
> the group may cause some issue if there are right some IOVA
> allocations happening in parallel. But looks Qemu can anyway
> handle it well as long as such scenario is considered.

I believe the kernel patches here and the existing code are using
locking to prevent races between mapping changes and device changes, so
the acceptance of a new group into a container and the iova list for a
container should always be correct for the order these operations
arrive.  Beyond that, I don't believe it's the kernel's responsibility
to do anything more than block groups from being added if they conflict
with current mappings.  The user already has the minimum interfaces
they need to manage other scenarios.

> > > - An interface allowing user space to request VFIO rejecting further
> > > attach_group if doing so may cause iova range change. e.g. Qemu can
> > > do such request once completing initial GET_INFO;
> >
> > For the latter case above, it seems unnecessary, QEMU can revert the
> > attach, we're only promising that the attach won't interfere with
> > existing mappings.  For the host hotplug case, the user has no control,
> > the new device is a member of the iommu group and therefore necessarily
> > becomes a part of container.  I imagine there are plenty of other holes
> > in this scenario already.
> >
> > > - or an event notification to user space upon change of valid iova
> > > ranges when attaching a new device at run-time. It goes one step
> > > further - even attach may cause iova range change, it may still
> > > succeed as long as Qemu hasn't allocated any iova in impacted
> > > range
> >
> > Same as above, in the QEMU hotplug case, the user is orchestrating the
> > adding of the group to the container, they can then check the iommu
> > info on their own and determine what, if any, changes are relevant to
> > them, knowing that the addition would not succeed if any current
> > mappings where affected.  In the host case, a notification would be
> > required, but we'd first need to identify exactly how the iova list can
> > change asynchronous to the user doing anything.  Thanks,
>
> for host hotplug, possibly notification could be an opt-in model.
> meaning if user space doesn't explicitly request receiving notification
> on such event, the device is just left in unused state (vfio-pci still
> claims the device, assuming it assigned to the container owner, but
> the owner doesn't use it)

Currently, if a device is added to a live group, the kernel will print
a warning.  We have a todo to bind that device to a vfio-bus driver,
but I'm not sure that isn't an overreach into the system policy
decisions.  If system policy decides to bind to a vfio bus driver, all
is well, but we might be missing the iommu backend adding the device to
the iommu domain, so it probably won't work unless the requester ID is
actually aliased to the IOMMU (such as a conventional PCI device), a
standard PCIe device simply wouldn't be part of the IOMMU domain and
would generate IOMMU faults if the user attempts to use it (AFAICT).
OTOH, if system policy binds the device to a native host driver, then
the integrity of the group for userspace use is compromised, which is
terminated with prejudice via a BUG.  Obviously the user is never
obligated to listen to signals and we don't provide a signal here as
this scenario is mostly theoretical, though it would be relatively easy
with software hotplug to artificially induce such a condition.

However, I'm still not sure how adding a device to a group is
necessarily relevant to this series, particularly how it would change
the iova list.  When we add groups to a container, we're potentially
crossing boundaries between IOMMUs and may therefore pick up new
reserved resource requirements, but for devices within a group, the
reserved regions should already be accounted for in the existing group.
At least so long as we've decided reserved regions are only the IOMMU
imposed reserved regions and not routing within the group, which we've
hand waved as the user's problem already.  Thanks,

Alex
Tian, Kevin March 22, 2018, 9:10 a.m. UTC | #7
> From: Alex Williamson
> Sent: Thursday, March 22, 2018 1:11 AM
>
> On Wed, 21 Mar 2018 03:28:16 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, March 21, 2018 6:55 AM
> > >
> > > On Mon, 19 Mar 2018 08:28:32 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Shameer Kolothum
> > > > > Sent: Friday, March 16, 2018 12:35 AM
> > > > >
> > > > > This series introduces an iova list associated with a vfio
> > > > > iommu. The list is kept updated taking care of iommu apertures,
> > > > > and reserved regions. Also this series adds checks for any conflict
> > > > > with existing dma mappings whenever a new device group is attached
> > > > > to the domain.
> > > > >
> > > > > User-space can retrieve valid iova ranges using VFIO_IOMMU_GET_INFO
> > > > > ioctl capability chains. Any dma map request outside the valid iova
> > > > > range will be rejected.
> > > >
> > > > GET_INFO is done at initialization time which is good for cold attached
> > > > devices. If a hotplugged device may cause change of valid iova ranges
> > > > at run-time, then there could be potential problem (which however is
> > > > difficult for user space or orchestration stack to figure out in advance)
> > > > Can we do some extension like below to make hotplug case cleaner?
> > >
> > > Let's be clear what we mean by hotplug here, as I see it, the only
> > > relevant hotplug would be a physical device, hot added to the host,
> > > which becomes a member of an existing, in-use IOMMU group.  If, on the
> > > other hand, we're talking about hotplug related to the user process,
> > > there's nothing asynchronous about that.  For instance in the QEMU
> > > case, QEMU must add the group to the container, at which point it can
> > > evaluate the new iova list and remove the group from the container if
> > > it doesn't like the result.  So what would be a case of the available
> > > iova list for a group changing as a result of adding a device?
> >
> > My original thought was about the latter case. At that moment
> > I was not sure whether the window between adding/removing
> > the group may cause some issue if there are right some IOVA
> > allocations happening in parallel. But looks Qemu can anyway
> > handle it well as long as such scenario is considered.
>
> I believe the kernel patches here and the existing code are using
> locking to prevent races between mapping changes and device changes, so
> the acceptance of a new group into a container and the iova list for a
> container should always be correct for the order these operations
> arrive.  Beyond that, I don't believe it's the kernel's responsibility
> to do anything more than block groups from being added if they conflict
> with current mappings.  The user already has the minimum interfaces
> they need to manage other scenarios.

Agree

> > > > - An interface allowing user space to request VFIO rejecting further
> > > > attach_group if doing so may cause iova range change. e.g. Qemu can
> > > > do such request once completing initial GET_INFO;
> > >
> > > For the latter case above, it seems unnecessary, QEMU can revert the
> > > attach, we're only promising that the attach won't interfere with
> > > existing mappings.  For the host hotplug case, the user has no control,
> > > the new device is a member of the iommu group and therefore necessarily
> > > becomes a part of container.  I imagine there are plenty of other holes
> > > in this scenario already.
> > >
> > > > - or an event notification to user space upon change of valid iova
> > > > ranges when attaching a new device at run-time. It goes one step
> > > > further - even attach may cause iova range change, it may still
> > > > succeed as long as Qemu hasn't allocated any iova in impacted
> > > > range
> > >
> > > Same as above, in the QEMU hotplug case, the user is orchestrating the
> > > adding of the group to the container, they can then check the iommu
> > > info on their own and determine what, if any, changes are relevant to
> > > them, knowing that the addition would not succeed if any current
> > > mappings where affected.  In the host case, a notification would be
> > > required, but we'd first need to identify exactly how the iova list can
> > > change asynchronous to the user doing anything.  Thanks,
> >
> > for host hotplug, possibly notification could be an opt-in model.
> > meaning if user space doesn't explicitly request receiving notification
> > on such event, the device is just left in unused state (vfio-pci still
> > claims the device, assuming it assigned to the container owner, but
> > the owner doesn't use it)
>
> Currently, if a device is added to a live group, the kernel will print
> a warning.  We have a todo to bind that device to a vfio-bus driver,
> but I'm not sure that isn't an overreach into the system policy
> decisions.  If system policy decides to bind to a vfio bus driver, all
> is well, but we might be missing the iommu backend adding the device to
> the iommu domain, so it probably won't work unless the requester ID is
> actually aliased to the IOMMU (such as a conventional PCI device), a
> standard PCIe device simply wouldn't be part of the IOMMU domain and
> would generate IOMMU faults if the user attempts to use it (AFAICT).
> OTOH, if system policy binds the device to a native host driver, then
> the integrity of the group for userspace use is compromised, which is
> terminated with prejudice via a BUG.  Obviously the user is never
> obligated to listen to signals and we don't provide a signal here as
> this scenario is mostly theoretical, though it would be relatively easy
> with software hotplug to artificially induce such a condition.
>
> However, I'm still not sure how adding a device to a group is
> necessarily relevant to this series, particularly how it would change
> the iova list.  When we add groups to a container, we're potentially
> crossing boundaries between IOMMUs and may therefore pickup new
> reserved resource requirements, but devices within a group seem like
> reserved regions should already be accounted for in the existing group.
> At least so long as we've decided reserved regions are only the IOMMU
> imposed reserved regions and not routing within the group, which we've
> hand waved as the user's problem already.  Thanks,

Oh, it's not relevant to this series now. As I said, my earlier concern
was more about the guest hotplug side, which has been closed. Sorry for
distracting the thread by replying further about host hotplug, which
should be in a separate thread if necessary (so I'll stop commenting
here until there is a real need for that part. :-)

Thanks
Kevin