
[v2,0/6] IOMMUFD: Deliver IO page faults to user space

Message ID 20231026024930.382898-1-baolu.lu@linux.intel.com

Message

Baolu Lu Oct. 26, 2023, 2:49 a.m. UTC
Hi folks,

This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework for nested translation. Nested
translation is a hardware feature that supports two-stage translation
tables for IOMMU. The second-stage translation table is managed by the
host VMM, while the first-stage translation table is owned by user
space. This allows user space to control the IOMMU mappings for its
devices.

When an IO page fault occurs on the first-stage translation table, the
IOMMU hardware can deliver the page fault to user space through the
IOMMUFD framework. User space can then handle the page fault and respond
to the device top-down through the IOMMUFD. This allows user space to
implement its own IO page fault handling policies.

User space indicates its capability of handling IO page faults by
setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
hardware page table (HWPT). IOMMUFD will then set up its infrastructure
for page fault delivery. On a successful return of HWPT allocation, the
user can retrieve and respond to page faults by reading and writing to
the file descriptor (FD) returned in out_fault_fd.
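
As a rough illustration, the intended user-space flow might look like
the sketch below, assuming the uapi proposed in patch 2 (needs
<sys/ioctl.h> and <linux/iommufd.h>; the exact layout may change as
the series evolves):

	struct iommu_hwpt_alloc cmd = {
		.size = sizeof(cmd),
		.flags = IOMMU_HWPT_ALLOC_IOPF_CAPABLE,
		.dev_id = dev_id,
		.pt_id = parent_id,
	};

	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
		err(1, "HWPT alloc failed");

	/* Read fault messages from, and write responses to, this fd */
	int fault_fd = cmd.out_fault_fd;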

The iommu selftest framework has been updated to test the IO page fault
delivery and response functionality.

This series is based on the latest implementation of nested translation
under discussion [1] and the page fault handling framework refactoring in
the IOMMU core [2].

The series and related patches are available on GitHub: [3]

[1] https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
[2] https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
[3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2

Best regards,
baolu

Change log:
v2:
 - Move all iommu refactoring patches into a separate series and discuss
   it in a different thread. The latest patch series [v6] is available at
   https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
 - We discussed the timeout of the pending page fault messages. We
   agreed that we shouldn't apply any timeout policy for the page fault
   handling in user space.
   https://lore.kernel.org/linux-iommu/20230616113232.GA84678@myrica/
 - Jason suggested that we adopt a simple file descriptor interface for
   reading and responding to I/O page requests, so that user space
   applications can improve performance using io_uring.
   https://lore.kernel.org/linux-iommu/ZJWjD1ajeem6pK3I@ziepe.ca/

v1: https://lore.kernel.org/linux-iommu/20230530053724.232765-1-baolu.lu@linux.intel.com/

Lu Baolu (6):
  iommu: Add iommu page fault cookie helpers
  iommufd: Add iommu page fault uapi data
  iommufd: Initializing and releasing IO page fault data
  iommufd: Deliver fault messages to user space
  iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_IOPF test support
  iommufd/selftest: Add coverage for IOMMU_TEST_OP_TRIGGER_IOPF

 include/linux/iommu.h                         |   9 +
 drivers/iommu/iommu-priv.h                    |  15 +
 drivers/iommu/iommufd/iommufd_private.h       |  12 +
 drivers/iommu/iommufd/iommufd_test.h          |   8 +
 include/uapi/linux/iommufd.h                  |  65 +++++
 tools/testing/selftests/iommu/iommufd_utils.h |  66 ++++-
 drivers/iommu/io-pgfault.c                    |  50 ++++
 drivers/iommu/iommufd/device.c                |  69 ++++-
 drivers/iommu/iommufd/hw_pagetable.c          | 260 +++++++++++++++++-
 drivers/iommu/iommufd/selftest.c              |  56 ++++
 tools/testing/selftests/iommu/iommufd.c       |  24 +-
 .../selftests/iommu/iommufd_fail_nth.c        |   2 +-
 12 files changed, 620 insertions(+), 16 deletions(-)

Comments

Jason Gunthorpe Nov. 2, 2023, 12:47 p.m. UTC | #1
On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> Hi folks,
> 
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.

Having now looked more closely at the ARM requirements it seems we
will need generic events, not just page fault events to have a
complete emulation.

So I'd like to see this generalized into a channel to carry any
events..

> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.

This is the right way to approach it, and more broadly this shouldn't
be an iommufd specific thing. Kernel drivers will also need to create
fault capable PAGING iommu domains.

Jason
Tian, Kevin Nov. 7, 2023, 8:35 a.m. UTC | #2
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Thursday, November 2, 2023 8:48 PM
>
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > Hi folks,
> >
> > This series implements the functionality of delivering IO page faults to
> > user space through the IOMMUFD framework for nested translation.
> Nested
> > translation is a hardware feature that supports two-stage translation
> > tables for IOMMU. The second-stage translation table is managed by the
> > host VMM, while the first-stage translation table is owned by user
> > space. This allows user space to control the IOMMU mappings for its
> > devices.
> 
> Having now looked more closely at the ARM requirements it seems we
> will need generic events, not just page fault events to have a
> complete emulation.

Can you elaborate?

> 
> So I'd like to see this generalized into a channel to carry any
> events..
> 
> > User space indicates its capability of handling IO page faults by
> > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> > for page fault delivery. On a successful return of HWPT allocation, the
> > user can retrieve and respond to page faults by reading and writing to
> > the file descriptor (FD) returned in out_fault_fd.
> 
> This is the right way to approach it, and more broadly this shouldn't
> be an iommufd specific thing. Kernel drivers will also need to create
> fault capable PAGING iommu domains.
> 

Are you suggesting a common interface used by both iommufd and
kernel drivers?

but I didn't get the last piece. If those domains are created by kernel
drivers why would they require a uAPI for userspace to specify fault
capable?
Jason Gunthorpe Nov. 7, 2023, 5:54 p.m. UTC | #3
On Tue, Nov 07, 2023 at 08:35:10AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > Sent: Thursday, November 2, 2023 8:48 PM
> >
> > On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > > Hi folks,
> > >
> > > This series implements the functionality of delivering IO page faults to
> > > user space through the IOMMUFD framework for nested translation.
> > Nested
> > > translation is a hardware feature that supports two-stage translation
> > > tables for IOMMU. The second-stage translation table is managed by the
> > > host VMM, while the first-stage translation table is owned by user
> > > space. This allows user space to control the IOMMU mappings for its
> > > devices.
> > 
> > Having now looked more closely at the ARM requirements it seems we
> > will need generic events, not just page fault events to have a
> > complete emulation.
> 
> Can you elaborate?

There are many events related to objects in guest memory or controlled
by the guest, eg C_BAD_CD and C_BAD_STE. These should be relayed or
the emulation is not working well.

> > > User space indicates its capability of handling IO page faults by
> > > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > > hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> > > for page fault delivery. On a successful return of HWPT allocation, the
> > > user can retrieve and respond to page faults by reading and writing to
> > > the file descriptor (FD) returned in out_fault_fd.
> > 
> > This is the right way to approach it, and more broadly this shouldn't
> > be an iommufd specific thing. Kernel drivers will also need to create
> > fault capable PAGING iommu domains.
> 
> Are you suggesting a common interface used by both iommufd and
> kernel drivers?

Yes
 
> but I didn't get the last piece. If those domains are created by kernel
> drivers why would they require a uAPI for userspace to specify fault
> capable?

Not to userspace, but a kapi to request a fault capable domain and to
supply the fault handler. Eg:

 iommu_domain_alloc_faultable(dev, handler);
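
For illustration, a fault-capable driver might then look roughly like
this (a sketch only; neither the function nor the handler signature
exists today):

	static enum iommu_page_response_code
	my_iopf_handler(struct iopf_group *group, void *cookie)
	{
		/* Resolve the fault, e.g. map the backing page,
		 * then report the result back to the device. */
		return IOMMU_PAGE_RESP_SUCCESS;
	}

	domain = iommu_domain_alloc_faultable(dev, my_iopf_handler);
	if (IS_ERR(domain))
		return PTR_ERR(domain);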

Jason
Tian, Kevin Nov. 8, 2023, 8:53 a.m. UTC | #4
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Wednesday, November 8, 2023 1:54 AM
> 
> On Tue, Nov 07, 2023 at 08:35:10AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > Sent: Thursday, November 2, 2023 8:48 PM
> > >
> > > On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > > > Hi folks,
> > > >
> > > > This series implements the functionality of delivering IO page faults to
> > > > user space through the IOMMUFD framework for nested translation.
> > > Nested
> > > > translation is a hardware feature that supports two-stage translation
> > > > tables for IOMMU. The second-stage translation table is managed by
> the
> > > > host VMM, while the first-stage translation table is owned by user
> > > > space. This allows user space to control the IOMMU mappings for its
> > > > devices.
> > >
> > > Having now looked more closely at the ARM requirements it seems we
> > > will need generic events, not just page fault events to have a
> > > complete emulation.
> >
> > Can you elaborate?
> 
> There are many events related to objects in guest memory or controlled
> by the guest, eg C_BAD_CD and C_BAD_STE. These should be relayed or
> the emulation is not working well.

so that's the category of unrecoverable faults?

btw I can understand C_BAD_CD given it's walked by the physical SMMU
in a nested configuration. But presumably the STE is created by the smmu
driver itself, so why would there be an error to relay for the guest STE?

> 
> > > > User space indicates its capability of handling IO page faults by
> > > > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > > > hardware page table (HWPT). IOMMUFD will then set up its
> infrastructure
> > > > for page fault delivery. On a successful return of HWPT allocation, the
> > > > user can retrieve and respond to page faults by reading and writing to
> > > > the file descriptor (FD) returned in out_fault_fd.
> > >
> > > This is the right way to approach it, and more broadly this shouldn't
> > > be an iommufd specific thing. Kernel drivers will also need to create
> > > fault capable PAGING iommu domains.
> >
> > Are you suggesting a common interface used by both iommufd and
> > kernel drivers?
> 
> Yes
> 
> > but I didn't get the last piece. If those domains are created by kernel
> > drivers why would they require a uAPI for userspace to specify fault
> > capable?
> 
> Not to userspace, but a kapi to request a fault capable domain and to
> supply the fault handler. Eg:
> 
>  iommu_domain_alloc_faultable(dev, handler);
> 

Does it affect SVA too?
Jason Gunthorpe Nov. 8, 2023, 5:39 p.m. UTC | #5
On Wed, Nov 08, 2023 at 08:53:00AM +0000, Tian, Kevin wrote:

> > There are many events related to objects in guest memory or controlled
> > by the guest, eg C_BAD_CD and C_BAD_STE. These should be relayed or
> > the emulation is not working well.
> 
> so that's the category of unrecoverable faults?

I haven't looked exhaustively but I do have the impression that the
only recoverable fault is the 'page not present' one.

> btw I can understand C_BAD_CD given it's walked by the physical SMMU
> in a nested configuration. But presumably the STE is created by the smmu
> driver itself, so why would there be an error to relay for the guest
> STE?

If the guest programs a bad STE it should still generate a C_BAD_STE
even if the mediation SW could theoretically sanitize it (but sanitize
it to what? BLOCKED?). Since we have to forward things like C_BAD_CD
and others we may as well just drop an invalid STE and forward the
event like real HW.

> > > but I didn't get the last piece. If those domains are created by kernel
> > > drivers why would they require a uAPI for userspace to specify fault
> > > capable?
> > 
> > Not to userspace, but a kapi to request a fault capable domain and to
> > supply the fault handler. Eg:
> > 
> >  iommu_domain_alloc_faultable(dev, handler);
> 
> Does it affect SVA too?

Inside the driver the SVA should be constructed out of the same fault
handling infrastructure, but a SVA domain allocation should have a
different allocation function.

Jason
Jason Gunthorpe Nov. 21, 2023, 12:14 a.m. UTC | #6
On Thu, Nov 16, 2023 at 09:42:23AM +0800, Liu, Jing2 wrote:
> Hi Jason,
> 
> On 11/15/2023 9:58 PM, Jason Gunthorpe wrote:
> > On Wed, Nov 15, 2023 at 01:17:06PM +0800, Liu, Jing2 wrote:
> > 
> > > This is the right way to approach it,
> > > 
> > >     I learned that there was discussion about using io_uring to get the
> > >     page fault without eventfd notification in [1]. I am new to
> > >     io_uring and am studying the liburing man pages, but I have
> > >     questions about how QEMU can receive incoming page faults with
> > >     good performance.
> > > 
> > >     Since neither QEMU nor the kernel knows when a fault will arrive,
> > >     after QEMU submits a read task to io_uring we want the kernel to
> > >     block until a fault comes. But based on hwpt_fault_fops_read() in
> > >     [patch v2 4/6], the read just returns 0 when there is no fault, so
> > >     that round of read completes to the CQ, which is not what we want.
> > >     So I'm wondering how the kernel can keep the read pending until a
> > >     fault arrives. Does the fops callback need special work to
> > Implement a fops with poll support that triggers when a new event is
> > pushed and everything will be fine.
> 
> Does userspace also need to set up a POLL flag to make io_uring poll, or
> does io_uring always try to poll?

io_uring can trigger poll and use other approaches; it is flexible, and
the driver can scale this in different ways.
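
As a rough sketch of such a poll-capable fops (assuming a wait queue
is added to this series' fault struct; this is not the actual patch
code):

	static __poll_t hwpt_fault_fops_poll(struct file *filep,
					     struct poll_table_struct *wait)
	{
		struct hw_pgtable_fault *fault = filep->private_data;
		__poll_t pollflags = 0;

		/* 'wait_queue' is a hypothetical field, woken by the
		 * iopf handler when it queues a new group. */
		poll_wait(filep, &fault->wait_queue, wait);
		mutex_lock(&fault->mutex);
		if (!list_empty(&fault->deliver))
			pollflags = EPOLLIN | EPOLLRDNORM;
		mutex_unlock(&fault->mutex);

		return pollflags;
	}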

Jason
Shameerali Kolothum Thodi Nov. 29, 2023, 9:08 a.m. UTC | #7
> -----Original Message-----
> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
> Sent: 26 October 2023 03:49
> To: Jason Gunthorpe <jgg@ziepe.ca>; Kevin Tian <kevin.tian@intel.com>;
> Joerg Roedel <joro@8bytes.org>; Will Deacon <will@kernel.org>; Robin
> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
> <jean-philippe@linaro.org>; Nicolin Chen <nicolinc@nvidia.com>; Yi Liu
> <yi.l.liu@intel.com>; Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: iommu@lists.linux.dev; linux-kselftest@vger.kernel.org;
> virtualization@lists.linux-foundation.org; linux-kernel@vger.kernel.org; Lu
> Baolu <baolu.lu@linux.intel.com>
> Subject: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space
> 
> Hi folks,
> 
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.
> 
> When an IO page fault occurs on the first-stage translation table, the
> IOMMU hardware can deliver the page fault to user space through the
> IOMMUFD framework. User space can then handle the page fault and
> respond
> to the device top-down through the IOMMUFD. This allows user space to
> implement its own IO page fault handling policies.
> 
> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.
> 
> The iommu selftest framework has been updated to test the IO page fault
> delivery and response functionality.
> 
> This series is based on the latest implementation of nested translation
> under discussion [1] and the page fault handling framework refactoring in
> the IOMMU core [2].
> 
> The series and related patches are available on GitHub: [3]
> 
> [1] https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
> [2] https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
> [3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2

Hi Baolu,

Do you have a corresponding Qemu git to share? I could give it a spin on our ARM 
platform. Please let me know.

Thanks,
Shameer
Baolu Lu Nov. 30, 2023, 3:44 a.m. UTC | #8
On 2023/11/29 17:08, Shameerali Kolothum Thodi wrote:
> 
> 
>> -----Original Message-----
>> From: Lu Baolu [mailto:baolu.lu@linux.intel.com]
>> Sent: 26 October 2023 03:49
>> To: Jason Gunthorpe <jgg@ziepe.ca>; Kevin Tian <kevin.tian@intel.com>;
>> Joerg Roedel <joro@8bytes.org>; Will Deacon <will@kernel.org>; Robin
>> Murphy <robin.murphy@arm.com>; Jean-Philippe Brucker
>> <jean-philippe@linaro.org>; Nicolin Chen <nicolinc@nvidia.com>; Yi Liu
>> <yi.l.liu@intel.com>; Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Cc: iommu@lists.linux.dev; linux-kselftest@vger.kernel.org;
>> virtualization@lists.linux-foundation.org; linux-kernel@vger.kernel.org; Lu
>> Baolu <baolu.lu@linux.intel.com>
>> Subject: [PATCH v2 0/6] IOMMUFD: Deliver IO page faults to user space
>>
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework for nested translation. Nested
>> translation is a hardware feature that supports two-stage translation
>> tables for IOMMU. The second-stage translation table is managed by the
>> host VMM, while the first-stage translation table is owned by user
>> space. This allows user space to control the IOMMU mappings for its
>> devices.
>>
>> When an IO page fault occurs on the first-stage translation table, the
>> IOMMU hardware can deliver the page fault to user space through the
>> IOMMUFD framework. User space can then handle the page fault and
>> respond
>> to the device top-down through the IOMMUFD. This allows user space to
>> implement its own IO page fault handling policies.
>>
>> User space indicates its capability of handling IO page faults by
>> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
>> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
>> for page fault delivery. On a successful return of HWPT allocation, the
>> user can retrieve and respond to page faults by reading and writing to
>> the file descriptor (FD) returned in out_fault_fd.
>>
>> The iommu selftest framework has been updated to test the IO page fault
>> delivery and response functionality.
>>
>> This series is based on the latest implementation of nested translation
>> under discussion [1] and the page fault handling framework refactoring in
>> the IOMMU core [2].
>>
>> The series and related patches are available on GitHub: [3]
>>
>> [1] https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
>> [2] https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
>> [3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
> 
> Hi Baolu,

Hi Shameer,

> 
> Do you have a corresponding Qemu git to share? I could give it a spin on our ARM
> platform. Please let me know.

This version of the series is tested by the iommufd selftest. We are in
the process of developing the QEMU code. I will provide the repo link
once it is complete.

Best regards,
baolu
Jason Gunthorpe Dec. 1, 2023, 2:24 p.m. UTC | #9
On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> Hi folks,
> 
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.
> 
> When an IO page fault occurs on the first-stage translation table, the
> IOMMU hardware can deliver the page fault to user space through the
> IOMMUFD framework. User space can then handle the page fault and respond
> to the device top-down through the IOMMUFD. This allows user space to
> implement its own IO page fault handling policies.
> 
> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.

This is probably backwards, userspace should allocate the FD with a
dedicated ioctl and provide it during domain allocation.

If the userspace wants a fd per domain then it should do that. If it
wants to share fds between domains that should work too.
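
In user space that direction might look roughly like this (every name
below is hypothetical; no such ioctl or flag exists in this series):

	struct iommu_fault_alloc fcmd = { .size = sizeof(fcmd) };

	/* Hypothetical dedicated ioctl returning a fault fd/object */
	ioctl(iommufd, IOMMU_FAULT_QUEUE_ALLOC, &fcmd);

	struct iommu_hwpt_alloc hcmd = {
		.size = sizeof(hcmd),
		.flags = IOMMU_HWPT_FAULT_ID_VALID,	/* hypothetical */
		.dev_id = dev_id,
		.pt_id = parent_id,
		.fault_id = fcmd.out_fault_id,	/* shareable across HWPTs */
	};
	ioctl(iommufd, IOMMU_HWPT_ALLOC, &hcmd);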

Jason
Jason Gunthorpe Dec. 1, 2023, 2:38 p.m. UTC | #10
On Thu, Oct 26, 2023 at 10:49:25AM +0800, Lu Baolu wrote:

> +void *iopf_pasid_cookie_get(struct device *dev, ioasid_t pasid)
> +{
> +	struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
> +	void *curr;
> +
> +	if (!iopf_param)
> +		return ERR_PTR(-ENODEV);
> +
> +	xa_lock(&iopf_param->pasid_cookie);
> +	curr = xa_load(&iopf_param->pasid_cookie, pasid);
> +	xa_unlock(&iopf_param->pasid_cookie);

No need for this locking, the caller has to provide some kind of
locking to protect the returned pointer.

I'm not sure how this can work really..

What iommufd wants is to increment the device object refcount under
this xa_lock.

I'm not sure this is the right arrangement: Basically you want to
have a cookie per domain attachment for iopf domains that is forwarded
to the handler.

So maybe this entire thing is not quite right: instead of having a
generic iopf attached to the domain, the iopf should be supplied at
domain attach time? Something like:

iommu_domain_attach_iopf(struct iommu_domain *, struct device *,
                         ioasid_t pasid, struct iopf *, void *cookie);

The per-attach cookie would be passed to the iopf function
automatically by the infrastructure.

Detach would have the necessary locking to ensure that no handler is
running across detach.

Then the cookie is logically placed in the API and properly protected
with natural locking we already need.
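
With that shape the handler receives the per-attach cookie directly,
e.g. (illustrative only):

	static int iommufd_iopf_handler(struct iopf_group *group, void *cookie)
	{
		/* 'cookie' is whatever was passed at attach time, here
		 * the iommufd device; no xarray lookup needed. */
		struct iommufd_device *idev = cookie;
		...
	}

	rc = iommu_domain_attach_iopf(hwpt->domain, idev->dev,
				      IOMMU_NO_PASID, &iommufd_iopf, idev);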

Jason
Joel Granados Dec. 4, 2023, 3:07 p.m. UTC | #11
On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> Hi folks,
> 
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
Does this mean the IOPF_CAPABLE HWPT needs to be parented by a HWPT
created with IOMMU_HWPT_ALLOC_NEST_PARENT set?

> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.
> 
> When an IO page fault occurs on the first-stage translation table, the
> IOMMU hardware can deliver the page fault to user space through the
> IOMMUFD framework. User space can then handle the page fault and respond
> to the device top-down through the IOMMUFD. This allows user space to
> implement its own IO page fault handling policies.
> 
> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.
> 
> The iommu selftest framework has been updated to test the IO page fault
> delivery and response functionality.
> 
> This series is based on the latest implementation of nested translation
> under discussion [1] and the page fault handling framework refactoring in
> the IOMMU core [2].
> 
> The series and related patches are available on GitHub: [3]
> 
> [1] https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
> [2] https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
> [3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
> 
> Best regards,
> baolu
> 
> Change log:
> v2:
>  - Move all iommu refactoring patches into a separate series and discuss
>    it in a different thread. The latest patch series [v6] is available at
>    https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
>  - We discussed the timeout of the pending page fault messages. We
>    agreed that we shouldn't apply any timeout policy for the page fault
>    handling in user space.
>    https://lore.kernel.org/linux-iommu/20230616113232.GA84678@myrica/
>  - Jason suggested that we adopt a simple file descriptor interface for
>    reading and responding to I/O page requests, so that user space
>    applications can improve performance using io_uring.
>    https://lore.kernel.org/linux-iommu/ZJWjD1ajeem6pK3I@ziepe.ca/
> 
> v1: https://lore.kernel.org/linux-iommu/20230530053724.232765-1-baolu.lu@linux.intel.com/
> 
> Lu Baolu (6):
>   iommu: Add iommu page fault cookie helpers
>   iommufd: Add iommu page fault uapi data
>   iommufd: Initializing and releasing IO page fault data
>   iommufd: Deliver fault messages to user space
>   iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_IOPF test support
>   iommufd/selftest: Add coverage for IOMMU_TEST_OP_TRIGGER_IOPF
> 
>  include/linux/iommu.h                         |   9 +
>  drivers/iommu/iommu-priv.h                    |  15 +
>  drivers/iommu/iommufd/iommufd_private.h       |  12 +
>  drivers/iommu/iommufd/iommufd_test.h          |   8 +
>  include/uapi/linux/iommufd.h                  |  65 +++++
>  tools/testing/selftests/iommu/iommufd_utils.h |  66 ++++-
>  drivers/iommu/io-pgfault.c                    |  50 ++++
>  drivers/iommu/iommufd/device.c                |  69 ++++-
>  drivers/iommu/iommufd/hw_pagetable.c          | 260 +++++++++++++++++-
>  drivers/iommu/iommufd/selftest.c              |  56 ++++
>  tools/testing/selftests/iommu/iommufd.c       |  24 +-
>  .../selftests/iommu/iommufd_fail_nth.c        |   2 +-
>  12 files changed, 620 insertions(+), 16 deletions(-)
> 
> -- 
> 2.34.1
>
Jason Gunthorpe Dec. 4, 2023, 3:32 p.m. UTC | #12
On Mon, Dec 04, 2023 at 04:07:44PM +0100, Joel Granados wrote:
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > Hi folks,
> > 
> > This series implements the functionality of delivering IO page faults to
> > user space through the IOMMUFD framework for nested translation. Nested
> Does this mean the IOPF_CAPABLE HWPT needs to be parented by a HWPT
> created with IOMMU_HWPT_ALLOC_NEST_PARENT set?

I would expect no. Both a nested and un-nested configuration should
work.

Jason
Baolu Lu Dec. 8, 2023, 5:10 a.m. UTC | #13
On 12/4/23 11:07 PM, Joel Granados wrote:
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework for nested translation. Nested
> Does this mean the IOPF_CAPABLE HWPT needs to be parented by a HWPT
> created with IOMMU_HWPT_ALLOC_NEST_PARENT set?

No. It's generic; nested translation is simply a use case that is
currently feasible.

Best regards,
baolu
Baolu Lu Dec. 8, 2023, 5:57 a.m. UTC | #14
On 12/1/23 10:24 PM, Jason Gunthorpe wrote:
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework for nested translation. Nested
>> translation is a hardware feature that supports two-stage translation
>> tables for IOMMU. The second-stage translation table is managed by the
>> host VMM, while the first-stage translation table is owned by user
>> space. This allows user space to control the IOMMU mappings for its
>> devices.
>>
>> When an IO page fault occurs on the first-stage translation table, the
>> IOMMU hardware can deliver the page fault to user space through the
>> IOMMUFD framework. User space can then handle the page fault and respond
>> to the device top-down through the IOMMUFD. This allows user space to
>> implement its own IO page fault handling policies.
>>
>> User space indicates its capability of handling IO page faults by
>> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
>> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
>> for page fault delivery. On a successful return of HWPT allocation, the
>> user can retrieve and respond to page faults by reading and writing to
>> the file descriptor (FD) returned in out_fault_fd.
> 
> This is probably backwards, userspace should allocate the FD with a
> dedicated ioctl and provide it during domain allocation.

Introducing a dedicated fault FD for fault handling seems promising. It
decouples the fault handling from any specific domain. I suppose we need
different fault fds for recoverable faults (a.k.a. IO page faults) and
unrecoverable faults. Do I understand you correctly?


> If the userspace wants a fd per domain then it should do that. If it
> wants to share fds between domains that should work too.

Yes, it's more flexible. The fault message contains the hwpt obj id, so
user space can recognize the hwpt on which the fault happened.

Best regards,
baolu
Jason Gunthorpe Dec. 8, 2023, 1:43 p.m. UTC | #15
On Fri, Dec 08, 2023 at 01:57:26PM +0800, Baolu Lu wrote:
> On 12/1/23 10:24 PM, Jason Gunthorpe wrote:
> > On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> > > Hi folks,
> > > 
> > > This series implements the functionality of delivering IO page faults to
> > > user space through the IOMMUFD framework for nested translation. Nested
> > > translation is a hardware feature that supports two-stage translation
> > > tables for IOMMU. The second-stage translation table is managed by the
> > > host VMM, while the first-stage translation table is owned by user
> > > space. This allows user space to control the IOMMU mappings for its
> > > devices.
> > > 
> > > When an IO page fault occurs on the first-stage translation table, the
> > > IOMMU hardware can deliver the page fault to user space through the
> > > IOMMUFD framework. User space can then handle the page fault and respond
> > > to the device top-down through the IOMMUFD. This allows user space to
> > > implement its own IO page fault handling policies.
> > > 
> > > User space indicates its capability of handling IO page faults by
> > > setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> > > hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> > > for page fault delivery. On a successful return of HWPT allocation, the
> > > user can retrieve and respond to page faults by reading and writing to
> > > the file descriptor (FD) returned in out_fault_fd.
> > 
> > This is probably backwards, userspace should allocate the FD with a
> > dedicated ioctl and provide it during domain allocation.
> 
> Introducing a dedicated fault FD for fault handling seems promising. It
> decouples the fault handling from any specific domain. I suppose we need
> different fault fds for recoverable faults (a.k.a. IO page faults) and
> unrecoverable faults. Do I understand you correctly?

I haven't thought that far ahead :) Once you have a generic fault FD
concept it can be sliced in different ways. If there is a technical
need to separate recoverable/unrecoverable then the FD flavour should
be specified during FD creation. Otherwise just let userspace do
whatever it wants.

Jason
Joel Granados Dec. 12, 2023, 1:10 p.m. UTC | #16
On Thu, Oct 26, 2023 at 10:49:27AM +0800, Lu Baolu wrote:
> Add some housekeeping code for IO page fault delivery. Add a fault field
> in the iommufd_hw_pagetable structure to store pending IO page faults and
> other related data.
> 
> The fault field is allocated and initialized when an IOPF-capable user
> HWPT is allocated. It is indicated by the IOMMU_HWPT_ALLOC_IOPF_CAPABLE
> flag being set in the allocation user data. The fault field exists until
> the HWPT is destroyed. This also means that you can determine whether a
> HWPT is IOPF-capable by checking the fault field.
> 
> When an IOPF-capable HWPT is attached to a device (could also be a PASID of
> a device in the future), the iommufd device pointer is saved for the pasid
> of the device. The pointer is recalled and all pending iopf groups are
> discarded after the HWPT is detached from the device.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
>  include/linux/iommu.h                   |  6 +++
>  drivers/iommu/iommufd/iommufd_private.h | 10 ++++
>  drivers/iommu/iommufd/device.c          | 69 +++++++++++++++++++++++--
>  drivers/iommu/iommufd/hw_pagetable.c    | 56 +++++++++++++++++++-
>  4 files changed, 137 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 615d8a5f9dee..600ca3842c8a 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -130,6 +130,12 @@ struct iopf_group {
>  	struct work_struct work;
>  	struct device *dev;
>  	struct iommu_domain *domain;
> +
> +	/*
> +	 * Used by iopf handlers, like iommufd, to hook the iopf group
> +	 * on its own lists.
> +	 */
> +	struct list_head node;
>  };
>  
>  /**
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 1bd412cff2d6..0dbaa2dc5b22 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -230,6 +230,15 @@ int iommufd_option_rlimit_mode(struct iommu_option *cmd,
>  
>  int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
>  
> +struct hw_pgtable_fault {
> +	struct iommufd_ctx *ictx;
> +	struct iommufd_hw_pagetable *hwpt;
> +	/* Protect below iopf lists. */
> +	struct mutex mutex;
> +	struct list_head deliver;
> +	struct list_head response;
> +};
> +
>  /*
>   * A HW pagetable is called an iommu_domain inside the kernel. This user object
>   * allows directly creating and inspecting the domains. Domains that have kernel
> @@ -239,6 +248,7 @@ int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
>  struct iommufd_hw_pagetable {
>  	struct iommufd_object obj;
>  	struct iommu_domain *domain;
> +	struct hw_pgtable_fault *fault;
>  
>  	void (*abort)(struct iommufd_object *obj);
>  	void (*destroy)(struct iommufd_object *obj);
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 645ab5d290fe..0a8e03d5e7c5 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -456,6 +456,16 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
>  	if (rc)
>  		goto err_unlock;
>  
> +	if (hwpt->fault) {
> +		void *curr;
> +
> +		curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
I'm hitting an error here when I try to attach to a hwpt that I created
previously with the `IOMMU_HWPT_ALLOC_IOPF_CAPABLE` flag.

I get an -ENODEV from iopf_pasid_cookie_set which is triggered by
dev->iommu->fault_param being 0x0.

I looked around and I see that the fault param gets set in
iopf_queue_add_device which is called from iommu_dev_enable_feature
only. Furthermore iommu_dev_enable_feature is only called in idxd and
uacce drivers.

Questions:
1. Should iopf_queue_add_device get called from the
   IOMMU_HWPT_ALLOC_IOPF_CAPABLE ioctl call? This makes sense to me as
   this is where the device and the IOPF are related from user space.
2. This is not intended to work only with idxd and uacce. right?

Best
> +		if (IS_ERR(curr)) {
> +			rc = PTR_ERR(curr);
> +			goto err_unresv;
> +		}
> +	}
> +
>  	/*
>  	 * Only attach to the group once for the first device that is in the
>  	 * group. All the other devices will follow this attachment. The user
> @@ -466,17 +476,20 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
>  	if (list_empty(&idev->igroup->device_list)) {
>  		rc = iommufd_group_setup_msi(idev->igroup, hwpt);
>  		if (rc)
> -			goto err_unresv;
> +			goto err_unset;
>  
>  		rc = iommu_attach_group(hwpt->domain, idev->igroup->group);
>  		if (rc)
> -			goto err_unresv;
> +			goto err_unset;
>  		idev->igroup->hwpt = hwpt;
>  	}
>  	refcount_inc(&hwpt->obj.users);
>  	list_add_tail(&idev->group_item, &idev->igroup->device_list);
>  	mutex_unlock(&idev->igroup->lock);
>  	return 0;
> +err_unset:
> +	if (hwpt->fault)
> +		iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
>  err_unresv:
>  	iommufd_device_remove_rr(idev, hwpt);
>  err_unlock:
> @@ -484,6 +497,30 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
>  	return rc;
>  }
>  
> +/*
> + * Discard all pending page faults. Called when a hw pagetable is detached
> + * from a device. The iommu core guarantees that all page faults have been
> + * responded, hence there's no need to respond it again.
> + */
> +static void iommufd_hw_pagetable_discard_iopf(struct iommufd_hw_pagetable *hwpt)
> +{
> +	struct iopf_group *group, *next;
> +
> +	if (!hwpt->fault)
> +		return;
> +
> +	mutex_lock(&hwpt->fault->mutex);
> +	list_for_each_entry_safe(group, next, &hwpt->fault->deliver, node) {
> +		list_del(&group->node);
> +		iopf_free_group(group);
> +	}
> +	list_for_each_entry_safe(group, next, &hwpt->fault->response, node) {
> +		list_del(&group->node);
> +		iopf_free_group(group);
> +	}
> +	mutex_unlock(&hwpt->fault->mutex);
> +}
> +
>  struct iommufd_hw_pagetable *
>  iommufd_hw_pagetable_detach(struct iommufd_device *idev)
>  {
> @@ -491,6 +528,8 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev)
>  
>  	mutex_lock(&idev->igroup->lock);
>  	list_del(&idev->group_item);
> +	if (hwpt->fault)
> +		iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
>  	if (list_empty(&idev->igroup->device_list)) {
>  		iommu_detach_group(hwpt->domain, idev->igroup->group);
>  		idev->igroup->hwpt = NULL;
> @@ -498,6 +537,8 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev)
>  	iommufd_device_remove_rr(idev, hwpt);
>  	mutex_unlock(&idev->igroup->lock);
>  
> +	iommufd_hw_pagetable_discard_iopf(hwpt);
> +
>  	/* Caller must destroy hwpt */
>  	return hwpt;
>  }
> @@ -563,9 +604,24 @@ iommufd_device_do_replace(struct iommufd_device *idev,
>  	if (rc)
>  		goto err_unresv;
>  
> +	if (old_hwpt->fault) {
> +		iommufd_hw_pagetable_discard_iopf(old_hwpt);
> +		iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
> +	}
> +
> +	if (hwpt->fault) {
> +		void *curr;
> +
> +		curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
> +		if (IS_ERR(curr)) {
> +			rc = PTR_ERR(curr);
> +			goto err_unresv;
> +		}
> +	}
> +
>  	rc = iommu_group_replace_domain(igroup->group, hwpt->domain);
>  	if (rc)
> -		goto err_unresv;
> +		goto err_unset;
>  
>  	if (iommufd_hw_pagetable_compare_ioas(old_hwpt, hwpt)) {
>  		list_for_each_entry(cur, &igroup->device_list, group_item)
> @@ -583,8 +639,15 @@ iommufd_device_do_replace(struct iommufd_device *idev,
>  					      &old_hwpt->obj.users));
>  	mutex_unlock(&idev->igroup->lock);
>  
> +	iommufd_hw_pagetable_discard_iopf(old_hwpt);
> +
>  	/* Caller must destroy old_hwpt */
>  	return old_hwpt;
> +err_unset:
> +	if (hwpt->fault)
> +		iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, NULL);
> +	if (old_hwpt->fault)
> +		iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
>  err_unresv:
>  	if (iommufd_hw_pagetable_compare_ioas(old_hwpt, hwpt)) {
>  		list_for_each_entry(cur, &igroup->device_list, group_item)
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index 72c46de1396b..9f94c824cf86 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -38,9 +38,38 @@ static void iommufd_kernel_managed_hwpt_destroy(struct iommufd_object *obj)
>  	refcount_dec(&hwpt->ioas->obj.users);
>  }
>  
> +static struct hw_pgtable_fault *hw_pagetable_fault_alloc(void)
> +{
> +	struct hw_pgtable_fault *fault;
> +
> +	fault = kzalloc(sizeof(*fault), GFP_KERNEL);
> +	if (!fault)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&fault->deliver);
> +	INIT_LIST_HEAD(&fault->response);
> +	mutex_init(&fault->mutex);
> +
> +	return fault;
> +}
> +
> +static void hw_pagetable_fault_free(struct hw_pgtable_fault *fault)
> +{
> +	WARN_ON(!list_empty(&fault->deliver));
> +	WARN_ON(!list_empty(&fault->response));
> +
> +	kfree(fault);
> +}
> +
>  void iommufd_hw_pagetable_destroy(struct iommufd_object *obj)
>  {
> -	container_of(obj, struct iommufd_hw_pagetable, obj)->destroy(obj);
> +	struct iommufd_hw_pagetable *hwpt =
> +		container_of(obj, struct iommufd_hw_pagetable, obj);
> +
> +	if (hwpt->fault)
> +		hw_pagetable_fault_free(hwpt->fault);
> +
> +	hwpt->destroy(obj);
>  }
>  
>  static void iommufd_user_managed_hwpt_abort(struct iommufd_object *obj)
> @@ -289,6 +318,17 @@ iommufd_hw_pagetable_alloc(struct iommufd_ctx *ictx,
>  	return ERR_PTR(rc);
>  }
>  
> +static int iommufd_hw_pagetable_iopf_handler(struct iopf_group *group)
> +{
> +	struct iommufd_hw_pagetable *hwpt = group->domain->fault_data;
> +
> +	mutex_lock(&hwpt->fault->mutex);
> +	list_add_tail(&group->node, &hwpt->fault->deliver);
> +	mutex_unlock(&hwpt->fault->mutex);
> +
> +	return 0;
> +}
> +
>  int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
>  {
>  	struct iommufd_hw_pagetable *(*alloc_fn)(
> @@ -364,6 +404,20 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
>  		goto out_unlock;
>  	}
>  
> +	if (cmd->flags & IOMMU_HWPT_ALLOC_IOPF_CAPABLE) {
> +		hwpt->fault = hw_pagetable_fault_alloc();
> +		if (IS_ERR(hwpt->fault)) {
> +			rc = PTR_ERR(hwpt->fault);
> +			hwpt->fault = NULL;
> +			goto out_hwpt;
> +		}
> +
> +		hwpt->fault->ictx = ucmd->ictx;
> +		hwpt->fault->hwpt = hwpt;
> +		hwpt->domain->iopf_handler = iommufd_hw_pagetable_iopf_handler;
> +		hwpt->domain->fault_data = hwpt;
> +	}
> +
>  	cmd->out_hwpt_id = hwpt->obj.id;
>  	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
>  	if (rc)
> -- 
> 2.34.1
>
Jason Gunthorpe Dec. 12, 2023, 2:12 p.m. UTC | #17
On Tue, Dec 12, 2023 at 02:10:08PM +0100, Joel Granados wrote:

> > diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> > index 645ab5d290fe..0a8e03d5e7c5 100644
> > --- a/drivers/iommu/iommufd/device.c
> > +++ b/drivers/iommu/iommufd/device.c
> > @@ -456,6 +456,16 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> >  	if (rc)
> >  		goto err_unlock;
> >  
> > +	if (hwpt->fault) {
> > +		void *curr;
> > +
> > +		curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
> I'm hitting an error here when I try to attach to a hwpt that I created
> previously with the `IOMMU_HWPT_ALLOC_IOPF_CAPABLE` flag.
> 
> I get an -ENODEV from iopf_pasid_cookie_set which is triggered by
> dev->iommu->fault_param being 0x0.
> 
> I looked around and I see that the fault param gets set in
> iopf_queue_add_device which is called from iommu_dev_enable_feature
> only. Furthermore iommu_dev_enable_feature is only called in idxd and
> uacce drivers.
> 
> Questions:
> 1. Should iopf_queue_add_device get called from the
>    IOMMU_HWPT_ALLOC_IOPF_CAPABLE ioctl call? This makes sense to me as
>    this is where the device and the IOPF are related from user space.

It probably needs to call the set feature thing in the short term.

In the medium term I would like the drivers to manage the iopf based
on domain attachment not explicit feature asks
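
Something along these lines in the attach path should unblock your
test in the meantime (a sketch, not part of this series):

	/* Enable IOPF before the IOPF-capable attach so that
	 * dev->iommu->fault_param gets populated. */
	if (hwpt->fault) {
		rc = iommu_dev_enable_feature(idev->dev, IOMMU_DEV_FEAT_IOPF);
		if (rc)
			return rc;
	}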

> 2. This is not intended to work only with idxd and uacce. right?

It should work everywhere, I suspect Intel Team didn't hit this
because they are testing IDXD SIOV? Can you guys also test it as a PF
assignment?

Jason
Baolu Lu Dec. 13, 2023, 2:04 a.m. UTC | #18
On 12/12/23 10:12 PM, Jason Gunthorpe wrote:
> On Tue, Dec 12, 2023 at 02:10:08PM +0100, Joel Granados wrote:
> 
>>> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
>>> index 645ab5d290fe..0a8e03d5e7c5 100644
>>> --- a/drivers/iommu/iommufd/device.c
>>> +++ b/drivers/iommu/iommufd/device.c
>>> @@ -456,6 +456,16 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
>>>   	if (rc)
>>>   		goto err_unlock;
>>>   
>>> +	if (hwpt->fault) {
>>> +		void *curr;
>>> +
>>> +		curr = iopf_pasid_cookie_set(idev->dev, IOMMU_NO_PASID, idev);
>> I'm hitting an error here when I try to attach to a hwpt that I created
>> previously with the `IOMMU_HWPT_ALLOC_IOPF_CAPABLE` flag.
>>
>> I get an -ENODEV from iopf_pasid_cookie_set which is triggered by
>> dev->iommu->fault_param being 0x0.
>>
>> I looked around and I see that the fault param gets set in
>> iopf_queue_add_device which is called from iommu_dev_enable_feature
>> only. Furthermore iommu_dev_enable_feature is only called in idxd and
>> uacce drivers.
>>
>> Questions:
>> 1. Should iopf_queue_add_device get called from the
>>     IOMMU_HWPT_ALLOC_IOPF_CAPABLE ioctl call? This makes sense to me as
>>     this is where the device and the IOPF are related from user space.
> It probably needs to call the set feature thing in the short term.
> 
> In the medium term I would like the drivers to manage the iopf based
> on domain attachment not explicit feature asks

Yes, it's the same as my plan.

> 
>> 2. This is not intended to work only with idxd and uacce. right?
> It should work everywhere, I suspect Intel Team didn't hit this
> because they are testing IDXD SIOV?

Yes.

> Can you guys also test it as a PF
> assignment?

For PF assignment, probably the driver (vfio-pci) needs to enable iopf.

Best regards,
baolu
Tian, Kevin Dec. 13, 2023, 2:15 a.m. UTC | #19
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Wednesday, December 13, 2023 10:05 AM
> >
> >> 2. This is not intended to work only with idxd and uacce. right?
> > It should work everywhere, I suspect Intel Team didn't hit this
> > because they are testing IDXD SIOV?
> 
> Yes.
> 
> > Can you guys also test it as a PF
> > assignment?
> 
> For PF assignment, probably the driver (vfio-pci) needs to enable iopf.
> 

We haven't merged anything for SIOV yet.

so the base of this series should be PCI functions (PF or VF) and vfio-pci
has to be extended with whatever is required to support iopf.
Jason Gunthorpe Dec. 13, 2023, 1:19 p.m. UTC | #20
On Wed, Dec 13, 2023 at 02:15:28AM +0000, Tian, Kevin wrote:
> > From: Baolu Lu <baolu.lu@linux.intel.com>
> > Sent: Wednesday, December 13, 2023 10:05 AM
> > >
> > >> 2. This is not intended to work only with idxd and uacce. right?
> > > It should work everywhere, I suspect Intel Team didn't hit this
> > > because they are testing IDXD SIOV?
> > 
> > Yes.
> > 
> > > Can you guys also test it as a PF
> > > assignment?
> > 
> > For PF assignment, probably the driver (vfio-pci) needs to enable iopf.
> > 
> 
> We haven't merged anything for SIOV yet.
> 
> so the base of this series should be PCI functions (PF or VF) and vfio-pci
> has to be extended with whatever is required to support iopf.

Right. I suggest you target full idxd device assignment to a guest
with working PRI/etc as a validation.

Jason
Joel Granados Jan. 12, 2024, 9:56 p.m. UTC | #21
On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> Hi folks,
> 
> This series implements the functionality of delivering IO page faults to
> user space through the IOMMUFD framework for nested translation. Nested
> translation is a hardware feature that supports two-stage translation
> tables for IOMMU. The second-stage translation table is managed by the
> host VMM, while the first-stage translation table is owned by user
> space. This allows user space to control the IOMMU mappings for its
> devices.
> 
> When an IO page fault occurs on the first-stage translation table, the
> IOMMU hardware can deliver the page fault to user space through the
> IOMMUFD framework. User space can then handle the page fault and respond
> to the device top-down through the IOMMUFD. This allows user space to
> implement its own IO page fault handling policies.
> 
> User space indicates its capability of handling IO page faults by
> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> for page fault delivery. On a successful return of HWPT allocation, the
> user can retrieve and respond to page faults by reading and writing to
> the file descriptor (FD) returned in out_fault_fd.
> 
> The iommu selftest framework has been updated to test the IO page fault
> delivery and response functionality.
> 
> This series is based on the latest implementation of nested translation
> under discussion [1] and the page fault handling framework refactoring in
> the IOMMU core [2].
> 
> The series and related patches are available on GitHub: [3]
> 
> [1] https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
> [2] https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
> [3] https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
I was working with this branch that included Yi Liu's
wip/iommufd_nesting branch. Now Yi Liu has updated his work in this post
https://lore.kernel.org/all/20240102143834.146165-1-yi.l.liu@intel.com.
Is there an updated version of the page fault work that is rebased on
top of Liu's new version?

Thx in advance

Best

> 
> Best regards,
> baolu
> 
> Change log:
> v2:
>  - Move all iommu refactoring patches into a separate series and discuss
>    it in a different thread. The latest patch series [v6] is available at
>    https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
>  - We discussed the timeout of the pending page fault messages. We
>    agreed that we shouldn't apply any timeout policy for the page fault
>    handling in user space.
>    https://lore.kernel.org/linux-iommu/20230616113232.GA84678@myrica/
>  - Jason suggested that we adopt a simple file descriptor interface for
>    reading and responding to I/O page requests, so that user space
>    applications can improve performance using io_uring.
>    https://lore.kernel.org/linux-iommu/ZJWjD1ajeem6pK3I@ziepe.ca/
> 
> v1: https://lore.kernel.org/linux-iommu/20230530053724.232765-1-baolu.lu@linux.intel.com/
> 
> Lu Baolu (6):
>   iommu: Add iommu page fault cookie helpers
>   iommufd: Add iommu page fault uapi data
>   iommufd: Initializing and releasing IO page fault data
>   iommufd: Deliver fault messages to user space
>   iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_IOPF test support
>   iommufd/selftest: Add coverage for IOMMU_TEST_OP_TRIGGER_IOPF
> 
>  include/linux/iommu.h                         |   9 +
>  drivers/iommu/iommu-priv.h                    |  15 +
>  drivers/iommu/iommufd/iommufd_private.h       |  12 +
>  drivers/iommu/iommufd/iommufd_test.h          |   8 +
>  include/uapi/linux/iommufd.h                  |  65 +++++
>  tools/testing/selftests/iommu/iommufd_utils.h |  66 ++++-
>  drivers/iommu/io-pgfault.c                    |  50 ++++
>  drivers/iommu/iommufd/device.c                |  69 ++++-
>  drivers/iommu/iommufd/hw_pagetable.c          | 260 +++++++++++++++++-
>  drivers/iommu/iommufd/selftest.c              |  56 ++++
>  tools/testing/selftests/iommu/iommufd.c       |  24 +-
>  .../selftests/iommu/iommufd_fail_nth.c        |   2 +-
>  12 files changed, 620 insertions(+), 16 deletions(-)
> 
> -- 
> 2.34.1
>
Baolu Lu Jan. 14, 2024, 1:13 p.m. UTC | #22
On 2024/1/13 5:56, Joel Granados wrote:
> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework for nested translation. Nested
>> translation is a hardware feature that supports two-stage translation
>> tables for IOMMU. The second-stage translation table is managed by the
>> host VMM, while the first-stage translation table is owned by user
>> space. This allows user space to control the IOMMU mappings for its
>> devices.
>>
>> When an IO page fault occurs on the first-stage translation table, the
>> IOMMU hardware can deliver the page fault to user space through the
>> IOMMUFD framework. User space can then handle the page fault and respond
>> to the device top-down through the IOMMUFD. This allows user space to
>> implement its own IO page fault handling policies.
>>
>> User space indicates its capability of handling IO page faults by
>> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
>> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
>> for page fault delivery. On a successful return of HWPT allocation, the
>> user can retrieve and respond to page faults by reading and writing to
>> the file descriptor (FD) returned in out_fault_fd.
>>
>> The iommu selftest framework has been updated to test the IO page fault
>> delivery and response functionality.
>>
>> This series is based on the latest implementation of nested translation
>> under discussion [1] and the page fault handling framework refactoring in
>> the IOMMU core [2].
>>
>> The series and related patches are available on GitHub: [3]
>>
>> [1]https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
>> [2]https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
>> [3]https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
> I was working with this branch that included Yi Liu's
> wip/iommufd_nesting branch. Now Yi Liu has updated his work in this post
> https://lore.kernel.org/all/20240102143834.146165-1-yi.l.liu@intel.com.
> Is there an updated version of the page fault work that is rebased on
> top of Liu's new version?

Yes. I am preparing the new version and will post it for discussion
after the merge window.

Best regards,
baolu
Joel Granados Jan. 14, 2024, 5:18 p.m. UTC | #23
On Sun, Jan 14, 2024 at 09:13:19PM +0800, Baolu Lu wrote:
> On 2024/1/13 5:56, Joel Granados wrote:
> > On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
> >> Hi folks,
> >>
> >> This series implements the functionality of delivering IO page faults to
> >> user space through the IOMMUFD framework for nested translation. Nested
> >> translation is a hardware feature that supports two-stage translation
> >> tables for IOMMU. The second-stage translation table is managed by the
> >> host VMM, while the first-stage translation table is owned by user
> >> space. This allows user space to control the IOMMU mappings for its
> >> devices.
> >>
> >> When an IO page fault occurs on the first-stage translation table, the
> >> IOMMU hardware can deliver the page fault to user space through the
> >> IOMMUFD framework. User space can then handle the page fault and respond
> >> to the device top-down through the IOMMUFD. This allows user space to
> >> implement its own IO page fault handling policies.
> >>
> >> User space indicates its capability of handling IO page faults by
> >> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
> >> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
> >> for page fault delivery. On a successful return of HWPT allocation, the
> >> user can retrieve and respond to page faults by reading and writing to
> >> the file descriptor (FD) returned in out_fault_fd.
> >>
> >> The iommu selftest framework has been updated to test the IO page fault
> >> delivery and response functionality.
> >>
> >> This series is based on the latest implementation of nested translation
> >> under discussion [1] and the page fault handling framework refactoring in
> >> the IOMMU core [2].
> >>
> >> The series and related patches are available on GitHub: [3]
> >>
> >> [1]https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
> >> [2]https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
> >> [3]https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
> > I was working with this branch that included Yi Liu's
> > wip/iommufd_nesting branch. Now Yi Liu has updated his work in this post
> > https://lore.kernel.org/all/20240102143834.146165-1-yi.l.liu@intel.com.
> > Is there an updated version of the page fault work that is rebased on
> > top of Liu's new version?
> 
> Yes. I am preparing the new version and will post it for discussion
> after the merge window.
Great to hear and thx for getting back to me.

I'll be on the lookout for your post. Would it be possible for you to
add me to the CC when you send it?

Best
Baolu Lu Jan. 15, 2024, 1:25 a.m. UTC | #24
On 1/15/24 1:18 AM, Joel Granados wrote:
> On Sun, Jan 14, 2024 at 09:13:19PM +0800, Baolu Lu wrote:
>> On 2024/1/13 5:56, Joel Granados wrote:
>>> On Thu, Oct 26, 2023 at 10:49:24AM +0800, Lu Baolu wrote:
>>>> Hi folks,
>>>>
>>>> This series implements the functionality of delivering IO page faults to
>>>> user space through the IOMMUFD framework for nested translation. Nested
>>>> translation is a hardware feature that supports two-stage translation
>>>> tables for IOMMU. The second-stage translation table is managed by the
>>>> host VMM, while the first-stage translation table is owned by user
>>>> space. This allows user space to control the IOMMU mappings for its
>>>> devices.
>>>>
>>>> When an IO page fault occurs on the first-stage translation table, the
>>>> IOMMU hardware can deliver the page fault to user space through the
>>>> IOMMUFD framework. User space can then handle the page fault and respond
>>>> to the device top-down through the IOMMUFD. This allows user space to
>>>> implement its own IO page fault handling policies.
>>>>
>>>> User space indicates its capability of handling IO page faults by
>>>> setting the IOMMU_HWPT_ALLOC_IOPF_CAPABLE flag when allocating a
>>>> hardware page table (HWPT). IOMMUFD will then set up its infrastructure
>>>> for page fault delivery. On a successful return of HWPT allocation, the
>>>> user can retrieve and respond to page faults by reading and writing to
>>>> the file descriptor (FD) returned in out_fault_fd.
>>>>
>>>> The iommu selftest framework has been updated to test the IO page fault
>>>> delivery and response functionality.
>>>>
>>>> This series is based on the latest implementation of nested translation
>>>> under discussion [1] and the page fault handling framework refactoring in
>>>> the IOMMU core [2].
>>>>
>>>> The series and related patches are available on GitHub: [3]
>>>>
>>>> [1]https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
>>>> [2]https://lore.kernel.org/linux-iommu/20230928042734.16134-1-baolu.lu@linux.intel.com/
>>>> [3]https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-v2
>>> I was working with this branch that included Yi Liu's
>>> wip/iommufd_nesting branch. Now Yi Liu has updated his work in this post
>>> https://lore.kernel.org/all/20240102143834.146165-1-yi.l.liu@intel.com.
>>> Is there an updated version of the page fault work that is rebased on
>>> top of Liu's new version?
>> Yes. I am preparing the new version and will post it for discussion
>> after the merge window.
> Great to hear and thx for getting back to me.
> 
> I'll be on the look out for your post. Would it be possible for you to
> add me to the CC when you send it?

Sure.

Best regards,
baolu