
[v6,10/25] iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl

Message ID 7dfb002613f224f57a069d27e7bf2b306b0a5ba0.1749884998.git.nicolinc@nvidia.com
State New
Series iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE)

Commit Message

Nicolin Chen June 14, 2025, 7:14 a.m. UTC
Introduce a new IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl for user space to allocate
a HW QUEUE object for a vIOMMU-specific HW-accelerated queue, e.g.:
 - NVIDIA's Virtual Command Queue
 - AMD vIOMMU's Command Buffer, Event Log Buffers, and PPR Log Buffers

Since this is introduced for NVIDIA's VCMDQs, which access the guest memory
in the physical address space, add an iommufd_hw_queue_alloc_phys() helper
that creates an access object to the queue memory in the IOAS. This prevents
the mappings of the guest memory from being unmapped during the life cycle
of the HW queue object.
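
For reference, a minimal userspace invocation of the new ioctl might look
like the sketch below. This is a hypothetical helper, assuming the uapi
header from this series is installed and that the vIOMMU object and the
IOAS mapping backing the guest queue already exist; the queue type value
is illustrative:

	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>

	/* Hypothetical helper: allocate one HW queue on an existing vIOMMU */
	static int hw_queue_alloc(int iommufd, __u32 viommu_id, __u32 type,
				  __u32 index, __u64 queue_iova, __u64 queue_len,
				  __u32 *out_id)
	{
		struct iommu_hw_queue_alloc cmd = {
			.size = sizeof(cmd),
			.viommu_id = viommu_id,
			.type = type,		/* a vendor-specific queue type */
			.index = index,		/* logical index for this type */
			.nesting_parent_iova = queue_iova, /* guest PA of the queue */
			.length = queue_len,
		};

		if (ioctl(iommufd, IOMMU_HW_QUEUE_ALLOC, &cmd))
			return -errno;
		*out_id = cmd.out_hw_queue_id;
		return 0;
	}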

Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/iommufd_private.h |   2 +
 include/linux/iommufd.h                 |   1 +
 include/uapi/linux/iommufd.h            |  33 +++++
 drivers/iommu/iommufd/main.c            |   6 +
 drivers/iommu/iommufd/viommu.c          | 184 ++++++++++++++++++++++++
 5 files changed, 226 insertions(+)

Comments

Baolu Lu June 16, 2025, 6:54 a.m. UTC | #1
On 6/16/25 14:47, Nicolin Chen wrote:
> On Mon, Jun 16, 2025 at 02:12:04PM +0800, Baolu Lu wrote:
>> On 6/14/25 15:14, Nicolin Chen wrote:
>>> +	if (!viommu->ops || !viommu->ops->get_hw_queue_size ||
>>> +	    !viommu->ops->hw_queue_init_phys) {
>>> +		rc = -EOPNOTSUPP;
>>> +		goto out_put_viommu;
>>> +	}
> Hmm, here it does abort when !viommu->ops->hw_queue_init_phys ..
> 
>>> +	/*
>>> +	 * FIXME once ops->hw_queue_init is introduced, this should check "if
>>> +	 * ops->hw_queue_init_phys". And "access" should be initialized to NULL.
>>> +	 */
>> I just don't follow here. Up until now, only viommu->ops->
>> hw_queue_init_phys has been added, which means the current code only
>> supports hardware queues that access guest memory using physical
>> addresses. The access object is not needed for the other type of
>> hardware queue that uses guest IOVA.
>>
>> So, why not just abort here if ops->hw_queue_init_phys is not supported
>> by the IOMMU driver?
> .. so, it already does.
> 
>> Leave the other logic to the patches that introduce
>> ops->hw_queue_init? I guess that would make this patch more readable.
> The patch is written to support the hw_queue_init_phys case only.
> It just adds some extra FIXMEs as guidelines for the future patch
> adding the hw_queue_init op.
> 
> I personally don't feel these are confusing. Maybe you can help
> point out what specific wording feels odd here? Maybe "FIXME"s
> should be "TODO"s?

Oh, I probably misunderstood the logic here. For both hw_queue_init and
hw_queue_init_phys, using an access object to pin the pages for hardware
access is necessary, right? My understanding was that pinning pages is
only required for hw_queue_init_phys.

> 
> I could also drop all of these guidelines if they are considered
> unnecessary.
> 
> Thanks
> Nicolin

Thanks,
baolu
Baolu Lu June 16, 2025, 7:09 a.m. UTC | #2
On 6/16/25 15:04, Nicolin Chen wrote:
> On Mon, Jun 16, 2025 at 02:54:10PM +0800, Baolu Lu wrote:
>> On 6/16/25 14:47, Nicolin Chen wrote:
>>> On Mon, Jun 16, 2025 at 02:12:04PM +0800, Baolu Lu wrote:
>>>> On 6/14/25 15:14, Nicolin Chen wrote:
>>>>> +	if (!viommu->ops || !viommu->ops->get_hw_queue_size ||
>>>>> +	    !viommu->ops->hw_queue_init_phys) {
>>>>> +		rc = -EOPNOTSUPP;
>>>>> +		goto out_put_viommu;
>>>>> +	}
>>> Hmm, here it does abort when !viommu->ops->hw_queue_init_phys ..
>>>
>>>>> +	/*
>>>>> +	 * FIXME once ops->hw_queue_init is introduced, this should check "if
>>>>> +	 * ops->hw_queue_init_phys". And "access" should be initialized to NULL.
>>>>> +	 */
>>>> I just don't follow here. Up until now, only viommu->ops->
>>>> hw_queue_init_phys has been added, which means the current code only
>>>> supports hardware queues that access guest memory using physical
>>>> addresses. The access object is not needed for the other type of
>>>> hardware queue that uses guest IOVA.
>>>>
>>>> So, why not just abort here if ops->hw_queue_init_phys is not supported
>>>> by the IOMMU driver?
>>> .. so, it already does.
>>>
>>>> Leave the other logic to the patches that introduce
>>>> ops->hw_queue_init? I guess that would make this patch more readable.
>>> The patch is written to support the hw_queue_init_phys case only.
>>> It just adds some extra FIXMEs as guidelines for the future patch
>>> adding the hw_queue_init op.
>>>
>>> I personally don't feel these are confusing. Maybe you can help
>>> point out what specific wording feels odd here? Maybe "FIXME"s
>>> should be "TODO"s?
>> Oh, I probably misunderstood the logic here. For both hw_queue_init and
>> hw_queue_init_phys, using an access object to pin the pages for hardware
>> access is necessary, right? My understanding was that pinning pages is
>> only required for hw_queue_init_phys.
> No. The access is only used by the ops->hw_queue_init_phys case.
> 
> The ops->hw_queue_init case will use the cmd->nesting_parent_iova
> directly without calling iommufd_hw_queue_alloc_phys().
> 
> This FIXME means that, when adding ops->hw_queue_init, add this:
> 
> -	struct iommufd_access *access;
> +	struct iommufd_access *access = NULL;
> ...
> -	access = iommufd_hw_queue_alloc_phys(cmd, viommu, &base_pa);
> +	if (ops->hw_queue_init_phys) {
> +		access = iommufd_hw_queue_alloc_phys(cmd, viommu, &base_pa);
> 
> Also, the other FIXME guideline states that these two ops should be
> mutually exclusive. So, add this too:
> +	if (WARN_ON_ONCE(ops->hw_queue_init &&
> +			 ops->hw_queue_init_phys)) {
> +		rc = -EOPNOTSUPP;

Okay, above matches my understanding. Thanks for the explanation.

Thanks,
baolu
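
Putting the two FIXME guidelines from the exchange above together, a rough
sketch of what the allocation path might look like once an ops->hw_queue_init
op is added (hypothetical future code, not part of this patch):

	struct iommufd_access *access = NULL;
	...
	/* The two init ops are expected to be mutually exclusive */
	if (WARN_ON_ONCE(viommu->ops->hw_queue_init &&
			 viommu->ops->hw_queue_init_phys)) {
		rc = -EOPNOTSUPP;
		goto out_put_viommu;
	}
	...
	/* Only the physical-address case needs an access to pin the pages */
	if (viommu->ops->hw_queue_init_phys) {
		access = iommufd_hw_queue_alloc_phys(cmd, viommu, &base_pa);
		if (IS_ERR(access)) {
			rc = PTR_ERR(access);
			goto out_put_viommu;
		}
	}
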
Jason Gunthorpe June 16, 2025, 1:58 p.m. UTC | #3
On Sat, Jun 14, 2025 at 12:14:35AM -0700, Nicolin Chen wrote:
> +	/*
> +	 * FIXME allocation may fail when sizeof(*pages) * max_npages is
> +	 * larger than PAGE_SIZE. This might need a new API returning a
> +	 * bio_vec or something more efficient.
> +	 */
> +	pages = kcalloc(max_npages, sizeof(*pages), GFP_KERNEL);

Use the kvcalloc variation here then. You probably also need
__GFP_NOWARN to avoid syzkaller blowups.

> +	access = iommufd_hw_queue_alloc_phys(cmd, viommu, &base_pa);
> +	if (IS_ERR(access)) {
> +		rc = PTR_ERR(access);
> +		goto out_put_viommu;
> +	}
> +
> +	hw_queue = (struct iommufd_hw_queue *)_iommufd_object_alloc_ucmd(
> +		ucmd, hw_queue_size, IOMMUFD_OBJ_HW_QUEUE);
> +	if (IS_ERR(hw_queue)) {
> +		rc = PTR_ERR(hw_queue);
> +		goto out_destroy_access;
> +	}

I think these two are out of order: alloc the object first, then
do the access and set hw_queue->access. Make sure abort will clean it up
automatically when non-NULL, and remove the out_destroy_access label.

Jason
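
For context, the two suggestions above would translate into roughly the
following sketch (an approximation of the requested changes, not the final
code). First, in iommufd_hw_queue_alloc_phys():

	/* kvcalloc falls back to vmalloc when the array exceeds PAGE_SIZE;
	 * __GFP_NOWARN avoids allocation-failure splats for huge requests */
	pages = kvcalloc(max_npages, sizeof(*pages), GFP_KERNEL | __GFP_NOWARN);
	if (!pages)
		return ERR_PTR(-ENOMEM);
	/* ... and pair with kvfree(pages) on the exit paths ... */

and for the reordering in the ioctl handler:

	/* Allocate the object first, so a later failure goes through abort,
	 * which can clean up a non-NULL hw_queue->access automatically */
	hw_queue = (struct iommufd_hw_queue *)_iommufd_object_alloc_ucmd(
		ucmd, hw_queue_size, IOMMUFD_OBJ_HW_QUEUE);
	if (IS_ERR(hw_queue)) {
		rc = PTR_ERR(hw_queue);
		goto out_put_viommu;
	}

	access = iommufd_hw_queue_alloc_phys(cmd, viommu, &base_pa);
	if (IS_ERR(access)) {
		rc = PTR_ERR(access);
		goto out_put_viommu;
	}
	hw_queue->access = access;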

Patch

diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index e6b1eb2ab375..1bb1c0764bc2 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -637,6 +637,8 @@  int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
 void iommufd_viommu_destroy(struct iommufd_object *obj);
 int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd);
 void iommufd_vdevice_destroy(struct iommufd_object *obj);
+int iommufd_hw_queue_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_hw_queue_destroy(struct iommufd_object *obj);
 
 #ifdef CONFIG_IOMMUFD_TEST
 int iommufd_test(struct iommufd_ucmd *ucmd);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index d6aa2a5eea9c..b38c6631ad85 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -119,6 +119,7 @@  struct iommufd_hw_queue {
 	struct iommufd_object obj;
 	struct iommufd_ctx *ictx;
 	struct iommufd_viommu *viommu;
+	struct iommufd_access *access;
 
 	u64 base_addr; /* in guest physical address space */
 	size_t length;
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 971aa19c0ba1..f091ea072c5f 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -56,6 +56,7 @@  enum {
 	IOMMUFD_CMD_VDEVICE_ALLOC = 0x91,
 	IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
 	IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
+	IOMMUFD_CMD_HW_QUEUE_ALLOC = 0x94,
 };
 
 /**
@@ -1156,4 +1157,36 @@  enum iommu_hw_queue_type {
 	IOMMU_HW_QUEUE_TYPE_DEFAULT = 0,
 };
 
+/**
+ * struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
+ * @size: sizeof(struct iommu_hw_queue_alloc)
+ * @flags: Must be 0
+ * @viommu_id: Virtual IOMMU ID to associate the HW queue with
+ * @type: One of enum iommu_hw_queue_type
+ * @index: The logical index to the HW queue per virtual IOMMU for a multi-queue
+ *         model
+ * @out_hw_queue_id: The ID of the new HW queue
+ * @nesting_parent_iova: Base address of the queue memory in the guest physical
+ *                       address space
+ * @length: Length of the queue memory
+ *
+ * Allocate a HW queue object for a vIOMMU-specific HW-accelerated queue, which
+ * allows HW to access the guest queue memory described by @nesting_parent_iova
+ * and @length.
+ *
+ * A vIOMMU can allocate multiple queues, but it must use a different @index per
+ * type to separate each allocation, e.g.
+ *     Type1 HW queue0, Type1 HW queue1, Type2 HW queue0, ...
+ */
+struct iommu_hw_queue_alloc {
+	__u32 size;
+	__u32 flags;
+	__u32 viommu_id;
+	__u32 type;
+	__u32 index;
+	__u32 out_hw_queue_id;
+	__aligned_u64 nesting_parent_iova;
+	__aligned_u64 length;
+};
+#define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC)
 #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 778694d7c207..4e8dbbfac890 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -354,6 +354,7 @@  union ucmd_buffer {
 	struct iommu_destroy destroy;
 	struct iommu_fault_alloc fault;
 	struct iommu_hw_info info;
+	struct iommu_hw_queue_alloc hw_queue;
 	struct iommu_hwpt_alloc hwpt;
 	struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap;
 	struct iommu_hwpt_invalidate cache;
@@ -396,6 +397,8 @@  static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 struct iommu_fault_alloc, out_fault_fd),
 	IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct iommu_hw_info,
 		 __reserved),
+	IOCTL_OP(IOMMU_HW_QUEUE_ALLOC, iommufd_hw_queue_alloc_ioctl,
+		 struct iommu_hw_queue_alloc, length),
 	IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
 		 __reserved),
 	IOCTL_OP(IOMMU_HWPT_GET_DIRTY_BITMAP, iommufd_hwpt_get_dirty_bitmap,
@@ -559,6 +562,9 @@  static const struct iommufd_object_ops iommufd_object_ops[] = {
 	[IOMMUFD_OBJ_FAULT] = {
 		.destroy = iommufd_fault_destroy,
 	},
+	[IOMMUFD_OBJ_HW_QUEUE] = {
+		.destroy = iommufd_hw_queue_destroy,
+	},
 	[IOMMUFD_OBJ_HWPT_PAGING] = {
 		.destroy = iommufd_hwpt_paging_destroy,
 		.abort = iommufd_hwpt_paging_abort,
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 28ea5d026222..506479ece826 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -201,3 +201,187 @@  int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(ucmd->ictx, &viommu->obj);
 	return rc;
 }
+
+static void iommufd_hw_queue_destroy_access(struct iommufd_ctx *ictx,
+					    struct iommufd_access *access,
+					    u64 base_iova, size_t length)
+{
+	iommufd_access_unpin_pages(access, base_iova, length);
+	iommufd_access_detach_internal(access);
+	iommufd_access_destroy_internal(ictx, access);
+}
+
+void iommufd_hw_queue_destroy(struct iommufd_object *obj)
+{
+	struct iommufd_hw_queue *hw_queue =
+		container_of(obj, struct iommufd_hw_queue, obj);
+	struct iommufd_viommu *viommu = hw_queue->viommu;
+
+	if (hw_queue->destroy)
+		hw_queue->destroy(hw_queue);
+	if (hw_queue->access)
+		iommufd_hw_queue_destroy_access(viommu->ictx, hw_queue->access,
+						hw_queue->base_addr,
+						hw_queue->length);
+	refcount_dec(&viommu->obj.users);
+}
+
+/*
+ * When the HW accesses the guest queue via physical addresses, the underlying
+ * physical pages of the guest queue must be contiguous. Also, for the security
+ * concern that IOMMUFD_CMD_IOAS_UNMAP could potentially remove the mappings of
+ * the guest queue from the nesting parent iopt while the HW is still accessing
+ * the guest queue memory physically, such a HW queue must require an access to
+ * pin the underlying pages and prevent that from happening.
+ */
+static struct iommufd_access *
+iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd,
+			    struct iommufd_viommu *viommu, phys_addr_t *base_pa)
+{
+	struct iommufd_access *access;
+	struct page **pages;
+	int max_npages, i;
+	u64 offset;
+	int rc;
+
+	offset =
+		cmd->nesting_parent_iova - PAGE_ALIGN(cmd->nesting_parent_iova);
+	max_npages = DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE);
+
+	/*
+	 * FIXME allocation may fail when sizeof(*pages) * max_npages is
+	 * larger than PAGE_SIZE. This might need a new API returning a
+	 * bio_vec or something more efficient.
+	 */
+	pages = kcalloc(max_npages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
+
+	access = iommufd_access_create_internal(viommu->ictx);
+	if (IS_ERR(access)) {
+		rc = PTR_ERR(access);
+		goto out_free;
+	}
+
+	rc = iommufd_access_attach_internal(access, viommu->hwpt->ioas);
+	if (rc)
+		goto out_destroy;
+
+	rc = iommufd_access_pin_pages(access, cmd->nesting_parent_iova,
+				      cmd->length, pages, 0);
+	if (rc)
+		goto out_detach;
+
+	/* Validate if the underlying physical pages are contiguous */
+	for (i = 1; i < max_npages; i++) {
+		if (page_to_pfn(pages[i]) == page_to_pfn(pages[i - 1]) + 1)
+			continue;
+		rc = -EFAULT;
+		goto out_unpin;
+	}
+
+	*base_pa = page_to_pfn(pages[0]) << PAGE_SHIFT;
+	kfree(pages);
+	return access;
+
+out_unpin:
+	iommufd_access_unpin_pages(access, cmd->nesting_parent_iova,
+				   cmd->length);
+out_detach:
+	iommufd_access_detach_internal(access);
+out_destroy:
+	iommufd_access_destroy_internal(viommu->ictx, access);
+out_free:
+	kfree(pages);
+	return ERR_PTR(rc);
+}
+
+int iommufd_hw_queue_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_hw_queue_alloc *cmd = ucmd->cmd;
+	struct iommufd_hw_queue *hw_queue;
+	struct iommufd_viommu *viommu;
+	struct iommufd_access *access;
+	size_t hw_queue_size;
+	phys_addr_t base_pa;
+	u64 last;
+	int rc;
+
+	if (cmd->flags || cmd->type == IOMMU_HW_QUEUE_TYPE_DEFAULT)
+		return -EOPNOTSUPP;
+	if (!cmd->length)
+		return -EINVAL;
+	if (check_add_overflow(cmd->nesting_parent_iova, cmd->length - 1,
+			       &last))
+		return -EOVERFLOW;
+
+	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+	if (IS_ERR(viommu))
+		return PTR_ERR(viommu);
+
+	if (!viommu->ops || !viommu->ops->get_hw_queue_size ||
+	    !viommu->ops->hw_queue_init_phys) {
+		rc = -EOPNOTSUPP;
+		goto out_put_viommu;
+	}
+
+	/*
+	 * FIXME once ops->hw_queue_init is introduced, a WARN_ON_ONCE will be
+	 * required, if hw_queue_init and hw_queue_init_phys both exist, since
+	 * they should be mutually exclusive
+	 */
+
+	hw_queue_size = viommu->ops->get_hw_queue_size(viommu, cmd->type);
+	if (!hw_queue_size) {
+		rc = -EOPNOTSUPP;
+		goto out_put_viommu;
+	}
+
+	/*
+	 * It is a driver bug to provide a hw_queue_size smaller than the
+	 * core HW queue structure size
+	 */
+	if (WARN_ON_ONCE(hw_queue_size < sizeof(*hw_queue))) {
+		rc = -EOPNOTSUPP;
+		goto out_put_viommu;
+	}
+
+	/*
+	 * FIXME once ops->hw_queue_init is introduced, this should check "if
+	 * ops->hw_queue_init_phys". And "access" should be initialized to NULL.
+	 */
+	access = iommufd_hw_queue_alloc_phys(cmd, viommu, &base_pa);
+	if (IS_ERR(access)) {
+		rc = PTR_ERR(access);
+		goto out_put_viommu;
+	}
+
+	hw_queue = (struct iommufd_hw_queue *)_iommufd_object_alloc_ucmd(
+		ucmd, hw_queue_size, IOMMUFD_OBJ_HW_QUEUE);
+	if (IS_ERR(hw_queue)) {
+		rc = PTR_ERR(hw_queue);
+		goto out_destroy_access;
+	}
+
+	hw_queue->viommu = viommu;
+	refcount_inc(&viommu->obj.users);
+	hw_queue->access = access;
+	hw_queue->type = cmd->type;
+	hw_queue->length = cmd->length;
+	hw_queue->base_addr = cmd->nesting_parent_iova;
+
+	rc = viommu->ops->hw_queue_init_phys(hw_queue, cmd->index, base_pa);
+	if (rc)
+		goto out_put_viommu;
+
+	cmd->out_hw_queue_id = hw_queue->obj.id;
+	rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+	goto out_put_viommu;
+
+out_destroy_access:
+	iommufd_hw_queue_destroy_access(ucmd->ictx, access,
+					cmd->nesting_parent_iova, cmd->length);
+out_put_viommu:
+	iommufd_put_object(ucmd->ictx, &viommu->obj);
+	return rc;
+}