[v2,0/8] Initial support for SMMUv3 nested translation

Message ID 0-v2-621370057090+91fec-smmuv3_nesting_jgg@nvidia.com

Message

Jason Gunthorpe Aug. 27, 2024, 3:51 p.m. UTC
This brings support for the IOMMUFD ioctls:

 - IOMMU_GET_HW_INFO
 - IOMMU_HWPT_ALLOC_NEST_PARENT
 - IOMMU_DOMAIN_NESTED
 - ops->enforce_cache_coherency()
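
For orientation, a rough userspace sketch of that ioctl flow is below. It
assumes the uapi this series adds to include/uapi/linux/iommufd.h (struct
iommu_hw_info, IOMMU_HWPT_ALLOC_NEST_PARENT, and struct iommu_hwpt_arm_smmuv3
carrying the vSTE); exact names and fields may differ in this posting, so
treat it as a sketch rather than the final API:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Returns the nested HWPT id, or -1 on error */
static int alloc_nested_hwpt(int iommufd, uint32_t dev_id, uint32_t ioas_id,
			     const uint64_t vste[2])
{
	struct iommu_hw_info_arm_smmuv3 smmu_info = {};
	struct iommu_hw_info info = {
		.size = sizeof(info),
		.dev_id = dev_id,
		.data_len = sizeof(smmu_info),
		.data_uptr = (uintptr_t)&smmu_info,
	};
	struct iommu_hwpt_arm_smmuv3 nested_data = {};
	struct iommu_hwpt_alloc s2 = {
		.size = sizeof(s2),
		.flags = IOMMU_HWPT_ALLOC_NEST_PARENT,
		.dev_id = dev_id,
		.pt_id = ioas_id,
	};
	struct iommu_hwpt_alloc nested = {
		.size = sizeof(nested),
		.dev_id = dev_id,
		.data_type = IOMMU_HWPT_DATA_ARM_SMMUV3,
		.data_len = sizeof(nested_data),
		.data_uptr = (uintptr_t)&nested_data,
	};

	/* Read the SMMUv3 IDR registers to build the vSMMU model */
	if (ioctl(iommufd, IOMMU_GET_HW_INFO, &info))
		return -1;

	/* S2 "nesting parent" over the IOAS holding guest physical memory */
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &s2))
		return -1;

	/* Nested domain carrying the guest's STE fragment, parented to the S2 */
	memcpy(nested_data.ste, vste, sizeof(nested_data.ste));
	nested.pt_id = s2.out_hwpt_id;
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &nested))
		return -1;

	return nested.out_hwpt_id;
}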

This is quite straightforward as the nested STE can just be built in the
special NESTED domain op and fed through the generic update machinery.

The design allows the user-provided STE fragment to control several
aspects of the translation, including putting the STE into a "virtual
bypass" or an aborting state. This duplicates functionality available by
other means, but it allows trivially preserving the VMID in the STE as we
eventually move towards the VIOMMU owning the VMID.
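
To make those two states concrete, here is an illustrative sketch of the two
64-bit vSTE words userspace would hand in. It assumes only the architectural
STE layout from the SMMUv3 spec (V in bit 0, Config in bits [3:1]); the macro
and function names are made up for the example and are not part of this
series:

#include <stdint.h>

#define VSTE_V			(1ULL << 0)
#define VSTE_CFG(x)		((uint64_t)(x) << 1)
#define VSTE_CFG_ABORT		0x0	/* terminate all DMA */
#define VSTE_CFG_BYPASS		0x4	/* stage 1 bypass ("virtual bypass") */
#define VSTE_CFG_S1_TRANS	0x5	/* stage 1 translate via the guest CD table */

/* Guest wants DMA to pass straight through to the S2 mapping */
static void vste_virtual_bypass(uint64_t ste[2])
{
	ste[0] = VSTE_V | VSTE_CFG(VSTE_CFG_BYPASS);
	ste[1] = 0;
}

/* Guest wants all DMA terminated; v2 also maps a non-valid vSTE to this */
static void vste_abort(uint64_t ste[2])
{
	ste[0] = VSTE_V | VSTE_CFG(VSTE_CFG_ABORT);
	ste[1] = 0;
}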

Nesting support requires the system to support either S2FWB or the
stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
cache and view incoherent data; currently VFIO lacks any cache flushing
that would make this safe.

Yan has a series to add some of the needed infrastructure for VFIO cache
flushing here:

 https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/

That series may someday allow relaxing this requirement further.

Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and is superseded
by this series.

This is the first series in what will be several to complete nesting
support. At least:
 - IOMMU_RESV_SW_MSI related fixups
    https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
 - VIOMMU object support to allow ATS and CD invalidations
    https://lore.kernel.org/linux-iommu/cover.1723061377.git.nicolinc@nvidia.com/
 - vCMDQ hypervisor support for direct invalidation queue assignment
    https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
 - KVM pinned VMID using VIOMMU for vBTM
    https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
 - Cross instance S2 sharing
 - Virtual Machine Structure using VIOMMU (for vMPAM?)
 - Fault forwarding support through IOMMUFD's fault fd for vSVA

The VIOMMU series is essential to allow the invalidations to be processed
for the CD as well.

It is enough to allow the QEMU work to progress.

This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting

v2:
 - Revise commit messages
 - Guard S2FWB support with ARM_SMMU_FEAT_COHERENCY, since it doesn't make
   sense to use S2FWB to enforce coherency on inherently non-coherent hardware.
 - Add missing IO_PGTABLE_QUIRK_ARM_S2FWB validation
 - Include formal ACPICA commit for IORT built using
   generate/linux/gen-patch.sh
 - Use FEAT_NESTING to block creating a NESTING_PARENT
 - Use an abort STE instead of non-valid if the user requests a non-valid
   vSTE
 - Consistently use 'nest_parent' for naming variables
 - Use the right domain for arm_smmu_remove_master_domain() when it
   removes the master
 - Join bitfields together
 - Drop arm_smmu_cache_invalidate_user patch, invalidation will
   exclusively go via viommu
v1: https://patch.msgid.link/r/0-v1-54e734311a7f+14f72-smmuv3_nesting_jgg@nvidia.com

Jason Gunthorpe (5):
  vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  iommu/arm-smmu-v3: Use S2FWB when available
  iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED

Nicolin Chen (3):
  ACPICA: IORT: Update for revision E.f
  ACPI/IORT: Support CANWBS memory access flag
  iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct
    arm_smmu_hw_info

 drivers/acpi/arm64/iort.c                   |  13 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 314 ++++++++++++++++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  26 ++
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |  16 -
 drivers/iommu/io-pgtable-arm.c              |  27 +-
 drivers/iommu/iommu.c                       |  10 -
 drivers/iommu/iommufd/vfio_compat.c         |   7 +-
 drivers/vfio/vfio_iommu_type1.c             |  12 +-
 include/acpi/actbl2.h                       |   3 +-
 include/linux/io-pgtable.h                  |   2 +
 include/linux/iommu.h                       |   5 +-
 include/uapi/linux/iommufd.h                |  55 ++++
 include/uapi/linux/vfio.h                   |   2 +-
 13 files changed, 415 insertions(+), 77 deletions(-)


base-commit: e5e288d94186b266b062b3e44c82c285dfe68712

Comments

Nicolin Chen Aug. 27, 2024, 8:12 p.m. UTC | #1
On Tue, Aug 27, 2024 at 12:51:35PM -0300, Jason Gunthorpe wrote:
> HW with CANWBS is always cache coherent and ignores PCI No Snoop requests
> as well. This meets the requirement for IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
> so let's return it.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

With two very ignorable nits:

> @@ -2693,6 +2718,15 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
>  		 * one of them.
>  		 */
>  		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> +		if (smmu_domain->enforce_cache_coherency &&
> +		    !(dev_iommu_fwspec_get(master->dev)->flags &
> +		      IOMMU_FWSPEC_PCI_RC_CANWBS)) {

How about a small dev_enforce_cache_coherency() helper?

> +			kfree(master_domain);
> +			spin_unlock_irqrestore(&smmu_domain->devices_lock,
> +					       flags);
> +			return -EINVAL;

kfree() doesn't need to be locked.

Thanks
Nicolin
Shameerali Kolothum Thodi Aug. 28, 2024, 4:31 p.m. UTC | #2
Hi Nicolin,

> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, August 27, 2024 10:31 PM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: acpica-devel@lists.linux.dev; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
> kvm@vger.kernel.org; Len Brown <lenb@kernel.org>; linux-
> acpi@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Lorenzo Pieralisi
> <lpieralisi@kernel.org>; Rafael J. Wysocki <rafael@kernel.org>; Robert
> Moore <robert.moore@intel.com>; Robin Murphy
> <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>; Will
> Deacon <will@kernel.org>; Alex Williamson <alex.williamson@redhat.com>;
> Eric Auger <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-
> philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; Michael Shavit
> <mshavit@google.com>; patches@lists.linux.dev; Shameerali Kolothum
> Thodi <shameerali.kolothum.thodi@huawei.com>; Mostafa Saleh
> <smostafa@google.com>
> Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
 
> As mentioned above, the VIOMMU series would be required to test the
> entire nesting feature, which now has a v2 rebasing on this series. I tested it
> with a paring QEMU branch. Please refer to:
> https://lore.kernel.org/linux-
> iommu/cover.1724776335.git.nicolinc@nvidia.com/

Thanks for this. I haven't gone through the viommu and its QEMU branch
yet. The way we present nested-smmuv3/iommufd to QEMU seems to have
changed with the above QEMU branch (multiple nested SMMUs). The old
QEMU command line for the nested setup doesn't work anymore.

Could you please share an example QEMU command line to verify this
series? (Sorry if I missed it in the links/git.)

Thanks,
Shameer
Nicolin Chen Aug. 28, 2024, 5:14 p.m. UTC | #3
Hi Shameer,

On Wed, Aug 28, 2024 at 04:31:36PM +0000, Shameerali Kolothum Thodi wrote:
> Hi Nicolin,
> 
> > -----Original Message-----
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, August 27, 2024 10:31 PM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: acpica-devel@lists.linux.dev; Guohanjun (Hanjun Guo)
> > <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> > <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
> > kvm@vger.kernel.org; Len Brown <lenb@kernel.org>; linux-
> > acpi@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Lorenzo Pieralisi
> > <lpieralisi@kernel.org>; Rafael J. Wysocki <rafael@kernel.org>; Robert
> > Moore <robert.moore@intel.com>; Robin Murphy
> > <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>; Will
> > Deacon <will@kernel.org>; Alex Williamson <alex.williamson@redhat.com>;
> > Eric Auger <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-
> > philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; Michael Shavit
> > <mshavit@google.com>; patches@lists.linux.dev; Shameerali Kolothum
> > Thodi <shameerali.kolothum.thodi@huawei.com>; Mostafa Saleh
> > <smostafa@google.com>
> > Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> >
> 
> > As mentioned above, the VIOMMU series would be required to test the
> > entire nesting feature, which now has a v2 rebasing on this series. I tested it
> > with a paring QEMU branch. Please refer to:
> > https://lore.kernel.org/linux-
> > iommu/cover.1724776335.git.nicolinc@nvidia.com/
> 
> Thanks for this. I haven't gone through the viommu and its Qemu branch
> yet.  The way we present nested-smmuv3/iommufd to the Qemu seems to
> have changed  with the above Qemu branch(multiple nested SMMUs).
> The old Qemu command line for nested setup doesn't work anymore.
> 
> Could you please share an example Qemu command line  to verify this
> series(Sorry, if I missed it in the links/git).

My bad. I updated those two "for_iommufd_" QEMU branches with a
README commit on top of each for the reference command.

By the way, I wonder how many SMMUv3 instances there are on the
platforms that SMMUv3 developers here are running on -- is someone
else also working on a chip that has multiple instances?

Thanks
Nicolin
Shameerali Kolothum Thodi Aug. 28, 2024, 6:06 p.m. UTC | #4
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, August 28, 2024 6:15 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; acpica-devel@lists.linux.dev;
> Guohanjun (Hanjun Guo) <guohanjun@huawei.com>;
> iommu@lists.linux.dev; Joerg Roedel <joro@8bytes.org>; Kevin Tian
> <kevin.tian@intel.com>; kvm@vger.kernel.org; Len Brown
> <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>;
> Robin Murphy <robin.murphy@arm.com>; Sudeep Holla
> <sudeep.holla@arm.com>; Will Deacon <will@kernel.org>; Alex Williamson
> <alex.williamson@redhat.com>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer
> <mdf@kernel.org>; Michael Shavit <mshavit@google.com>;
> patches@lists.linux.dev; Mostafa Saleh <smostafa@google.com>
> Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
> Hi Shameer,
> 
> On Wed, Aug 28, 2024 at 04:31:36PM +0000, Shameerali Kolothum Thodi
> wrote:
> > Hi Nicolin,
> >
> > > -----Original Message-----
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Tuesday, August 27, 2024 10:31 PM
> > > To: Jason Gunthorpe <jgg@nvidia.com>
> > > Cc: acpica-devel@lists.linux.dev; Guohanjun (Hanjun Guo)
> > > <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> > > <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
> > > kvm@vger.kernel.org; Len Brown <lenb@kernel.org>; linux-
> > > acpi@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Lorenzo
> > > Pieralisi <lpieralisi@kernel.org>; Rafael J. Wysocki
> > > <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin
> > > Murphy <robin.murphy@arm.com>; Sudeep Holla
> <sudeep.holla@arm.com>;
> > > Will Deacon <will@kernel.org>; Alex Williamson
> > > <alex.williamson@redhat.com>; Eric Auger <eric.auger@redhat.com>;
> > > Jean-Philippe Brucker <jean- philippe@linaro.org>; Moritz Fischer
> > > <mdf@kernel.org>; Michael Shavit <mshavit@google.com>;
> > > patches@lists.linux.dev; Shameerali Kolothum Thodi
> > > <shameerali.kolothum.thodi@huawei.com>; Mostafa Saleh
> > > <smostafa@google.com>
> > > Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested
> > > translation
> > >
> >
> > > As mentioned above, the VIOMMU series would be required to test the
> > > entire nesting feature, which now has a v2 rebasing on this series.
> > > I tested it with a paring QEMU branch. Please refer to:
> > > https://lore.kernel.org/linux-
> > > iommu/cover.1724776335.git.nicolinc@nvidia.com/
> >
> > Thanks for this. I haven't gone through the viommu and its Qemu branch
> > yet.  The way we present nested-smmuv3/iommufd to the Qemu seems to
> > have changed  with the above Qemu branch(multiple nested SMMUs).
> > The old Qemu command line for nested setup doesn't work anymore.
> >
> > Could you please share an example Qemu command line  to verify this
> > series(Sorry, if I missed it in the links/git).
> 
> My bad. I updated those two "for_iommufd_" QEMU branches with a
> README commit on top of each for the reference command.

Thanks. I did give it a go and this is my command line based on above,

./qemu-system-aarch64-nicolin-viommu -object iommufd,id=iommufd0 \
-machine hmat=on \
-machine virt,accel=kvm,gic-version=3,iommu=nested-smmuv3,ras=on \
-cpu host -smp cpus=61 -m size=16G,slots=4,maxmem=256G -nographic \
-object memory-backend-ram,size=8G,id=m0 \
-object memory-backend-ram,size=8G,id=m1 \
-numa node,memdev=m0,cpus=0-60,nodeid=0  -numa node,memdev=m1,nodeid=1 \
-device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0 \
-bios QEMU_EFI.fd \
-drive if=none,file=ubuntu-18.04-old.img,id=fs \
-device virtio-blk-device,drive=fs \
-kernel Image \
-append "rdinit=init console=ttyAMA0 root=/dev/vda rw earlycon=pl011,0x9000000 kpti=off" \
-nographic

But it fails to boot very early:

root@ubuntu:/home/shameer/qemu-test# ./qemu_run-simple-iommufd-nicolin-2
qemu-system-aarch64-nicolin-viommu: Illegal numa node 2
 
Any idea what I am missing? Do you have any special config enabled while building QEMU?

Thanks,
Shameer
Nicolin Chen Aug. 28, 2024, 6:12 p.m. UTC | #5
On Wed, Aug 28, 2024 at 06:06:36PM +0000, Shameerali Kolothum Thodi wrote:
> > > > As mentioned above, the VIOMMU series would be required to test the
> > > > entire nesting feature, which now has a v2 rebasing on this series.
> > > > I tested it with a paring QEMU branch. Please refer to:
> > > > https://lore.kernel.org/linux-
> > > > iommu/cover.1724776335.git.nicolinc@nvidia.com/
> > >
> > > Thanks for this. I haven't gone through the viommu and its Qemu branch
> > > yet.  The way we present nested-smmuv3/iommufd to the Qemu seems to
> > > have changed  with the above Qemu branch(multiple nested SMMUs).
> > > The old Qemu command line for nested setup doesn't work anymore.
> > >
> > > Could you please share an example Qemu command line  to verify this
> > > series(Sorry, if I missed it in the links/git).
> >
> > My bad. I updated those two "for_iommufd_" QEMU branches with a
> > README commit on top of each for the reference command.
> 
> Thanks. I did give it a go and this is my command line based on above,

> But it fails to boot very early:
> 
> root@ubuntu:/home/shameer/qemu-test# ./qemu_run-simple-iommufd-nicolin-2
> qemu-system-aarch64-nicolin-viommu: Illegal numa node 2
> 
> Any idea what am I missing? Do you any special config enabled while building Qemu?

Looks like you are running on a multi-SMMU platform :)

Would you please try syncing your local branch? That should work,
as the update also had a small change to the virt code:

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 161a28a311..a782909016 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1640,7 +1640,7 @@ static PCIBus *create_pcie_expander_bridge(VirtMachineState *vms, uint8_t idx)
     }

     qdev_prop_set_uint8(dev, "bus_nr", bus_nr);
-    qdev_prop_set_uint16(dev, "numa_node", idx);
+    qdev_prop_set_uint16(dev, "numa_node", 0);
     qdev_realize_and_unref(dev, BUS(bus), &error_fatal);

     /* Get the pxb bus */


Thanks
Nicolin
Jason Gunthorpe Aug. 28, 2024, 7:12 p.m. UTC | #6
On Tue, Aug 27, 2024 at 01:12:27PM -0700, Nicolin Chen wrote:
> > @@ -2693,6 +2718,15 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
> >  		 * one of them.
> >  		 */
> >  		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> > +		if (smmu_domain->enforce_cache_coherency &&
> > +		    !(dev_iommu_fwspec_get(master->dev)->flags &
> > +		      IOMMU_FWSPEC_PCI_RC_CANWBS)) {
> 
> How about a small dev_enforce_cache_coherency() helper?

I added a

+static bool arm_smmu_master_canwbs(struct arm_smmu_master *master)
+{
+       return dev_iommu_fwspec_get(master->dev)->flags &
+              IOMMU_FWSPEC_PCI_RC_CANWBS;
+}
+

> > +			kfree(master_domain);
> > +			spin_unlock_irqrestore(&smmu_domain->devices_lock,
> > +					       flags);
> > +			return -EINVAL;
> 
> kfree() doesn't need to be locked.

Yep

Thanks,
Jason
Shameerali Kolothum Thodi Aug. 29, 2024, 1:14 p.m. UTC | #7
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, August 28, 2024 7:13 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; acpica-devel@lists.linux.dev;
> Guohanjun (Hanjun Guo) <guohanjun@huawei.com>;
> iommu@lists.linux.dev; Joerg Roedel <joro@8bytes.org>; Kevin Tian
> <kevin.tian@intel.com>; kvm@vger.kernel.org; Len Brown
> <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>;
> Robin Murphy <robin.murphy@arm.com>; Sudeep Holla
> <sudeep.holla@arm.com>; Will Deacon <will@kernel.org>; Alex Williamson
> <alex.williamson@redhat.com>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer
> <mdf@kernel.org>; Michael Shavit <mshavit@google.com>;
> patches@lists.linux.dev; Mostafa Saleh <smostafa@google.com>
> Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
> On Wed, Aug 28, 2024 at 06:06:36PM +0000, Shameerali Kolothum Thodi
> wrote:
> > > > > As mentioned above, the VIOMMU series would be required to test
> the
> > > > > entire nesting feature, which now has a v2 rebasing on this series.
> > > > > I tested it with a paring QEMU branch. Please refer to:
> > > > > https://lore.kernel.org/linux-
> > > > > iommu/cover.1724776335.git.nicolinc@nvidia.com/
> > > >
> > > > Thanks for this. I haven't gone through the viommu and its Qemu
> branch
> > > > yet.  The way we present nested-smmuv3/iommufd to the Qemu seems
> to
> > > > have changed  with the above Qemu branch(multiple nested SMMUs).
> > > > The old Qemu command line for nested setup doesn't work anymore.
> > > >
> > > > Could you please share an example Qemu command line  to verify this
> > > > series(Sorry, if I missed it in the links/git).
> > >
> > > My bad. I updated those two "for_iommufd_" QEMU branches with a
> > > README commit on top of each for the reference command.
> >
> > Thanks. I did give it a go and this is my command line based on above,
> 
> > But it fails to boot very early:
> >
> > root@ubuntu:/home/shameer/qemu-test# ./qemu_run-simple-iommufd-
> nicolin-2
> > qemu-system-aarch64-nicolin-viommu: Illegal numa node 2
> >
> > Any idea what am I missing? Do you any special config enabled while
> building Qemu?
> 
> Looks like you are running on a multi-SMMU platform :)
> 
> Would you please try syncing your local branch? That should work,
> as the update also had a small change to the virt code:
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 161a28a311..a782909016 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -1640,7 +1640,7 @@ static PCIBus
> *create_pcie_expander_bridge(VirtMachineState *vms, uint8_t idx)
>      }
> 
>      qdev_prop_set_uint8(dev, "bus_nr", bus_nr);
> -    qdev_prop_set_uint16(dev, "numa_node", idx);
> +    qdev_prop_set_uint16(dev, "numa_node", 0);
>      qdev_realize_and_unref(dev, BUS(bus), &error_fatal);

That makes some progress. But I am still not seeing the assigned
device in the guest.

-device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0

root@ubuntu:/# lspci -tv#

root@ubuntu:/# lspci -tv
-+-[0000:ca]---00.0-[cb]--
 \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
             +-01.0  Red Hat, Inc Virtio network device
             +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
             \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge

The new root port is created, but no device attached.

But without iommufd,
-device vfio-pci-nohotplug,host=0000:75:00.1

root@ubuntu:/# lspci -tv
-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
           +-01.0  Red Hat, Inc Virtio network device
           +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-09.0  Red Hat, Inc. QEMU PCIe Expander bridge
           \-0a.0  Huawei Technologies Co., Ltd. Device a251

We can see dev a251.

And yes, the setup has multiple SMMUs (8).

Thanks,
Shameer
Shameerali Kolothum Thodi Aug. 29, 2024, 2:52 p.m. UTC | #8
> -----Original Message-----
> From: Shameerali Kolothum Thodi
> Sent: Thursday, August 29, 2024 2:15 PM
> To: 'Nicolin Chen' <nicolinc@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; acpica-devel@lists.linux.dev;
> Guohanjun (Hanjun Guo) <guohanjun@huawei.com>;
> iommu@lists.linux.dev; Joerg Roedel <joro@8bytes.org>; Kevin Tian
> <kevin.tian@intel.com>; kvm@vger.kernel.org; Len Brown
> <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>;
> Robin Murphy <robin.murphy@arm.com>; Sudeep Holla
> <sudeep.holla@arm.com>; Will Deacon <will@kernel.org>; Alex Williamson
> <alex.williamson@redhat.com>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer
> <mdf@kernel.org>; Michael Shavit <mshavit@google.com>;
> patches@lists.linux.dev; Mostafa Saleh <smostafa@google.com>
> Subject: RE: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
> That makes some progress. But still I am not seeing the assigned dev  in
> Guest.
> 
> -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0
> 
> root@ubuntu:/# lspci -tv#
> 
> root@ubuntu:/# lspci -tv
> -+-[0000:ca]---00.0-[cb]--
>  \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
>              +-01.0  Red Hat, Inc Virtio network device
>              +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge
> 
> The new root port is created, but no device attached.
It looks like the guest finds the config invalid:

[    0.283618] PCI host bridge to bus 0000:ca
[    0.284064] ACPI BIOS Error (bug): \_SB.PCF7.PCEE.PCE5.PCDC.PCD3.PCCA._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20240322/nsarguments-162)
[    0.285533] pci_bus 0000:ca: root bus resource [bus ca]
[    0.286214] pci 0000:ca:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root Port
[    0.287717] pci 0000:ca:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[    0.288431] pci 0000:ca:00.0: PCI bridge to [bus 00]
[    0.290649] pci 0000:ca:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    0.292476] pci_bus 0000:cb: busn_res: can not insert [bus cb-ca] under [bus ca] (conflicts with (null) [bus ca])
[    0.293597] pci_bus 0000:cb: busn_res: [bus cb-ca] end is updated to cb
[    0.294300] pci_bus 0000:cb: busn_res: can not insert [bus cb] under [bus ca] (conflicts with (null) [bus ca])

Let me know if you have any clue. 

Thanks,
Shameer
Nicolin Chen Aug. 29, 2024, 4:10 p.m. UTC | #9
On Thu, Aug 29, 2024 at 02:52:23PM +0000, Shameerali Kolothum Thodi wrote:
> > That makes some progress. But still I am not seeing the assigned dev  in
> > Guest.
> >
> > -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0
> >
> > root@ubuntu:/# lspci -tv#
> >
> > root@ubuntu:/# lspci -tv
> > -+-[0000:ca]---00.0-[cb]--
> >  \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
> >              +-01.0  Red Hat, Inc Virtio network device
> >              +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge

Hmm, the tree looks correct..

> > The new root port is created, but no device attached.
> It looks like Guest finds the config invalid:
> 
> [    0.283618] PCI host bridge to bus 0000:ca
> [    0.284064] ACPI BIOS Error (bug): \_SB.PCF7.PCEE.PCE5.PCDC.PCD3.PCCA._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20240322/nsarguments-162)

Looks like the DSM change wasn't clean. Yet, this might not be the
root cause, as mine could boot with it.

Here is mine (I added a print to that conflict part, for success):

[    0.340733] ACPI BIOS Error (bug): \_SB.PCF7.PCEE.PCE5.PCDC._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20230628/nsarguments-162)
[    0.341776] pci 0000:dc:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root Port
[    0.344895] pci 0000:dc:00.0: BAR 0 [mem 0x10400000-0x10400fff]
[    0.347935] pci 0000:dc:00.0: PCI bridge to [bus dd]
[    0.348410] pci 0000:dc:00.0:   bridge window [mem 0x10200000-0x103fffff]
[    0.349483] pci 0000:dc:00.0:   bridge window [mem 0x42000000000-0x44080ffffff 64bit pref]
[    0.351459] pci_bus 0000:dd: busn_res: insert [bus dd] under [bus dc-dd]

In my case:
[root bus (00)] <---[pxb (dc)] <--- [root-port (dd)] <--- dev

In your case:
[root bus (00)] <---[pxb (ca)] <--- [root-port (cb)] <--- dev

> [    0.285533] pci_bus 0000:ca: root bus resource [bus ca]
> [    0.286214] pci 0000:ca:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root Port
> [    0.287717] pci 0000:ca:00.0: BAR 0 [mem 0x00000000-0x00000fff]
> [    0.288431] pci 0000:ca:00.0: PCI bridge to [bus 00]

This is where it starts to differ. Somehow the link is reversed? It should be:
 [    0.288431] pci 0000:ca:00.0: PCI bridge to [bus cb]

> [    0.290649] pci 0000:ca:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> [    0.292476] pci_bus 0000:cb: busn_res: can not insert [bus cb-ca] under [bus ca] (conflicts with (null) [bus ca])
> [    0.293597] pci_bus 0000:cb: busn_res: [bus cb-ca] end is updated to cb
> [    0.294300] pci_bus 0000:cb: busn_res: can not insert [bus cb] under [bus ca] (conflicts with (null) [bus ca])

And then everything went south...

Would you please try adding some prints?
----------------------------------------------------------------------
@@ -1556,6 +1556,7 @@ static char *create_new_pcie_port(VirtNestedSmmu *nested_smmu, Error **errp)
     uint32_t bus_nr = pci_bus_num(nested_smmu->pci_bus);
     DeviceState *dev;
     char *name_port;
+    bool ret;
 
     /* Create a root port */
     dev = qdev_new("pcie-root-port");
@@ -1571,7 +1572,9 @@ static char *create_new_pcie_port(VirtNestedSmmu *nested_smmu, Error **errp)
     qdev_prop_set_uint32(dev, "chassis", chassis_nr);
     qdev_prop_set_uint32(dev, "slot", port_nr);
     qdev_prop_set_uint64(dev, "io-reserve", 0);
-    qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus), &error_fatal);
+    ret = qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus), &error_fatal);
+    fprintf(stderr, "ret=%d, pcie-root-port ID: %s, added to pxb_bus num: %x, chassis: %d\n",
+            ret, name_port, pci_bus_num(nested_smmu->pci_bus), chassis_nr);
     return name_port;
 }
 
----------------------------------------------------------------------

We should make sure that the 'bus_nr' and 'bus' are set correctly
and that all the realize() calls returned true:

Thanks
Nicolin
Shameerali Kolothum Thodi Aug. 30, 2024, 9:07 a.m. UTC | #10
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, August 29, 2024 5:10 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; acpica-devel@lists.linux.dev;
> Guohanjun (Hanjun Guo) <guohanjun@huawei.com>; iommu@lists.linux.dev;
> Joerg Roedel <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
> kvm@vger.kernel.org; Len Brown <lenb@kernel.org>; linux-
> acpi@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Lorenzo Pieralisi
> <lpieralisi@kernel.org>; Rafael J. Wysocki <rafael@kernel.org>; Robert Moore
> <robert.moore@intel.com>; Robin Murphy <robin.murphy@arm.com>; Sudeep
> Holla <sudeep.holla@arm.com>; Will Deacon <will@kernel.org>; Alex
> Williamson <alex.williamson@redhat.com>; Eric Auger
> <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-philippe@linaro.org>;
> Moritz Fischer <mdf@kernel.org>; Michael Shavit <mshavit@google.com>;
> patches@lists.linux.dev; Mostafa Saleh <smostafa@google.com>
> Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
> On Thu, Aug 29, 2024 at 02:52:23PM +0000, Shameerali Kolothum Thodi wrote:
> > > That makes some progress. But still I am not seeing the assigned dev  in
> > > Guest.
> > >
> > > -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0
> > >
> > > root@ubuntu:/# lspci -tv#
> > >
> > > root@ubuntu:/# lspci -tv
> > > -+-[0000:ca]---00.0-[cb]--
> > >  \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
> > >              +-01.0  Red Hat, Inc Virtio network device
> > >              +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge
> 
> Hmm, the tree looks correct..
> 
> > > The new root port is created, but no device attached.
> > It looks like Guest finds the config invalid:
> >
> > [    0.283618] PCI host bridge to bus 0000:ca
> > [    0.284064] ACPI BIOS Error (bug):
> \_SB.PCF7.PCEE.PCE5.PCDC.PCD3.PCCA._DSM: Excess arguments - ASL declared
> 5, ACPI requires 4 (20240322/nsarguments-162)
> 
> Looks like the DSM change wasn't clean. Yet, this might not be the
> root cause, as mine could boot with it.

Yes. This is not the culprit in this case and was reported earlier as well,

https://patchew.org/QEMU/20211005085313.493858-1-eric.auger@redhat.com/20211005085313.493858-2-eric.auger@redhat.com/

> Here is mine (I added a print to that conflict part, for success):
> 
> [    0.340733] ACPI BIOS Error (bug): \_SB.PCF7.PCEE.PCE5.PCDC._DSM: Excess
> arguments - ASL declared 5, ACPI requires 4 (20230628/nsarguments-162)
> [    0.341776] pci 0000:dc:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root
> Port
> [    0.344895] pci 0000:dc:00.0: BAR 0 [mem 0x10400000-0x10400fff]
> [    0.347935] pci 0000:dc:00.0: PCI bridge to [bus dd]
> [    0.348410] pci 0000:dc:00.0:   bridge window [mem 0x10200000-0x103fffff]
> [    0.349483] pci 0000:dc:00.0:   bridge window [mem 0x42000000000-
> 0x44080ffffff 64bit pref]
> [    0.351459] pci_bus 0000:dd: busn_res: insert [bus dd] under [bus dc-dd]
> 
> In my case:
> [root bus (00)] <---[pxb (dc)] <--- [root-port (dd)] <--- dev
> 
> In your case:
> [root bus (00)] <---[pxb (ca)] <--- [root-port (cb)] <--- dev
> 
> > [    0.285533] pci_bus 0000:ca: root bus resource [bus ca]
> > [    0.286214] pci 0000:ca:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root
> Port
> > [    0.287717] pci 0000:ca:00.0: BAR 0 [mem 0x00000000-0x00000fff]
> > [    0.288431] pci 0000:ca:00.0: PCI bridge to [bus 00]
> 
> This starts to diff. Somehow the link is reversed? It should be:
>  [    0.288431] pci 0000:ca:00.0: PCI bridge to [bus cb]
> 
> > [    0.290649] pci 0000:ca:00.0: bridge configuration invalid ([bus 00-00]),
> reconfiguring
> > [    0.292476] pci_bus 0000:cb: busn_res: can not insert [bus cb-ca] under [bus
> ca] (conflicts with (null) [bus ca])
> > [    0.293597] pci_bus 0000:cb: busn_res: [bus cb-ca] end is updated to cb
> > [    0.294300] pci_bus 0000:cb: busn_res: can not insert [bus cb] under [bus ca]
> (conflicts with (null) [bus ca])
> 
> And then everything went south...
> 
> Would you please try adding some prints?
> ----------------------------------------------------------------------
> @@ -1556,6 +1556,7 @@ static char *create_new_pcie_port(VirtNestedSmmu
> *nested_smmu, Error **errp)
>      uint32_t bus_nr = pci_bus_num(nested_smmu->pci_bus);
>      DeviceState *dev;
>      char *name_port;
> +    bool ret;
> 
>      /* Create a root port */
>      dev = qdev_new("pcie-root-port");
> @@ -1571,7 +1572,9 @@ static char *create_new_pcie_port(VirtNestedSmmu
> *nested_smmu, Error **errp)
>      qdev_prop_set_uint32(dev, "chassis", chassis_nr);
>      qdev_prop_set_uint32(dev, "slot", port_nr);
>      qdev_prop_set_uint64(dev, "io-reserve", 0);
> -    qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus), &error_fatal);
> +    ret = qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus),
> &error_fatal);
> +    fprintf(stderr, "ret=%d, pcie-root-port ID: %s, added to pxb_bus num: %x,
> chassis: %d\n",
> +            ret, name_port, pci_bus_num(nested_smmu->pci_bus), chassis_nr);
>      return name_port;
>  }

Print shows everything fine:
create_new_pcie_port: name_port smmu_bus0xca_port0, bus_nr 0xca chassis_nr 0xfd, nested_smmu->index 0x2, pci_bus_num 0xca, ret 1

It looks like a problem with an old QEMU_EFI.fd (2022 build and before).
I tried with a 2023 QEMU_EFI.fd and with that it looks fine.

root@ubuntu:/# lspci -tv
-+-[0000:ca]---00.0-[cb]----00.0  Huawei Technologies Co., Ltd. Device a251
 \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
             +-01.0  Red Hat, Inc Virtio network device
             +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
             \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge

So for now, I can proceed.

Thanks,
Shameer
Nicolin Chen Aug. 30, 2024, 5:01 p.m. UTC | #11
On Fri, Aug 30, 2024 at 09:07:16AM +0000, Shameerali Kolothum Thodi wrote:

> Print shows everything fine:
> create_new_pcie_port: name_port smmu_bus0xca_port0, bus_nr 0xca chassis_nr 0xfd, nested_smmu->index 0x2, pci_bus_num 0xca, ret 1
> 
> It looks like a problem with old QEMU_EFI.fd(2022 build and before).
> I tried with 2023 QEMU_EFI.fd and with that it looks fine.
> 
> root@ubuntu:/# lspci -tv
> -+-[0000:ca]---00.0-[cb]----00.0  Huawei Technologies Co., Ltd. Device a251
>  \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
>              +-01.0  Red Hat, Inc Virtio network device
>              +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge
> 
> So for now, I can proceed.

Nice! That's a relief, for now :)

Thanks
Nicolin
Zhangfei Gao Sept. 12, 2024, 3:42 a.m. UTC | #12
Hi, Nico

On Wed, 28 Aug 2024 at 05:32, Nicolin Chen <nicolinc@nvidia.com> wrote:
>
> On Tue, Aug 27, 2024 at 12:51:30PM -0300, Jason Gunthorpe wrote:
> > This brings support for the IOMMFD ioctls:
> >
> >  - IOMMU_GET_HW_INFO
> >  - IOMMU_HWPT_ALLOC_NEST_PARENT
> >  - IOMMU_DOMAIN_NESTED
> >  - ops->enforce_cache_coherency()
> >
> > This is quite straightforward as the nested STE can just be built in the
> > special NESTED domain op and fed through the generic update machinery.
> >
> > The design allows the user provided STE fragment to control several
> > aspects of the translation, including putting the STE into a "virtual
> > bypass" or a aborting state. This duplicates functionality available by
> > other means, but it allows trivially preserving the VMID in the STE as we
> > eventually move towards the VIOMMU owning the VMID.
> >
> > Nesting support requires the system to either support S2FWB or the
> > stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> > cache and view incoherent data, currently VFIO lacks any cache flushing
> > that would make this safe.
> >
> > Yan has a series to add some of the needed infrastructure for VFIO cache
> > flushing here:
> >
> >  https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/
> >
> > Which may someday allow relaxing this further.
> >
> > Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and superseded by
> > this.
> >
> > This is the first series in what will be several to complete nesting
> > support. At least:
> >  - IOMMU_RESV_SW_MSI related fixups
> >     https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
> >  - VIOMMU object support to allow ATS and CD invalidations
> >     https://lore.kernel.org/linux-iommu/cover.1723061377.git.nicolinc@nvidia.com/
> >  - vCMDQ hypervisor support for direct invalidation queue assignment
> >     https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
> >  - KVM pinned VMID using VIOMMU for vBTM
> >     https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
> >  - Cross instance S2 sharing
> >  - Virtual Machine Structure using VIOMMU (for vMPAM?)
> >  - Fault forwarding support through IOMMUFD's fault fd for vSVA
> >
> > The VIOMMU series is essential to allow the invalidations to be processed
> > for the CD as well.
> >
> > It is enough to allow qemu work to progress.
> >
> > This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
> >
> > v2:
>
> As mentioned above, the VIOMMU series would be required to test
> the entire nesting feature, which now has a v2 rebasing on this
> series. I tested it with a paring QEMU branch. Please refer to:
> https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
> Also, there is another new VIRQ series on top of the VIOMMU one
> and this nesting series. And I tested it too. Please refer to:
> https://lore.kernel.org/linux-iommu/cover.1724777091.git.nicolinc@nvidia.com/
>
> With that,
>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
>
Have you tested the user page fault?

I hit an issue: when a user page fault happens,
 group->attach_handle = iommu_attach_handle_get(pasid)
returns NULL.

A bit confused here; I only see IOMMU_NO_PASID being used when attaching:

 __fault_domain_replace_dev
ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
&handle->handle);
curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);

I can't find where the code attaches the user PASID to an attach_handle.

Thanks
Nicolin Chen Sept. 12, 2024, 4:05 a.m. UTC | #13
On Thu, Sep 12, 2024 at 11:42:43AM +0800, Zhangfei Gao wrote:
> > > The VIOMMU series is essential to allow the invalidations to be processed
> > > for the CD as well.
> > >
> > > It is enough to allow qemu work to progress.
> > >
> > > This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
> > >
> > > v2:
> >
> > As mentioned above, the VIOMMU series would be required to test
> > the entire nesting feature, which now has a v2 rebasing on this
> > series. I tested it with a paring QEMU branch. Please refer to:
> > https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
> > Also, there is another new VIRQ series on top of the VIOMMU one
> > and this nesting series. And I tested it too. Please refer to:
> > https://lore.kernel.org/linux-iommu/cover.1724777091.git.nicolinc@nvidia.com/
> >
> > With that,
> >
> > Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> >
> Have you tested the user page fault?

No, I don't have a HW to test PRI. So, I've little experience with
the IOPF and its counter part in QEMU. I recall that Shameer has a
series of changes in QEMU.

> I got an issue, when a user page fault happens,
>  group->attach_handle = iommu_attach_handle_get(pasid)
> return NULL.
> 
> A bit confused here, only find IOMMU_NO_PASID is used when attaching
> 
>  __fault_domain_replace_dev
> ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> &handle->handle);
> curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> 
> not find where the code attach user pasid with the attach_handle.

In the SMMUv3 case (the latest design), a DOMAIN_NESTED is a CD/PASID
table, so a single attach_handle (domain/idev) should be enough
to cover the entire thing. Please refer to:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/io-pgfault.c?h=v6.11-rc7#n127

Thanks
Nicolin
Baolu Lu Sept. 12, 2024, 4:25 a.m. UTC | #14
On 9/12/24 11:42 AM, Zhangfei Gao wrote:
> On Wed, 28 Aug 2024 at 05:32, Nicolin Chen<nicolinc@nvidia.com>  wrote:
>> On Tue, Aug 27, 2024 at 12:51:30PM -0300, Jason Gunthorpe wrote:
>>> This brings support for the IOMMFD ioctls:
>>>
>>>   - IOMMU_GET_HW_INFO
>>>   - IOMMU_HWPT_ALLOC_NEST_PARENT
>>>   - IOMMU_DOMAIN_NESTED
>>>   - ops->enforce_cache_coherency()
>>>
>>> This is quite straightforward as the nested STE can just be built in the
>>> special NESTED domain op and fed through the generic update machinery.
>>>
>>> The design allows the user provided STE fragment to control several
>>> aspects of the translation, including putting the STE into a "virtual
>>> bypass" or a aborting state. This duplicates functionality available by
>>> other means, but it allows trivially preserving the VMID in the STE as we
>>> eventually move towards the VIOMMU owning the VMID.
>>>
>>> Nesting support requires the system to either support S2FWB or the
>>> stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
>>> cache and view incoherent data, currently VFIO lacks any cache flushing
>>> that would make this safe.
>>>
>>> Yan has a series to add some of the needed infrastructure for VFIO cache
>>> flushing here:
>>>
>>>   https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/
>>>
>>> Which may someday allow relaxing this further.
>>>
>>> Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and superseded by
>>> this.
>>>
>>> This is the first series in what will be several to complete nesting
>>> support. At least:
>>>   - IOMMU_RESV_SW_MSI related fixups
>>>      https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
>>>   - VIOMMU object support to allow ATS and CD invalidations
>>>      https://lore.kernel.org/linux-iommu/cover.1723061377.git.nicolinc@nvidia.com/
>>>   - vCMDQ hypervisor support for direct invalidation queue assignment
>>>      https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
>>>   - KVM pinned VMID using VIOMMU for vBTM
>>>      https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
>>>   - Cross instance S2 sharing
>>>   - Virtual Machine Structure using VIOMMU (for vMPAM?)
>>>   - Fault forwarding support through IOMMUFD's fault fd for vSVA
>>>
>>> The VIOMMU series is essential to allow the invalidations to be processed
>>> for the CD as well.
>>>
>>> It is enough to allow qemu work to progress.
>>>
>>> This is on github:https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
>>>
>>> v2:
>> As mentioned above, the VIOMMU series would be required to test
>> the entire nesting feature, which now has a v2 rebasing on this
>> series. I tested it with a paring QEMU branch. Please refer to:
>> https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
>> Also, there is another new VIRQ series on top of the VIOMMU one
>> and this nesting series. And I tested it too. Please refer to:
>> https://lore.kernel.org/linux-iommu/cover.1724777091.git.nicolinc@nvidia.com/
>>
>> With that,
>>
>> Tested-by: Nicolin Chen<nicolinc@nvidia.com>
>>
> Have you tested the user page fault?
> 
> I got an issue, when a user page fault happens,
>   group->attach_handle = iommu_attach_handle_get(pasid)
> return NULL.
> 
> A bit confused here, only find IOMMU_NO_PASID is used when attaching
> 
>   __fault_domain_replace_dev
> ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> &handle->handle);
> curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> 
> not find where the code attach user pasid with the attach_handle.

Have you set iommu_ops::user_pasid_table for SMMUv3 driver?

Thanks,
baolu
Zhangfei Gao Sept. 12, 2024, 7:32 a.m. UTC | #15
On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 9/12/24 11:42 AM, Zhangfei Gao wrote:
> > On Wed, 28 Aug 2024 at 05:32, Nicolin Chen<nicolinc@nvidia.com>  wrote:
> >> On Tue, Aug 27, 2024 at 12:51:30PM -0300, Jason Gunthorpe wrote:
> >>> This brings support for the IOMMFD ioctls:
> >>>
> >>>   - IOMMU_GET_HW_INFO
> >>>   - IOMMU_HWPT_ALLOC_NEST_PARENT
> >>>   - IOMMU_DOMAIN_NESTED
> >>>   - ops->enforce_cache_coherency()
> >>>
> >>> This is quite straightforward as the nested STE can just be built in the
> >>> special NESTED domain op and fed through the generic update machinery.
> >>>
> >>> The design allows the user provided STE fragment to control several
> >>> aspects of the translation, including putting the STE into a "virtual
> >>> bypass" or a aborting state. This duplicates functionality available by
> >>> other means, but it allows trivially preserving the VMID in the STE as we
> >>> eventually move towards the VIOMMU owning the VMID.
> >>>
> >>> Nesting support requires the system to either support S2FWB or the
> >>> stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> >>> cache and view incoherent data, currently VFIO lacks any cache flushing
> >>> that would make this safe.
> >>>
> >>> Yan has a series to add some of the needed infrastructure for VFIO cache
> >>> flushing here:
> >>>
> >>>   https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/
> >>>
> >>> Which may someday allow relaxing this further.
> >>>
> >>> Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and superseded by
> >>> this.
> >>>
> >>> This is the first series in what will be several to complete nesting
> >>> support. At least:
> >>>   - IOMMU_RESV_SW_MSI related fixups
> >>>      https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
> >>>   - VIOMMU object support to allow ATS and CD invalidations
> >>>      https://lore.kernel.org/linux-iommu/cover.1723061377.git.nicolinc@nvidia.com/
> >>>   - vCMDQ hypervisor support for direct invalidation queue assignment
> >>>      https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
> >>>   - KVM pinned VMID using VIOMMU for vBTM
> >>>      https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
> >>>   - Cross instance S2 sharing
> >>>   - Virtual Machine Structure using VIOMMU (for vMPAM?)
> >>>   - Fault forwarding support through IOMMUFD's fault fd for vSVA
> >>>
> >>> The VIOMMU series is essential to allow the invalidations to be processed
> >>> for the CD as well.
> >>>
> >>> It is enough to allow qemu work to progress.
> >>>
> >>> This is on github:https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
> >>>
> >>> v2:
> >> As mentioned above, the VIOMMU series would be required to test
> >> the entire nesting feature, which now has a v2 rebasing on this
> >> series. I tested it with a paring QEMU branch. Please refer to:
> >> https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
> >> Also, there is another new VIRQ series on top of the VIOMMU one
> >> and this nesting series. And I tested it too. Please refer to:
> >> https://lore.kernel.org/linux-iommu/cover.1724777091.git.nicolinc@nvidia.com/
> >>
> >> With that,
> >>
> >> Tested-by: Nicolin Chen<nicolinc@nvidia.com>
> >>
> > Have you tested the user page fault?
> >
> > I got an issue, when a user page fault happens,
> >   group->attach_handle = iommu_attach_handle_get(pasid)
> > return NULL.
> >
> > A bit confused here, only find IOMMU_NO_PASID is used when attaching
> >
> >   __fault_domain_replace_dev
> > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > &handle->handle);
> > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> >
> > not find where the code attach user pasid with the attach_handle.
>
> Have you set iommu_ops::user_pasid_table for SMMUv3 driver?

Thanks Baolu, Nico

Yes, after adding to arm_smmu_ops:
+       .user_pasid_table       = 1,

find_fault_handler gets past attach_handle =
iommu_attach_handle_get(IOMMU_NO_PASID); and the QEMU handler also
gets called.

But the hardware reports errors and needs a reset; still checking:
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0

Thanks
Zhangfei Gao Oct. 15, 2024, 3:21 a.m. UTC | #16
On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:

> > Have you tested the user page fault?
> >
> > I got an issue, when a user page fault happens,
> >   group->attach_handle = iommu_attach_handle_get(pasid)
> > return NULL.
> >
> > A bit confused here, only find IOMMU_NO_PASID is used when attaching
> >
> >   __fault_domain_replace_dev
> > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > &handle->handle);
> > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> >
> > not find where the code attach user pasid with the attach_handle.
>
> Have you set iommu_ops::user_pasid_table for SMMUv3 driver?

Thanks Baolu

Can we send a patch to make it the default?

+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3570,6 +3570,7 @@ static struct iommu_ops arm_smmu_ops = {
        .viommu_alloc           = arm_vsmmu_alloc,
        .pgsize_bitmap          = -1UL, /* Restricted during device attach */
        .owner                  = THIS_MODULE,
+       .user_pasid_table       = 1,


Thanks
Jason Gunthorpe Oct. 15, 2024, 1:09 p.m. UTC | #17
On Tue, Oct 15, 2024 at 11:21:54AM +0800, Zhangfei Gao wrote:
> On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:
> 
> > > Have you tested the user page fault?
> > >
> > > I got an issue, when a user page fault happens,
> > >   group->attach_handle = iommu_attach_handle_get(pasid)
> > > return NULL.
> > >
> > > A bit confused here, only find IOMMU_NO_PASID is used when attaching
> > >
> > >   __fault_domain_replace_dev
> > > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > > &handle->handle);
> > > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> > >
> > > not find where the code attach user pasid with the attach_handle.
> >
> > Have you set iommu_ops::user_pasid_table for SMMUv3 driver?
> 
> Thanks Baolu
> 
> Can we send a patch to make it as default?
> 
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3570,6 +3570,7 @@ static struct iommu_ops arm_smmu_ops = {
>         .viommu_alloc           = arm_vsmmu_alloc,
>         .pgsize_bitmap          = -1UL, /* Restricted during device attach */
>         .owner                  = THIS_MODULE,
> +       .user_pasid_table       = 1,

You shouldn't need this right now as smmu3 doesn't support nesting
domains yet.

			if (!ops->user_pasid_table)
				return NULL;
			/*
			 * The iommu driver for this device supports user-
			 * managed PASID table. Therefore page faults for
			 * any PASID should go through the NESTING domain
			 * attached to the device RID.
			 */
			attach_handle = iommu_attach_handle_get(
					dev->iommu_group, IOMMU_NO_PASID,
					IOMMU_DOMAIN_NESTED);
			if (IS_ERR(attach_handle))
                        ^^^^^^^^^^^^^^^^^^^^^ Will always fail


But I will add it to the patch that adds IOMMU_DOMAIN_NESTED

Jason
Zhangfei Gao Oct. 16, 2024, 2:23 a.m. UTC | #18
Hi, Jason

On Tue, 27 Aug 2024 at 23:51, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> This brings support for the IOMMFD ioctls:
>
>  - IOMMU_GET_HW_INFO
>  - IOMMU_HWPT_ALLOC_NEST_PARENT
>  - IOMMU_DOMAIN_NESTED
>  - ops->enforce_cache_coherency()
>
> This is quite straightforward as the nested STE can just be built in the
> special NESTED domain op and fed through the generic update machinery.
>
> The design allows the user provided STE fragment to control several
> aspects of the translation, including putting the STE into a "virtual
> bypass" or a aborting state. This duplicates functionality available by
> other means, but it allows trivially preserving the VMID in the STE as we
> eventually move towards the VIOMMU owning the VMID.
>
> Nesting support requires the system to either support S2FWB or the
> stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> cache and view incoherent data, currently VFIO lacks any cache flushing
> that would make this safe.

What if the system supports neither S2FWB nor CANWBS? Is there any
workaround for passthrough?
Currently I am testing nesting by ignoring this check.

Thanks
Jason Gunthorpe Oct. 16, 2024, 11:53 a.m. UTC | #19
On Wed, Oct 16, 2024 at 10:23:49AM +0800, Zhangfei Gao wrote:

> > Nesting support requires the system to either support S2FWB or the
> > stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> > cache and view incoherent data, currently VFIO lacks any cache flushing
> > that would make this safe.
> 
> What if the system does not support S2FWB or CANWBS, any workaround to
> passthrough?

Eventually we can add the required cache flushing to VFIO, but that
would have to be a followup.

> Currently I am testing nesting by ignoring this check.

This is probably OK, but I wouldn't run it in a production environment
with a hostile VM.

Jason
Zhangfei Gao Oct. 17, 2024, 1:53 a.m. UTC | #20
On Tue, 15 Oct 2024 at 21:09, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Oct 15, 2024 at 11:21:54AM +0800, Zhangfei Gao wrote:
> > On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:
> >
> > > > Have you tested the user page fault?
> > > >
> > > > I got an issue, when a user page fault happens,
> > > >   group->attach_handle = iommu_attach_handle_get(pasid)
> > > > return NULL.
> > > >
> > > > A bit confused here, only find IOMMU_NO_PASID is used when attaching
> > > >
> > > >   __fault_domain_replace_dev
> > > > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > > > &handle->handle);
> > > > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> > > >
> > > > not find where the code attach user pasid with the attach_handle.
> > >
> > > Have you set iommu_ops::user_pasid_table for SMMUv3 driver?
> >
> > Thanks Baolu
> >
> > Can we send a patch to make it as default?
> >
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -3570,6 +3570,7 @@ static struct iommu_ops arm_smmu_ops = {
> >         .viommu_alloc           = arm_vsmmu_alloc,
> >         .pgsize_bitmap          = -1UL, /* Restricted during device attach */
> >         .owner                  = THIS_MODULE,
> > +       .user_pasid_table       = 1,
>
> You shouldn't need this right now as smmu3 doesn't support nesting
> domains yet.

I am testing with .user_pasid_table = 1 and IOMMU_NO_PASID.
It works for user page faults.

>
>                         if (!ops->user_pasid_table)
>                                 return NULL;
>                         /*
>                          * The iommu driver for this device supports user-
>                          * managed PASID table. Therefore page faults for
>                          * any PASID should go through the NESTING domain
>                          * attached to the device RID.
>                          */
>                         attach_handle = iommu_attach_handle_get(
>                                         dev->iommu_group, IOMMU_NO_PASID,
>                                         IOMMU_DOMAIN_NESTED);
>                         if (IS_ERR(attach_handle))
>                         ^^^^^^^^^^^^^^^^^^^^^ Will always fail
>
>
> But I will add it to the patch that adds IOMMU_DOMAIN_NESTED

OK, cool.

Thanks
Jason Gunthorpe Oct. 17, 2024, 11:57 a.m. UTC | #21
On Thu, Oct 17, 2024 at 09:53:22AM +0800, Zhangfei Gao wrote:
> On Tue, 15 Oct 2024 at 21:09, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Oct 15, 2024 at 11:21:54AM +0800, Zhangfei Gao wrote:
> > > On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:
> > >
> > > > > Have you tested the user page fault?
> > > > >
> > > > > I got an issue, when a user page fault happens,
> > > > >   group->attach_handle = iommu_attach_handle_get(pasid)
> > > > > return NULL.
> > > > >
> > > > > A bit confused here, only find IOMMU_NO_PASID is used when attaching
> > > > >
> > > > >   __fault_domain_replace_dev
> > > > > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > > > > &handle->handle);
> > > > > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> > > > >
> > > > > not find where the code attach user pasid with the attach_handle.
> > > >
> > > > Have you set iommu_ops::user_pasid_table for SMMUv3 driver?
> > >
> > > Thanks Baolu
> > >
> > > Can we send a patch to make it as default?
> > >
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -3570,6 +3570,7 @@ static struct iommu_ops arm_smmu_ops = {
> > >         .viommu_alloc           = arm_vsmmu_alloc,
> > >         .pgsize_bitmap          = -1UL, /* Restricted during device attach */
> > >         .owner                  = THIS_MODULE,
> > > +       .user_pasid_table       = 1,
> >
> > You shouldn't need this right now as smmu3 doesn't support nesting
> > domains yet.
> 
> I am testing with  .user_pasid_table = 1 and IOMMU_NO_PASID
> It works for user page faults.

You shouldn't need user_pasid_table for that case; it is only
necessary if you are doing nesting.

Jason