Message ID | 0-v1-54e734311a7f+14f72-smmuv3_nesting_jgg@nvidia.com |
---|---|
Headers | show |
Series | Initial support for SMMUv3 nested translation | expand |
> -----Original Message----- > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, August 7, 2024 12:41 AM > To: acpica-devel@lists.linux.dev; Alex Williamson > <alex.williamson@redhat.com>; Guohanjun (Hanjun Guo) > <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel > <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>; kvm@vger.kernel.org; > Len Brown <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm- > kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J. > Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin > Murphy <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>; > Will Deacon <will@kernel.org> > Cc: Eric Auger <eric.auger@redhat.com>; Jean-Philippe Brucker <jean- > philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; Michael Shavit > <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>; > patches@lists.linux.dev; Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com> > Subject: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field > works. When S2FWB is supported and enabled the IOPTE will force cachable > access to IOMMU_CACHE memory and deny cachable access otherwise. > > This is not especially meaningful for simple S2 domains, it apparently > doesn't even force PCI no-snoop access to be coherent. > > However, when used with a nested S1, FWB has the effect of preventing the > guest from choosing a MemAttr that would cause ordinary DMA to bypass the > cache. Consistent with KVM we wish to deny the guest the ability to become > incoherent with cached memory the hypervisor believes is cachable so we > don't have to flush it. > > Turn on S2FWB whenever the SMMU supports it and use it for all S2 > mappings. > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > --- > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 6 ++++++ > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 3 +++ > drivers/iommu/io-pgtable-arm.c | 24 +++++++++++++++++---- > include/linux/io-pgtable.h | 2 ++ > 4 files changed, 31 insertions(+), 4 deletions(-) > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > index 531125f231b662..7fe1e27d11586c 100644 > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > @@ -1612,6 +1612,8 @@ void arm_smmu_make_s2_domain_ste(struct > arm_smmu_ste *target, > FIELD_PREP(STRTAB_STE_1_EATS, > ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0)); > > + if (smmu->features & ARM_SMMU_FEAT_S2FWB) > + target->data[1] |= cpu_to_le64(STRTAB_STE_1_S2FWB); > if (smmu->features & ARM_SMMU_FEAT_ATTR_TYPES_OVR) > target->data[1] |= > cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG, > > STRTAB_STE_1_SHCFG_INCOMING)); > @@ -2400,6 +2402,8 @@ static int arm_smmu_domain_finalise(struct > arm_smmu_domain *smmu_domain, > pgtbl_cfg.oas = smmu->oas; > fmt = ARM_64_LPAE_S2; > finalise_stage_fn = arm_smmu_domain_finalise_s2; > + if (smmu->features & ARM_SMMU_FEAT_S2FWB) > + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB; This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks check. Thanks, Shameer
On Fri, Aug 09, 2024 at 02:26:13PM +0000, Shameerali Kolothum Thodi wrote: > > + if (smmu->features & ARM_SMMU_FEAT_S2FWB) > > + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB; > > This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks check. Yep, fixed I was hoping you had HW to test this.. Thanks, Jason
> -----Original Message----- > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, August 9, 2024 4:12 PM > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > Cc: acpica-devel@lists.linux.dev; Alex Williamson > <alex.williamson@redhat.com>; Guohanjun (Hanjun Guo) > <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel > <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>; kvm@vger.kernel.org; > Len Brown <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm- > kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J. > Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin > Murphy <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>; > Will Deacon <will@kernel.org>; Eric Auger <eric.auger@redhat.com>; Jean- > Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; > Michael Shavit <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>; > patches@lists.linux.dev > Subject: Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available > > On Fri, Aug 09, 2024 at 02:26:13PM +0000, Shameerali Kolothum Thodi wrote: > > > + if (smmu->features & ARM_SMMU_FEAT_S2FWB) > > > + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB; > > > > This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks > check. > > Yep, fixed I was hoping you had HW to test this.. Let me see if I can get hold of a test setup that supports S2FWB. I do have another concern with respect to the hardware we have which doesn't support S2FWB, but those can claim CANWBS. The problem is, BIOS update is not a very liked/feasible solution to already deployed ones. But we can probably add an option/quirk in SMMUv3 driver for those platforms(based on ACPI_IORT_SMMU_V3_HISILICON_HI161X). I hope this is fine. Thanks, Shameer
On Thu, Aug 15, 2024 at 04:14:22PM +0000, Shameerali Kolothum Thodi wrote: > > On Fri, Aug 09, 2024 at 02:26:13PM +0000, Shameerali Kolothum Thodi wrote: > > > > + if (smmu->features & ARM_SMMU_FEAT_S2FWB) > > > > + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB; > > > > > > This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks > > check. > > > > Yep, fixed I was hoping you had HW to test this.. > > Let me see if I can get hold of a test setup that supports S2FWB. Thanks! > I do have another concern with respect to the hardware we have which doesn't > support S2FWB, but those can claim CANWBS. The problem is, BIOS update is not > a very liked/feasible solution to already deployed ones. But we can probably add > an option/quirk in SMMUv3 driver for those platforms(based on > ACPI_IORT_SMMU_V3_HISILICON_HI161X). I hope this is fine. I don't have an issue with doing that, if you can reliably identify the platform in some way a kernel quirk seems reasonable. Thanks, Jason
Hi Jason, On Tue, Aug 06, 2024 at 08:41:13PM -0300, Jason Gunthorpe wrote: > This brings support for the IOMMFD ioctls: > > - IOMMU_GET_HW_INFO > - IOMMU_HWPT_ALLOC_NEST_PARENT > - IOMMU_DOMAIN_NESTED > - ops->enforce_cache_coherency() > > This is quite straightforward as the nested STE can just be built in the > special NESTED domain op and fed through the generic update machinery. > > The design allows the user provided STE fragment to control several > aspects of the translation, including putting the STE into a "virtual > bypass" or a aborting state. This duplicates functionality available by > other means, but it allows trivially preserving the VMID in the STE as we > eventually move towards the VIOMMU owning the VMID. > > Nesting support requires the system to either support S2FWB or the > stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the > cache and view incoherent data, currently VFIO lacks any cache flushing > that would make this safe. > > Yan has a series to add some of the needed infrastructure for VFIO cache > flushing here: > > https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/ > > Which may someday allow relaxing this further. > > Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and superseded by > this. > > This is the first series in what will be several to complete nesting > support. At least: > - IOMMU_RESV_SW_MSI related fixups > - VIOMMU object support to allow ATS invalidations > - vCMDQ hypervisor support for direct invalidation queue assignment > - KVM pinned VMID using VIOMMU for vBTM > - Cross instance S2 sharing > - Virtual Machine Structure using VIOMMU (for vMPAM?) > - Fault forwarding support through IOMMUFD's fault fd for vSVA > > It is enough to allow significant amounts of qemu work to progress. > Are there any qemu patches to tests this? As I am confused with some of the user space bits and that would help. Thanks, Mostafa > This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting > > Jason Gunthorpe (5): > vfio: Remove VFIO_TYPE1_NESTING_IOMMU > iommu/arm-smmu-v3: Use S2FWB when available > iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS > iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT > iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED > > Nicolin Chen (3): > ACPI/IORT: Support CANWBS memory access flag > iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct > arm_smmu_hw_info > iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user > > drivers/acpi/arm64/iort.c | 13 + > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 398 ++++++++++++++++++-- > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 27 ++ > drivers/iommu/arm/arm-smmu/arm-smmu.c | 16 - > drivers/iommu/io-pgtable-arm.c | 24 +- > drivers/iommu/iommu.c | 10 - > drivers/iommu/iommufd/vfio_compat.c | 7 +- > drivers/vfio/vfio_iommu_type1.c | 12 +- > include/acpi/actbl2.h | 1 + > include/linux/io-pgtable.h | 2 + > include/linux/iommu.h | 54 ++- > include/uapi/linux/iommufd.h | 79 ++++ > include/uapi/linux/vfio.h | 2 +- > 13 files changed, 572 insertions(+), 73 deletions(-) > > > base-commit: e5e288d94186b266b062b3e44c82c285dfe68712 > -- > 2.46.0 >
Hi Jason, On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote: > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field > works. When S2FWB is supported and enabled the IOPTE will force cachable > access to IOMMU_CACHE memory and deny cachable access otherwise. > > This is not especially meaningful for simple S2 domains, it apparently > doesn't even force PCI no-snoop access to be coherent. > > However, when used with a nested S1, FWB has the effect of preventing the > guest from choosing a MemAttr that would cause ordinary DMA to bypass the > cache. Consistent with KVM we wish to deny the guest the ability to become > incoherent with cached memory the hypervisor believes is cachable so we > don't have to flush it. > > Turn on S2FWB whenever the SMMU supports it and use it for all S2 > mappings. I have been looking into this recently from the KVM side as it will use FWB for the CPU stage-2 unconditionally for guests(if supported), however that breaks for non-coherent devices when assigned, and limiting assigned devices to be coherent seems too restrictive. I have been looking into ways to notify KVM from VFIO as early as possible so it can configure the page table properly. But for SMMUv3, S2FWB is per stream, can’t we just use it if the master is DMA coherent? Thanks, Mostafa > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > --- > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 6 ++++++ > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 3 +++ > drivers/iommu/io-pgtable-arm.c | 24 +++++++++++++++++---- > include/linux/io-pgtable.h | 2 ++ > 4 files changed, 31 insertions(+), 4 deletions(-) > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > index 531125f231b662..7fe1e27d11586c 100644 > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > @@ -1612,6 +1612,8 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target, > FIELD_PREP(STRTAB_STE_1_EATS, > ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0)); > > + if (smmu->features & ARM_SMMU_FEAT_S2FWB) > + target->data[1] |= cpu_to_le64(STRTAB_STE_1_S2FWB); > if (smmu->features & ARM_SMMU_FEAT_ATTR_TYPES_OVR) > target->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG, > STRTAB_STE_1_SHCFG_INCOMING)); > @@ -2400,6 +2402,8 @@ static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain, > pgtbl_cfg.oas = smmu->oas; > fmt = ARM_64_LPAE_S2; > finalise_stage_fn = arm_smmu_domain_finalise_s2; > + if (smmu->features & ARM_SMMU_FEAT_S2FWB) > + pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB; > break; > default: > return -EINVAL; > @@ -4189,6 +4193,8 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) > > /* IDR3 */ > reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3); > + if (FIELD_GET(IDR3_FWB, reg)) > + smmu->features |= ARM_SMMU_FEAT_S2FWB; > if (FIELD_GET(IDR3_RIL, reg)) > smmu->features |= ARM_SMMU_FEAT_RANGE_INV; > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h > index 8851a7abb5f0f3..7e8d2f36faebf3 100644 > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h > @@ -55,6 +55,7 @@ > #define IDR1_SIDSIZE GENMASK(5, 0) > > #define ARM_SMMU_IDR3 0xc > +#define IDR3_FWB (1 << 8) > #define IDR3_RIL (1 << 10) > > #define ARM_SMMU_IDR5 0x14 > @@ -258,6 +259,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid) > #define STRTAB_STE_1_S1CSH GENMASK_ULL(7, 6) > > #define STRTAB_STE_1_S1STALLD (1UL << 27) > +#define STRTAB_STE_1_S2FWB (1UL << 25) > > #define STRTAB_STE_1_EATS GENMASK_ULL(29, 28) > #define STRTAB_STE_1_EATS_ABT 0UL > @@ -700,6 +702,7 @@ struct arm_smmu_device { > #define ARM_SMMU_FEAT_ATTR_TYPES_OVR (1 << 20) > #define ARM_SMMU_FEAT_HA (1 << 21) > #define ARM_SMMU_FEAT_HD (1 << 22) > +#define ARM_SMMU_FEAT_S2FWB (1 << 23) > u32 features; > > #define ARM_SMMU_OPT_SKIP_PREFETCH (1 << 0) > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c > index f5d9fd1f45bf49..62bbb6037e1686 100644 > --- a/drivers/iommu/io-pgtable-arm.c > +++ b/drivers/iommu/io-pgtable-arm.c > @@ -106,6 +106,18 @@ > #define ARM_LPAE_PTE_HAP_FAULT (((arm_lpae_iopte)0) << 6) > #define ARM_LPAE_PTE_HAP_READ (((arm_lpae_iopte)1) << 6) > #define ARM_LPAE_PTE_HAP_WRITE (((arm_lpae_iopte)2) << 6) > +/* > + * For !FWB these code to: > + * 1111 = Normal outer write back cachable / Inner Write Back Cachable > + * Permit S1 to override > + * 0101 = Normal Non-cachable / Inner Non-cachable > + * 0001 = Device / Device-nGnRE > + * For S2FWB these code: > + * 0110 Force Normal Write Back > + * 0101 Normal* is forced Normal-NC, Device unchanged > + * 0001 Force Device-nGnRE > + */ > +#define ARM_LPAE_PTE_MEMATTR_FWB_WB (((arm_lpae_iopte)0x6) << 2) > #define ARM_LPAE_PTE_MEMATTR_OIWB (((arm_lpae_iopte)0xf) << 2) > #define ARM_LPAE_PTE_MEMATTR_NC (((arm_lpae_iopte)0x5) << 2) > #define ARM_LPAE_PTE_MEMATTR_DEV (((arm_lpae_iopte)0x1) << 2) > @@ -458,12 +470,16 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data, > */ > if (data->iop.fmt == ARM_64_LPAE_S2 || > data->iop.fmt == ARM_32_LPAE_S2) { > - if (prot & IOMMU_MMIO) > + if (prot & IOMMU_MMIO) { > pte |= ARM_LPAE_PTE_MEMATTR_DEV; > - else if (prot & IOMMU_CACHE) > - pte |= ARM_LPAE_PTE_MEMATTR_OIWB; > - else > + } else if (prot & IOMMU_CACHE) { > + if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_S2FWB) > + pte |= ARM_LPAE_PTE_MEMATTR_FWB_WB; > + else > + pte |= ARM_LPAE_PTE_MEMATTR_OIWB; > + } else { > pte |= ARM_LPAE_PTE_MEMATTR_NC; > + } > } else { > if (prot & IOMMU_MMIO) > pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV > diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h > index f9a81761bfceda..aff9b020b6dcc7 100644 > --- a/include/linux/io-pgtable.h > +++ b/include/linux/io-pgtable.h > @@ -87,6 +87,7 @@ struct io_pgtable_cfg { > * attributes set in the TCR for a non-coherent page-table walker. > * > * IO_PGTABLE_QUIRK_ARM_HD: Enables dirty tracking in stage 1 pagetable. > + * IO_PGTABLE_QUIRK_ARM_S2FWB: Use the FWB format for the MemAttrs bits > */ > #define IO_PGTABLE_QUIRK_ARM_NS BIT(0) > #define IO_PGTABLE_QUIRK_NO_PERMS BIT(1) > @@ -95,6 +96,7 @@ struct io_pgtable_cfg { > #define IO_PGTABLE_QUIRK_ARM_TTBR1 BIT(5) > #define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA BIT(6) > #define IO_PGTABLE_QUIRK_ARM_HD BIT(7) > + #define IO_PGTABLE_QUIRK_ARM_S2FWB BIT(8) > unsigned long quirks; > unsigned long pgsize_bitmap; > unsigned int ias; > -- > 2.46.0 >
On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote: > Hi Jason, > > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote: > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field > > works. When S2FWB is supported and enabled the IOPTE will force cachable > > access to IOMMU_CACHE memory and deny cachable access otherwise. > > > > This is not especially meaningful for simple S2 domains, it apparently > > doesn't even force PCI no-snoop access to be coherent. > > > > However, when used with a nested S1, FWB has the effect of preventing the > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the > > cache. Consistent with KVM we wish to deny the guest the ability to become > > incoherent with cached memory the hypervisor believes is cachable so we > > don't have to flush it. > > > > Turn on S2FWB whenever the SMMU supports it and use it for all S2 > > mappings. > > I have been looking into this recently from the KVM side as it will > use FWB for the CPU stage-2 unconditionally for guests(if supported), > however that breaks for non-coherent devices when assigned, and > limiting assigned devices to be coherent seems too restrictive. kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That concept is only relevant to the SMMU. The issue on the KVM side is you can't put device MMIO into the CPU S2 using S2FWB and Normal Cachable, it will break the MMIO programming model. That isn't "coherency" though. It has to be Normal-NC, which this patch does: https://lore.kernel.org/r/20240224150546.368-4-ankita@nvidia.com > But for SMMUv3, S2FWB is per stream, can’t we just use it if the master > is DMA coherent? Sure, that seems to be a weird corner. Lets add this: @@ -4575,7 +4575,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) /* IDR3 */ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3); - if (FIELD_GET(IDR3_FWB, reg)) + /* + * If for some reason the HW does not support DMA coherency then using + * S2FWB won't work. This will also disable nesting support. + */ + if (FIELD_GET(IDR3_FWB, reg) && + (smmu->features & ARM_SMMU_FEAT_COHERENCY)) smmu->features |= ARM_SMMU_FEAT_S2FWB; if (FIELD_GET(IDR3_RIL, reg)) smmu->features |= ARM_SMMU_FEAT_RANGE_INV; IMHO it would be weird to make HW that has S2FWB but not coherency, but sure let's check it. Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set so we won't even get a chance to ask for a S2 domain. Jason
On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote: > Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set > so we won't even get a chance to ask for a S2 domain. And I should also say that without iommufd something like the DMA API could select a S2 with S2FWB enabled, but all that does is change the encoding of the memattr bits. Requests for !IOMMU_CACHE will still map to non-cacheble IO PTEs like before - just with a different encoding. The only thing at issue is nesting which will end up in the guest as forced cachable - however since VFIO doesn't support non-DMA-coherent devices at all this is not a problem. Jason
On Tue, Aug 20, 2024 at 08:20:32AM +0000, Mostafa Saleh wrote: > > This is the first series in what will be several to complete nesting > > support. At least: > > - IOMMU_RESV_SW_MSI related fixups > > - VIOMMU object support to allow ATS invalidations > > - vCMDQ hypervisor support for direct invalidation queue assignment > > - KVM pinned VMID using VIOMMU for vBTM > > - Cross instance S2 sharing > > - Virtual Machine Structure using VIOMMU (for vMPAM?) > > - Fault forwarding support through IOMMUFD's fault fd for vSVA > > > > It is enough to allow significant amounts of qemu work to progress. > > > Are there any qemu patches to tests this? > As I am confused with some of the user space bits and that would help. I have the qemu patches, but am running some backlogs to keep it updated, and don't have one exactly fitting to test this series. I collected a few remarks from Jason regarding the VIOMMU series. And I am reworking on it. I plan to post a testable QEMU branch with the next VIOMMU version. Will CC you and more folks. Thanks Nicolin
On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote: > On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote: > > Hi Jason, > > > > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote: > > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field > > > works. When S2FWB is supported and enabled the IOPTE will force cachable > > > access to IOMMU_CACHE memory and deny cachable access otherwise. > > > > > > This is not especially meaningful for simple S2 domains, it apparently > > > doesn't even force PCI no-snoop access to be coherent. > > > > > > However, when used with a nested S1, FWB has the effect of preventing the > > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the > > > cache. Consistent with KVM we wish to deny the guest the ability to become > > > incoherent with cached memory the hypervisor believes is cachable so we > > > don't have to flush it. > > > > > > Turn on S2FWB whenever the SMMU supports it and use it for all S2 > > > mappings. > > > > I have been looking into this recently from the KVM side as it will > > use FWB for the CPU stage-2 unconditionally for guests(if supported), > > however that breaks for non-coherent devices when assigned, and > > limiting assigned devices to be coherent seems too restrictive. > > kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That > concept is only relevant to the SMMU. > Why not? That would be a problem if a device is not dma coherent, and the VM knows that and maps it’s DMA memory as non cacheable. But it would be overridden by FWB in stage-2 to be cacheable, it would lead to coherency issues. > The issue on the KVM side is you can't put device MMIO into the CPU S2 > using S2FWB and Normal Cachable, it will break the MMIO programming > model. That isn't "coherency" though.> > It has to be Normal-NC, which this patch does: > > https://lore.kernel.org/r/20240224150546.368-4-ankita@nvidia.com Yes, that also breaks (although I think this is an easier problem to solve) > > > But for SMMUv3, S2FWB is per stream, can’t we just use it if the master > > is DMA coherent? > > Sure, that seems to be a weird corner. Lets add this: > > @@ -4575,7 +4575,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) > > /* IDR3 */ > reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3); > - if (FIELD_GET(IDR3_FWB, reg)) > + /* > + * If for some reason the HW does not support DMA coherency then using > + * S2FWB won't work. This will also disable nesting support. > + */ > + if (FIELD_GET(IDR3_FWB, reg) && > + (smmu->features & ARM_SMMU_FEAT_COHERENCY)) > smmu->features |= ARM_SMMU_FEAT_S2FWB; > if (FIELD_GET(IDR3_RIL, reg)) > smmu->features |= ARM_SMMU_FEAT_RANGE_INV; > > IMHO it would be weird to make HW that has S2FWB but not coherency, > but sure let's check it. > What I mean is the master itself not the SMMU (the SID basically), so in that case the STE shouldn’t have FWB enabled. > Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set > so we won't even get a chance to ask for a S2 domain. Oh, I think that is only for the SMMU, not for the master, the SMMU can be coherent (for pte, ste …) but the master can still be non coherent. Looking at how VFIO uses it, that seems to be a bug? Thanks, Mostafa > > Jason
On Tue, Aug 20, 2024 at 07:52:53PM +0000, Mostafa Saleh wrote: > On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote: > > On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote: > > > Hi Jason, > > > > > > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote: > > > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field > > > > works. When S2FWB is supported and enabled the IOPTE will force cachable > > > > access to IOMMU_CACHE memory and deny cachable access otherwise. > > > > > > > > This is not especially meaningful for simple S2 domains, it apparently > > > > doesn't even force PCI no-snoop access to be coherent. > > > > > > > > However, when used with a nested S1, FWB has the effect of preventing the > > > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the > > > > cache. Consistent with KVM we wish to deny the guest the ability to become > > > > incoherent with cached memory the hypervisor believes is cachable so we > > > > don't have to flush it. > > > > > > > > Turn on S2FWB whenever the SMMU supports it and use it for all S2 > > > > mappings. > > > > > > I have been looking into this recently from the KVM side as it will > > > use FWB for the CPU stage-2 unconditionally for guests(if supported), > > > however that breaks for non-coherent devices when assigned, and > > > limiting assigned devices to be coherent seems too restrictive. > > > > kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That > > concept is only relevant to the SMMU. > > Why not? That would be a problem if a device is not dma coherent, > and the VM knows that and maps it’s DMA memory as non cacheable. > But it would be overridden by FWB in stage-2 to be cacheable, > it would lead to coherency issues. Oh, from that perspective yes, but the entire point of S2FWB is that VM's can not create non-coherent access so it is a bit nonsense to ask for both S2FWB and try to assign a non-DMA coherent device. > Yes, that also breaks (although I think this is an easier problem to > solve) Well, it is easy to solve, just don't use S2FWB and manually flush the caches before the hypervisor touches any memory. :) > What I mean is the master itself not the SMMU (the SID basically), > so in that case the STE shouldn’t have FWB enabled. That doesn't matter, those cases will not pass in IOMMU_CACHE and they will work fine with S2FWB turned on. > > Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set > > so we won't even get a chance to ask for a S2 domain. > > Oh, I think that is only for the SMMU, not for the master, the > SMMU can be coherent (for pte, ste …) but the master can still be > non coherent. Looking at how VFIO uses it, that seems to be a bug? If there are mixes of SMMU feature and dev_is_dma_coherent() then it would be a bug yes.. I recall we started out trying to use dev_is_dma_coherent() but Christoph explained it doesn't work that generally: https://lore.kernel.org/kvm/20220406135150.GA21532@lst.de/ Seems we sort of gave up on it, too complicated. Robin had a nice observation of the complexity: Disregarding the complete disaster of PCIe No Snoop on Arm-Based systems, there's the more interesting effectively-opposite scenario where an SMMU bridges non-coherent devices to a coherent interconnect. It's not something we take advantage of yet in Linux, and it can only be properly described in ACPI, but there do exist situations where IOMMU_CACHE is capable of making the device's traffic snoop, but dev_is_dma_coherent() - and device_get_dma_attr() for external users - would still say non-coherent because they can't assume that the SMMU is enabled and programmed in just the right way. Anyhow, for the purposes of KVM and VFIO, devices that don't work with IOMMU_CACHE are not allowed. From an API perspective IOMMU_CAP_CACHE_COHERENCY is supposed to return if the struct device can use IOMMU_CACHE. The corner case where we have a ARM_SMMU_FEAT_COHERENCY SMMU but somehow specific devices don't support IOMMU_CACHE is not properly reflected in IOMMU_CAP_CACHE_COHERENCY. I don't know how to fix that, and we've been ignoring it for a long time now :) Jason
On Tue, Aug 20, 2024 at 05:21:38PM -0300, Jason Gunthorpe wrote: > On Tue, Aug 20, 2024 at 07:52:53PM +0000, Mostafa Saleh wrote: > > On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote: > > > On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote: > > > > Hi Jason, > > > > > > > > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote: > > > > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field > > > > > works. When S2FWB is supported and enabled the IOPTE will force cachable > > > > > access to IOMMU_CACHE memory and deny cachable access otherwise. > > > > > > > > > > This is not especially meaningful for simple S2 domains, it apparently > > > > > doesn't even force PCI no-snoop access to be coherent. > > > > > > > > > > However, when used with a nested S1, FWB has the effect of preventing the > > > > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the > > > > > cache. Consistent with KVM we wish to deny the guest the ability to become > > > > > incoherent with cached memory the hypervisor believes is cachable so we > > > > > don't have to flush it. > > > > > > > > > > Turn on S2FWB whenever the SMMU supports it and use it for all S2 > > > > > mappings. > > > > > > > > I have been looking into this recently from the KVM side as it will > > > > use FWB for the CPU stage-2 unconditionally for guests(if supported), > > > > however that breaks for non-coherent devices when assigned, and > > > > limiting assigned devices to be coherent seems too restrictive. > > > > > > kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That > > > concept is only relevant to the SMMU. > > > > Why not? That would be a problem if a device is not dma coherent, > > and the VM knows that and maps it’s DMA memory as non cacheable. > > But it would be overridden by FWB in stage-2 to be cacheable, > > it would lead to coherency issues. > > Oh, from that perspective yes, but the entire point of S2FWB is that > VM's can not create non-coherent access so it is a bit nonsense to ask > for both S2FWB and try to assign a non-DMA coherent device. Yes, but KVM sets FWB unconditionally and would use cacheable mapping for stage-2, and I expect the same for the nested SMMU. > > > Yes, that also breaks (although I think this is an easier problem to > > solve) > > Well, it is easy to solve, just don't use S2FWB and manually flush the > caches before the hypervisor touches any memory. :) Yes, although that means virtualized devices would have worse performance :/ but I guess there is nothing more to do here. I have some ideas about that, I can send patches to the kvm list as an RFC. > > > What I mean is the master itself not the SMMU (the SID basically), > > so in that case the STE shouldn’t have FWB enabled. > > That doesn't matter, those cases will not pass in IOMMU_CACHE and they > will work fine with S2FWB turned on. > But that won’t be the case in nested? Otherwise why we use FWB in the first place. > > > Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set > > > so we won't even get a chance to ask for a S2 domain. > > > > Oh, I think that is only for the SMMU, not for the master, the > > SMMU can be coherent (for pte, ste …) but the master can still be > > non coherent. Looking at how VFIO uses it, that seems to be a bug? > > If there are mixes of SMMU feature and dev_is_dma_coherent() then it > would be a bug yes.. > I think there is a bug, I was able to assign a “non-coherent” device with VFIO with no issues, and it allows it as long as the SMMU is coherent. > I recall we started out trying to use dev_is_dma_coherent() but > Christoph explained it doesn't work that generally: > > https://lore.kernel.org/kvm/20220406135150.GA21532@lst.de/ > > Seems we sort of gave up on it, too complicated. Robin had a nice > observation of the complexity: > > Disregarding the complete disaster of PCIe No Snoop on Arm-Based > systems, there's the more interesting effectively-opposite scenario > where an SMMU bridges non-coherent devices to a coherent interconnect. > It's not something we take advantage of yet in Linux, and it can only be > properly described in ACPI, but there do exist situations where > IOMMU_CACHE is capable of making the device's traffic snoop, but > dev_is_dma_coherent() - and device_get_dma_attr() for external users - > would still say non-coherent because they can't assume that the SMMU is > enabled and programmed in just the right way. > > Anyhow, for the purposes of KVM and VFIO, devices that don't work with > IOMMU_CACHE are not allowed. From an API perspective > IOMMU_CAP_CACHE_COHERENCY is supposed to return if the struct device > can use IOMMU_CACHE. > > The corner case where we have a ARM_SMMU_FEAT_COHERENCY SMMU but > somehow specific devices don't support IOMMU_CACHE is not properly > reflected in IOMMU_CAP_CACHE_COHERENCY. I don't know how to fix that, > and we've been ignoring it for a long time now :) Thanks a lot for the extra context! Maybe the SMMUv3 .capable, should be changed to check if the device is coherent (instead of using dev_is_dma_coherent, it can use lower level functions from the supported buses) Also, I think supporting IOMMU_CACHE is not enough, as the SMMU can support it but the device is still not coherent. Thanks, Mostafa > > Jason
On Wed, Aug 21, 2024 at 09:53:33AM +0000, Mostafa Saleh wrote: > > Oh, from that perspective yes, but the entire point of S2FWB is that > > VM's can not create non-coherent access so it is a bit nonsense to ask > > for both S2FWB and try to assign a non-DMA coherent device. > > Yes, but KVM sets FWB unconditionally and would use cacheable mapping > for stage-2, and I expect the same for the nested SMMU. Yes, you'd need some kind of handshake like Intel built for their GPU cache incoherence to turn that off. > > > What I mean is the master itself not the SMMU (the SID basically), > > > so in that case the STE shouldn’t have FWB enabled. > > > > That doesn't matter, those cases will not pass in IOMMU_CACHE and they > > will work fine with S2FWB turned on. > > But that won’t be the case in nested? Otherwise why we use FWB in the > first place. Right, without KVM support for guest cachability selection and cache flushing in VFIO, is infeasible to allow non-coherent devices. It is a medium sized problem if someone wants to tackle it. > Maybe the SMMUv3 .capable, should be changed to check if the device is > coherent (instead of using dev_is_dma_coherent, it can use lower level > functions from the supported buses) That would be the fix I expect. Either SMMUv3 does it, or the core code adds it on top in the .capable wrapper. It makes sense to me that the iommu driver should be aware of per-master coherence capability. > Also, I think supporting IOMMU_CACHE is not enough, as the SMMU can > support it but the device is still not coherent. IOMMU_CACHE is defined as requiring no cache maintenance on that memory. If specific devices can't guarentee that then IOMMU_CACHE should not be used on those devices and IOMMU_CAP_CACHE_COHERENCY for that device should be false. That is what I mean by support. Anyhow, I'm going to continue to leave this problem alone for nesting. Nothing gets worse by adding nesting on top of this. Even if we wrongly permit VFIO to open non-coherent devices they won't actually work correctly (VFIO forces IOMMU_CACHE and S2FWB). Most likely anything trying to use them will just crash/malfunction due to missing cache flushing. Jason