
[0/7] dma-mapping: Clean up arch_setup_dma_ops()

Message ID cover.1701268753.git.robin.murphy@arm.com

Message

Robin Murphy Nov. 29, 2023, 5:42 p.m. UTC
Hi all,

Prompted by Jason's proposal[1], here's a first step towards truly
unpicking the dma_configure vs. IOMMU mess. As I commented before, we
have an awful lot of accumulated cruft and technical debt here making
things more complicated than they need to be, and we already have hacks
on top of hacks trying to work around it, so polishing those hacks even
further is really not a desirable direction of travel. And I do know
they're hacks, because I wrote most of them and still remember enough of
the context of the time ;)

I'm taking a methodical bottom-up approach here, so step 1 is cleaning
*all* the out-of-date stuff from arch_setup_dma_ops() and simplifying
that interface, which gets it right out of the way for the next step of
pulling apart {of,acpi}_dma_configure(). This part is really a
dma-mapping series, but I'm not sure yet if it would need to target the
IOMMU tree - nothing here should strictly depend on the pending IOMMU
change, but the follow-up patches might. Still working on those, so
hopefully I'll know soon...

Thanks,
Robin.

[1] https://lore.kernel.org/linux-iommu/0-v1-720585788a7d+811b-iommu_fwspec_p1_jgg@nvidia.com/
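
For a rough idea of where this is heading, the interface change implied by
the final patch looks something like the below - a sketch based on the patch
titles, not the final form, and the iommu_ops argument is already on its way
out via [1]:

/* Current prototype, with the legacy range and IOMMU arguments */
void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
			const struct iommu_ops *iommu, bool coherent);

/* After the cleanup: only the coherency hint remains, with any range
 * information coming from dev->dma_range_map instead */
void arch_setup_dma_ops(struct device *dev, bool coherent);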


Robin Murphy (7):
  OF: Retire dma-ranges mask workaround
  OF: Simplify DMA range calculations
  ACPI/IORT: Handle memory address size limits as limits
  dma-mapping: Add helpers for dma_range_map bounds
  iommu/dma: Make limit checks self-contained
  iommu/dma: Centralise iommu_setup_dma_ops()
  dma-mapping: Simplify arch_setup_dma_ops()

 arch/arc/mm/dma.c               |  3 +--
 arch/arm/mm/dma-mapping-nommu.c |  3 +--
 arch/arm/mm/dma-mapping.c       | 12 ++++++----
 arch/arm64/mm/dma-mapping.c     |  5 +---
 arch/loongarch/kernel/dma.c     |  9 ++-----
 arch/mips/mm/dma-noncoherent.c  |  3 +--
 arch/riscv/mm/dma-noncoherent.c |  3 +--
 drivers/acpi/arm64/dma.c        | 17 ++++---------
 drivers/acpi/arm64/iort.c       | 18 +++++++-------
 drivers/acpi/scan.c             |  3 +--
 drivers/hv/hv_common.c          |  6 +----
 drivers/iommu/amd/iommu.c       |  8 -------
 drivers/iommu/dma-iommu.c       | 35 ++++++++++-----------------
 drivers/iommu/dma-iommu.h       |  6 +++++
 drivers/iommu/intel/iommu.c     |  7 ------
 drivers/iommu/iommu.c           |  2 ++
 drivers/iommu/s390-iommu.c      |  6 -----
 drivers/iommu/virtio-iommu.c    | 10 --------
 drivers/of/device.c             | 42 ++++++---------------------------
 include/linux/acpi_iort.h       |  4 ++--
 include/linux/dma-direct.h      | 18 ++++++++++++++
 include/linux/dma-map-ops.h     |  6 ++---
 include/linux/iommu.h           |  7 ------
 23 files changed, 78 insertions(+), 155 deletions(-)

Comments

Jason Gunthorpe Nov. 29, 2023, 8:36 p.m. UTC | #1
On Wed, Nov 29, 2023 at 05:42:57PM +0000, Robin Murphy wrote:
> Hi all,
> 
> Prompted by Jason's proposal[1], here's a first step towards truly
> unpicking the dma_configure vs. IOMMU mess. As I commented before, we
> have an awful lot of accumulated cruft and technical debt here making
> things more complicated than they need to be, and we already have hacks
> on top of hacks trying to work around it, so polishing those hacks even
> further is really not a desirable direction of travel. And I do know
> they're hacks, because I wrote most of them and still remember enough of
> the context of the time ;)

I quite like this, I was also looking at getting rid of those other
parameters.

I wanted to take smaller steps because it is all pretty hairy.

One thing that still concerns me is that if the FW data restricts the
valid IOVA window, that really should be reflected into the reserved
ranges and not just dumped into the struct device for use by the DMA API.

Or, perhaps, vfio/iommufd should be using the struct device data to
generate some additional reserved ranges?

Either way, I would like to see the dma_iommu and the rest of the
subsystem agree on what the valid IOVA ranges actually are.
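
As a rough sketch of what reflecting those limits into the reserved ranges
could look like - nothing like this exists today, it just reuses the
standard get_resv_regions plumbing, with dma_range_map_min() being one of
the bounds helpers this series adds; deriving a region from
dev->dma_range_map is the hypothetical part:

/*
 * Sketch only: publish the firmware-described DMA limits as a reserved
 * region, so iommu-dma, vfio and iommufd all see the same picture.
 */
static void dma_limits_get_resv_regions(struct device *dev,
					struct list_head *list)
{
	const struct bus_dma_region *map = dev->dma_range_map;
	struct iommu_resv_region *region;
	dma_addr_t lo;

	if (!map)
		return;

	lo = dma_range_map_min(map);
	if (!lo)
		return;

	/* Reserve the IOVA space below the lowest address the device can reach */
	region = iommu_alloc_resv_region(0, lo, 0, IOMMU_RESV_RESERVED,
					 GFP_KERNEL);
	if (region)
		list_add_tail(&region->list, list);
}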

Jason
Jason Gunthorpe Nov. 29, 2023, 8:43 p.m. UTC | #2
On Wed, Nov 29, 2023 at 05:43:02PM +0000, Robin Murphy wrote:
> It's now easy to retrieve the device's DMA limits if we want to check
> them against the domain aperture, so do that ourselves instead of
> relying on them being passed through the callchain.
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  drivers/iommu/dma-iommu.c | 18 ++++++++----------
>  1 file changed, 8 insertions(+), 10 deletions(-)

When I spent some time noodling on this a few weeks ago, I was looking
at exposing dma_range_map_min() effectively as a new reserved region in
the common reserved-region code, so that it naturally flows out to all
the right places.

But this is no worse in that regard than what we have right now:

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

>  	/* Check the domain allows at least some access to the device... */
> -	if (domain->geometry.force_aperture) {
> +	if (map) {

Oh, I've been sitting on a patch to delete force_aperture now too..
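
For reference, a rough sketch of the shape of that self-contained check,
assuming the dma_range_map bounds helpers from patch 4 - this is not the
literal patch:

/*
 * Sketch: fetch the device's limits from its dma_range_map rather than
 * from dma_base/size handed down the callchain, and compare them against
 * the domain aperture directly.
 */
static int iommu_dma_check_aperture(struct device *dev,
				    struct iommu_domain *domain)
{
	const struct bus_dma_region *map = dev->dma_range_map;

	if (!map)
		return 0;

	/* Check the domain allows at least some access to the device... */
	if (dma_range_map_min(map) > domain->geometry.aperture_end ||
	    dma_range_map_max(map) < domain->geometry.aperture_start) {
		dev_err(dev, "Domain aperture cannot map the device's DMA range\n");
		return -EFAULT;
	}
	return 0;
}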

Jason
Jason Gunthorpe Nov. 30, 2023, 12:39 a.m. UTC | #3
On Wed, Nov 29, 2023 at 05:43:00PM +0000, Robin Murphy wrote:
> Return the Root Complex/Named Component memory address size limit as an
> inclusive limit value, rather than an exclusive size.  This saves us
> having to special-case 64-bit overflow, and simplifies our caller too.
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  drivers/acpi/arm64/dma.c  |  9 +++------
>  drivers/acpi/arm64/iort.c | 18 ++++++++----------
>  include/linux/acpi_iort.h |  4 ++--
>  3 files changed, 13 insertions(+), 18 deletions(-)
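
As a small worked example of the overflow the quoted description refers to
(illustrative values only, not code from the patch):

/*
 * A device with N address bits has an exclusive size of 1 << N, which
 * overflows u64 when N == 64 and forces a special case on every
 * consumer; the inclusive limit (1 << N) - 1 is simply U64_MAX for
 * N == 64, so downstream code can just compare or min() it directly.
 */
static u64 addr_bits_to_limit(unsigned int addr_bits)
{
	return addr_bits < 64 ? (1ULL << addr_bits) - 1 : U64_MAX;
}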

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Rob Herring Nov. 30, 2023, 2:46 p.m. UTC | #4
On Wed, Nov 29, 2023 at 11:43 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> From what I remember, the fixup adding 1 to the dma-ranges size was for
> the benefit of some early AMD Seattle DTs. Those are likely extinct by
> now, and anyone else who might have deserved to get the message has
> hopefully seen the warning in the 9 years we've had it there. The modern
> dma_range_map mechanism should happily handle odd-sized ranges with no
> ill effect, so there's little need to care anyway now. Clean it up.

The commit has a Tested-by for Seattle, but the series adding this was
for an issue on TI Keystone[1]. It looks like the patch adding this fixup
and warning did two things: it added 1 to the default mask when
'dma-ranges' was not present (which Keystone needed), and it added 1,
along with the warning, if the DT value was a mask. It's not clear what
Seattle needed, but there was a fix to its dma-ranges about a year
later[2].

I thought at some point we allowed 32-bit DTs to specify a ~0 size to
avoid having to use 2 cells to express a 4G size, which wouldn't have
triggered the warning, but I can't find any discussion on that. It would
have been earlier than 2015, I think... Anyway, there is no upstream dts
with that either, so I think we're good.

> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  drivers/of/device.c | 16 ----------------
>  1 file changed, 16 deletions(-)

Acked-by: Rob Herring <robh@kernel.org>

[1] https://lore.kernel.org/all/1425405134-24707-1-git-send-email-m-karicheri2@ti.com/
[2] https://lore.kernel.org/all/1455162671-16044-4-git-send-email-Suravee.Suthikulpanit@amd.com/
Rob Herring Nov. 30, 2023, 2:56 p.m. UTC | #5
On Wed, Nov 29, 2023 at 11:43 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> Juggling start, end, and size values for a range is somewhat redundant
> and a little hard to follow. Consolidate down to just using inclusive
> start and end, which saves us worrying about size overflows for full
> 64-bit ranges (note that passing a potentially-overflowed value through
> to arch_setup_dma_ops() is benign for all current implementations, and
> this is working towards removing that anyway).
>
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
>  drivers/of/device.c | 19 ++++++++-----------
>  1 file changed, 8 insertions(+), 11 deletions(-)

Acked-by: Rob Herring <robh@kernel.org>
Robin Murphy Dec. 1, 2023, 1:07 p.m. UTC | #6
On 29/11/2023 8:36 pm, Jason Gunthorpe wrote:
> On Wed, Nov 29, 2023 at 05:42:57PM +0000, Robin Murphy wrote:
>> Hi all,
>>
>> Prompted by Jason's proposal[1], here's a first step towards truly
>> unpicking the dma_configure vs. IOMMU mess. As I commented before, we
>> have an awful lot of accumulated cruft and technical debt here making
>> things more complicated than they need to be, and we already have hacks
>> on top of hacks trying to work around it, so polishing those hacks even
>> further is really not a desirable direction of travel. And I do know
>> they're hacks, because I wrote most of them and still remember enough of
>> the context of the time ;)
> 
> I quite like this, I was also looking at getting rid of those other
> parameters.
> 
> I wanted to take smaller steps because it is all pretty hairy.
> 
> One thing that still concerns me is that if the FW data restricts the
> valid IOVA window, that really should be reflected into the reserved
> ranges and not just dumped into the struct device for use by the DMA API.
> 
> Or, perhaps, vfio/iommufd should be using the struct device data to
> generate some additional reserved ranges?
> 
> Either way, I would like to see the dma_iommu and the rest of the
> subsystem agree on what the valid IOVA ranges actually are.

Note that there is some intentional divergence where iommu-dma reserves 
IOVAs matching PCI outbound windows because it knows it wants to avoid 
clashing with potential peer-to-peer addresses and doesn't want to have 
to get into the details of ACS redirect etc., but we don't expose those 
as generic reserved regions because they're firmly a property of the PCI 
host bridge, not of the IOMMU group (and more practically, because we 
did do so briefly and it made QEMU unhappy). I think there may also have 
been some degree of conclusion that it's not the IOMMU API's place to 
get in the way of other domain users trying to do weird P2P stuff if 
they really want to.
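
Roughly, that carve-out looks like this (a sketch, not the exact dma-iommu
code):

static void reserve_pci_outbound_windows(struct pci_dev *pdev,
					 struct iova_domain *iovad)
{
	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
	struct resource_entry *window;

	resource_list_for_each_entry(window, &bridge->windows) {
		if (resource_type(window->res) != IORESOURCE_MEM)
			continue;
		/* Avoid IOVAs which may alias potential peer-to-peer addresses */
		reserve_iova(iovad,
			     iova_pfn(iovad, window->res->start - window->offset),
			     iova_pfn(iovad, window->res->end - window->offset));
	}
}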

Another issue is that the generic dma_range_map strictly represents 
device-specific constraints which may not always be desirable or 
appropriate to apply to a whole group. There wasn't really a conscious 
decision as such, but it kind of works out as why we still only consider 
PCI's bridge->dma_ranges (which comes from the same underlying data), 
since we can at least assume every device behind a bridge accesses 
memory through that bridge and so inherits its restrictions. However I 
don't recall any conscious decision for inbound windows to only be 
considered for DMA domain reservations rather than for proper reserved 
regions - pretty sure that's just a case of that code being added in the 
place where it seemed to fit best at the time (because hey it's more 
host bridge windows and we already have a thing for host bridge windows...)

Thanks,
Robin.
Jason Gunthorpe Dec. 1, 2023, 1:57 p.m. UTC | #7
On Fri, Dec 01, 2023 at 01:07:36PM +0000, Robin Murphy wrote:
> On 29/11/2023 8:36 pm, Jason Gunthorpe wrote:
> > On Wed, Nov 29, 2023 at 05:42:57PM +0000, Robin Murphy wrote:
> > > Hi all,
> > > 
> > > Prompted by Jason's proposal[1], here's a first step towards truly
> > > unpicking the dma_configure vs. IOMMU mess. As I commented before, we
> > > have an awful lot of accumulated cruft and technical debt here making
> > > things more complicated than they need to be, and we already have hacks
> > > on top of hacks trying to work around it, so polishing those hacks even
> > > further is really not a desirable direction of travel. And I do know
> > > they're hacks, because I wrote most of them and still remember enough of
> > > the context of the time ;)
> > 
> > I quite like this, I was also looking at getting rid of those other
> > parameters.
> > 
> > I wanted to take smaller steps because it is all pretty hairy.
> > 
> > One thing that still concerns me is that if the FW data restricts the
> > valid IOVA window, that really should be reflected into the reserved
> > ranges and not just dumped into the struct device for use by the DMA API.
> > 
> > Or, perhaps, vfio/iommufd should be using the struct device data to
> > generate some additional reserved ranges?
> > 
> > Either way, I would like to see the dma_iommu and the rest of the
> > subsystem agree on what the valid IOVA ranges actually are.
> 
> Note that there is some intentional divergence where iommu-dma reserves
> IOVAs matching PCI outbound windows because it knows it wants to avoid
> clashing with potential peer-to-peer addresses and doesn't want to have to
> get into the details of ACS redirect etc., but we don't expose those as
> generic reserved regions because they're firmly a property of the PCI host
> bridge, not of the IOMMU group (and more practically, because we did do so
> briefly and it made QEMU unhappy). I think there may also have been some
> degree of conclusion that it's not the IOMMU API's place to get in the way
> of other domain users trying to do weird P2P stuff if they really want to.

I'm not sure this is the fully correct conclusion - e.g. if today we
take a NIC on a non-ACS topology and run DPDK through VFIO, there is a
chance of failure because some IOVAs simply cannot be used by DPDK for
DMA at all.

QEMU and KVM are a different situation and want different things. E.g.
they would want to identity-map the PCI BAR spaces to the IOVAs they
are claiming.

It should still somehow carve out any other IOVA that is unusable due
to guest-invisible ACS and reflect it through FW tables into the VM.

I'm starting to see people build non-ACS systems and want it to work
with VFIO and I'm a little worried we have been too loose here.

> bridge and so inherits its restrictions. However I don't recall any
> conscious decision for inbound windows to only be considered for DMA domain
> reservations rather than for proper reserved regions - pretty sure that's
> just a case of that code being added in the place where it seemed to fit
> best at the time (because hey it's more host bridge windows and we already
> have a thing for host bridge windows...)

Yeah, and I don't think anyone actually cared much..

At least as a step, it would be nice if the DMA-API-only restrictions
could come out as a special type of reserved region. Then the caller
could decide whether to follow them or not; iommufd could provide an
opt-in API to DPDK that matches the DMA API's safe IOVA allocator.
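
A purely hypothetical sketch of what that could look like - neither this
region type nor the iommufd opt-in exists today:

/* Hypothetical "advisory" region type for DMA-API-only restrictions;
 * it does not exist in enum iommu_resv_type. Generic code could publish
 * dma_range_map-derived limits with this type alongside the hard
 * reservations, leaving each caller to decide whether to honour them. */
#define IOMMU_RESV_DMA_API_ONLY		(IOMMU_RESV_SW_MSI + 1)

static bool resv_region_is_mandatory(const struct iommu_resv_region *region)
{
	/* DMA-API-only regions are advisory; everything else must be avoided */
	return region->type != IOMMU_RESV_DMA_API_ONLY;
}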

Jason