
[00/16] IOMMU memory observability

Message ID 20231128204938.1453583-1-pasha.tatashin@soleen.com

Message

Pasha Tatashin Nov. 28, 2023, 8:49 p.m. UTC
From: Pasha Tatashin <tatashin@google.com>

The IOMMU subsystem may hold state measured in gigabytes, the majority
of it in IOMMU page tables. Yet there is currently no way to observe
how much memory is actually used by the IOMMU subsystem.

This patch series solves this problem by adding both observability of
all pages that are allocated by the IOMMU, and accountability, so that
admins can limit the amount via cgroups.

The system-wide observability is using /proc/meminfo:
SecPageTables:    438176 kB

Contains IOMMU and KVM memory.

Per-node observability:
/sys/devices/system/node/nodeN/meminfo
Node N SecPageTables:    422204 kB

Contains IOMMU and KVM memory in the given NUMA node.

Per-node IOMMU only observability:
/sys/devices/system/node/nodeN/vmstat
nr_iommu_pages 105555

Contains number of pages IOMMU allocated in the given node.

Accountability: using sec_pagetables cgroup-v2 memory.stat entry.
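
For reference, a minimal user-space sketch of how the counters above can
be read together (the paths and field names are the ones shown in this
cover letter; the tiny parser, the node-0 choice, and the page-size
conversion are illustrative assumptions, not part of the series):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the first integer that follows 'field' in 'path', or -1. */
static long read_counter(const char *path, const char *field)
{
	char line[256];
	long val = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		char *p = strstr(line, field);

		if (p) {
			sscanf(p, "%*[^0-9]%ld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long page_kb = sysconf(_SC_PAGESIZE) / 1024;
	/* IOMMU + KVM secondary page tables on node 0, in kB */
	long sec_kb = read_counter("/sys/devices/system/node/node0/meminfo",
				   "SecPageTables");
	/* pages allocated by the IOMMU on node 0 */
	long iommu_pages = read_counter("/sys/devices/system/node/node0/vmstat",
					"nr_iommu_pages");

	printf("node0 SecPageTables: %ld kB, of which IOMMU: %ld kB\n",
	       sec_kb, iommu_pages * page_kb);
	return 0;
}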

With the change, iova_stress [1] stops once the cgroup limit is reached:

# ./iova_stress
iova space:     0T      free memory:   497G
iova space:     1T      free memory:   495G
iova space:     2T      free memory:   493G
iova space:     3T      free memory:   491G

This series incorporates suggestions that came from the discussion
at LPC [2].

[1] https://github.com/soleen/iova_stress
[2] https://lpc.events/event/17/contributions/1466

Pasha Tatashin (16):
  iommu/vt-d: add wrapper functions for page allocations
  iommu/amd: use page allocation function provided by iommu-pages.h
  iommu/io-pgtable-arm: use page allocation function provided by
    iommu-pages.h
  iommu/io-pgtable-dart: use page allocation function provided by
    iommu-pages.h
  iommu/io-pgtable-arm-v7s: use page allocation function provided by
    iommu-pages.h
  iommu/dma: use page allocation function provided by iommu-pages.h
  iommu/exynos: use page allocation function provided by iommu-pages.h
  iommu/fsl: use page allocation function provided by iommu-pages.h
  iommu/iommufd: use page allocation function provided by iommu-pages.h
  iommu/rockchip: use page allocation function provided by iommu-pages.h
  iommu/sun50i: use page allocation function provided by iommu-pages.h
  iommu/tegra-smmu: use page allocation function provided by
    iommu-pages.h
  iommu: observability of the IOMMU allocations
  iommu: account IOMMU allocated memory
  vhost-vdpa: account iommu allocations
  vfio: account iommu allocations

 Documentation/admin-guide/cgroup-v2.rst |   2 +-
 Documentation/filesystems/proc.rst      |   4 +-
 drivers/iommu/amd/amd_iommu.h           |   8 -
 drivers/iommu/amd/init.c                |  91 +++++-----
 drivers/iommu/amd/io_pgtable.c          |  13 +-
 drivers/iommu/amd/io_pgtable_v2.c       |  20 +-
 drivers/iommu/amd/iommu.c               |  13 +-
 drivers/iommu/dma-iommu.c               |   8 +-
 drivers/iommu/exynos-iommu.c            |  14 +-
 drivers/iommu/fsl_pamu.c                |   5 +-
 drivers/iommu/intel/dmar.c              |  10 +-
 drivers/iommu/intel/iommu.c             |  47 ++---
 drivers/iommu/intel/iommu.h             |   2 -
 drivers/iommu/intel/irq_remapping.c     |  10 +-
 drivers/iommu/intel/pasid.c             |  12 +-
 drivers/iommu/intel/svm.c               |   7 +-
 drivers/iommu/io-pgtable-arm-v7s.c      |   9 +-
 drivers/iommu/io-pgtable-arm.c          |   7 +-
 drivers/iommu/io-pgtable-dart.c         |  37 ++--
 drivers/iommu/iommu-pages.h             | 231 ++++++++++++++++++++++++
 drivers/iommu/iommufd/iova_bitmap.c     |   6 +-
 drivers/iommu/rockchip-iommu.c          |  14 +-
 drivers/iommu/sun50i-iommu.c            |   7 +-
 drivers/iommu/tegra-smmu.c              |  18 +-
 drivers/vfio/vfio_iommu_type1.c         |   8 +-
 drivers/vhost/vdpa.c                    |   3 +-
 include/linux/mmzone.h                  |   5 +-
 mm/vmstat.c                             |   3 +
 28 files changed, 415 insertions(+), 199 deletions(-)
 create mode 100644 drivers/iommu/iommu-pages.h

Comments

Pasha Tatashin Nov. 28, 2023, 10:31 p.m. UTC | #1
On Tue, Nov 28, 2023 at 4:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Nov 28, 2023 at 12:49 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > From: Pasha Tatashin <tatashin@google.com>
> >
> > The IOMMU subsystem may hold state measured in gigabytes, the majority
> > of it in IOMMU page tables. Yet there is currently no way to observe
> > how much memory is actually used by the IOMMU subsystem.
> >
> > This patch series solves this problem by adding both observability of
> > all pages that are allocated by the IOMMU, and accountability, so that
> > admins can limit the amount via cgroups.
> >
> > The system-wide observability is using /proc/meminfo:
> > SecPageTables:    438176 kB
> >
> > Contains IOMMU and KVM memory.
> >
> > Per-node observability:
> > /sys/devices/system/node/nodeN/meminfo
> > Node N SecPageTables:    422204 kB
> >
> > Contains IOMMU and KVM memory in the given NUMA node.
> >
> > Per-node IOMMU only observability:
> > /sys/devices/system/node/nodeN/vmstat
> > nr_iommu_pages 105555
> >
> > Contains number of pages IOMMU allocated in the given node.
>
> Does it make sense to have a KVM-only entry there as well?
>
> In that case, if SecPageTables in /proc/meminfo is found to be
> suspiciously high, it should be easy to tell which component is
> contributing most usage through vmstat. I understand that users can do
> the subtraction, but we wouldn't want userspace depending on that, in
> case a third class of "secondary" page tables emerges that we want to
> add to SecPageTables. The in-kernel implementation can do the
> subtraction for now if it makes sense though.

Hi Yosry,

Yes, another counter for KVM could be added. On the other hand, the
KVM-only value can be computed by subtracting one counter from the
other, as there are only two types of secondary page tables, KVM and
IOMMU:

/sys/devices/system/node/node0/meminfo
Node 0 SecPageTables:    422204 kB

 /sys/devices/system/node/nodeN/vmstat
nr_iommu_pages 105555

KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024

Pasha

>
> >
> > Accountability: using sec_pagetables cgroup-v2 memory.stat entry.
> >
> > With the change, iova_stress [1] stops once the cgroup limit is reached:
> >
> > # ./iova_stress
> > iova space:     0T      free memory:   497G
> > iova space:     1T      free memory:   495G
> > iova space:     2T      free memory:   493G
> > iova space:     3T      free memory:   491G
> >
> > This series incorporates suggestions that came from the discussion
> > at LPC [2].
> >
> > [1] https://github.com/soleen/iova_stress
> > [2] https://lpc.events/event/17/contributions/1466
> >
> > Pasha Tatashin (16):
> >   iommu/vt-d: add wrapper functions for page allocations
> >   iommu/amd: use page allocation function provided by iommu-pages.h
> >   iommu/io-pgtable-arm: use page allocation function provided by
> >     iommu-pages.h
> >   iommu/io-pgtable-dart: use page allocation function provided by
> >     iommu-pages.h
> >   iommu/io-pgtable-arm-v7s: use page allocation function provided by
> >     iommu-pages.h
> >   iommu/dma: use page allocation function provided by iommu-pages.h
> >   iommu/exynos: use page allocation function provided by iommu-pages.h
> >   iommu/fsl: use page allocation function provided by iommu-pages.h
> >   iommu/iommufd: use page allocation function provided by iommu-pages.h
> >   iommu/rockchip: use page allocation function provided by iommu-pages.h
> >   iommu/sun50i: use page allocation function provided by iommu-pages.h
> >   iommu/tegra-smmu: use page allocation function provided by
> >     iommu-pages.h
> >   iommu: observability of the IOMMU allocations
> >   iommu: account IOMMU allocated memory
> >   vhost-vdpa: account iommu allocations
> >   vfio: account iommu allocations
> >
> >  Documentation/admin-guide/cgroup-v2.rst |   2 +-
> >  Documentation/filesystems/proc.rst      |   4 +-
> >  drivers/iommu/amd/amd_iommu.h           |   8 -
> >  drivers/iommu/amd/init.c                |  91 +++++-----
> >  drivers/iommu/amd/io_pgtable.c          |  13 +-
> >  drivers/iommu/amd/io_pgtable_v2.c       |  20 +-
> >  drivers/iommu/amd/iommu.c               |  13 +-
> >  drivers/iommu/dma-iommu.c               |   8 +-
> >  drivers/iommu/exynos-iommu.c            |  14 +-
> >  drivers/iommu/fsl_pamu.c                |   5 +-
> >  drivers/iommu/intel/dmar.c              |  10 +-
> >  drivers/iommu/intel/iommu.c             |  47 ++---
> >  drivers/iommu/intel/iommu.h             |   2 -
> >  drivers/iommu/intel/irq_remapping.c     |  10 +-
> >  drivers/iommu/intel/pasid.c             |  12 +-
> >  drivers/iommu/intel/svm.c               |   7 +-
> >  drivers/iommu/io-pgtable-arm-v7s.c      |   9 +-
> >  drivers/iommu/io-pgtable-arm.c          |   7 +-
> >  drivers/iommu/io-pgtable-dart.c         |  37 ++--
> >  drivers/iommu/iommu-pages.h             | 231 ++++++++++++++++++++++++
> >  drivers/iommu/iommufd/iova_bitmap.c     |   6 +-
> >  drivers/iommu/rockchip-iommu.c          |  14 +-
> >  drivers/iommu/sun50i-iommu.c            |   7 +-
> >  drivers/iommu/tegra-smmu.c              |  18 +-
> >  drivers/vfio/vfio_iommu_type1.c         |   8 +-
> >  drivers/vhost/vdpa.c                    |   3 +-
> >  include/linux/mmzone.h                  |   5 +-
> >  mm/vmstat.c                             |   3 +
> >  28 files changed, 415 insertions(+), 199 deletions(-)
> >  create mode 100644 drivers/iommu/iommu-pages.h
> >
> > --
> > 2.43.0.rc2.451.g8631bc7472-goog
> >
> >
Pasha Tatashin Nov. 28, 2023, 10:50 p.m. UTC | #2
On Tue, Nov 28, 2023 at 5:34 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> > Convert iommu/dma-iommu.c to use the new page allocation functions
> > provided in iommu-pages.h.
>
> These have nothing to do with IOMMU pagetables, they are DMA buffers and
> they belong to whoever called the corresponding dma_alloc_* function.

Hi Robin,

This is true; however, we want to account and observe the pages
allocated by the IOMMU subsystem for DMA buffers, as they are
essentially unmovable locked pages. Should we separate IOMMU memory
from KVM memory altogether and add another field to /proc/meminfo,
something like "iommu -> iommu pagetable and dma memory", or do we
want to export DMA memory separately from IOMMU page tables?

Since I included DMA memory, I specifically removed mentions of IOMMU
page tables in most places, and only report it as IOMMU memory.
However, since it is still bundled together with SecPageTables, it can
be confusing.

Pasha
Robin Murphy Nov. 28, 2023, 10:59 p.m. UTC | #3
On 2023-11-28 10:50 pm, Pasha Tatashin wrote:
> On Tue, Nov 28, 2023 at 5:34 PM Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
>>> Convert iommu/dma-iommu.c to use the new page allocation functions
>>> provided in iommu-pages.h.
>>
>> These have nothing to do with IOMMU pagetables, they are DMA buffers and
>> they belong to whoever called the corresponding dma_alloc_* function.
> 
> Hi Robin,
> 
> This is true, however, we want to account and observe the pages
> allocated by IOMMU subsystem for DMA buffers, as they are essentially
> unmovable locked pages. Should we separate IOMMU memory from KVM
> memory all together and add another field to /proc/meminfo, something
> like "iommu -> iommu pagetable and dma memory", or do we want to
> export DMA memory separately from IOMMU page tables?

These are not allocated by "the IOMMU subsystem"; they are allocated by 
the DMA API. Even if you want to claim that a driver pinning memory via 
iommu_dma_ops is somehow different from the same driver pinning the same 
amount of memory via dma-direct when iommu.passthrough=1, it's still 
nonsense because you're failing to account the pages which iommu_dma_ops 
gets from CMA, dma_common_alloc_pages(), dynamic SWIOTLB, the various 
pools, and so on.

Thanks,
Robin.

> Since, I included DMA memory, I specifically removed mentioning of
> IOMMU page tables in the most of places, and only report it as IOMMU
> memory. However, since it is still bundled together with SecPageTables
> it can be confusing.
> 
> Pasha
Pasha Tatashin Nov. 28, 2023, 11 p.m. UTC | #4
On Tue, Nov 28, 2023 at 5:53 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> > Convert iommu/fsl_pamu.c to use the new page allocation functions
> > provided in iommu-pages.h.
>
> Again, this is not a pagetable. This thing doesn't even *have* pagetables.
>
> Similar to patches #1 and #2 where you're lumping in configuration
> tables which belong to the IOMMU driver itself, as opposed to pagetables
> which effectively belong to an IOMMU domain's user. But then there are
> still drivers where you're *not* accounting similar configuration
> structures, so I really struggle to see how this metric is useful when
> it's so completely inconsistent in what it's counting :/

The whole IOMMU subsystem allocates a significant amount of locked
kernel memory that we want to at least observe. The new field in
vmstat does just that: it reports ALL buddy allocator memory that the
IOMMU allocates. However, for accounting purposes, I agree we need to
do better and separate at least the iommu pagetables from the rest.

We can separate the metric into two:
iommu pagetable only
iommu everything

or into three:
iommu pagetable only
iommu dma
iommu everything

What do you think?

Pasha
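
A rough sketch of the two-way split proposed above (NR_IOMMU_PGTABLE is a
hypothetical new item, and NR_IOMMU_PAGES is assumed to be the item behind
the new nr_iommu_pages field; neither name is quoted from the patches):

#include <linux/mm.h>
#include <linux/mmzone.h>

/*
 * Hypothetical helper: bump the "everything" counter for every IOMMU
 * allocation, and the page-table-only counter just for page tables.
 */
static inline void iommu_account_pages(struct page *page, int order,
				       bool pgtable)
{
	const long nr = 1L << order;

	mod_node_page_state(page_pgdat(page), NR_IOMMU_PAGES, nr);
	if (pgtable)
		mod_node_page_state(page_pgdat(page), NR_IOMMU_PGTABLE, nr);
}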

>
> Thanks,
> Robin.
>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >   drivers/iommu/fsl_pamu.c | 5 +++--
> >   1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/fsl_pamu.c b/drivers/iommu/fsl_pamu.c
> > index f37d3b044131..7bfb49940f0c 100644
> > --- a/drivers/iommu/fsl_pamu.c
> > +++ b/drivers/iommu/fsl_pamu.c
> > @@ -16,6 +16,7 @@
> >   #include <linux/platform_device.h>
> >
> >   #include <asm/mpc85xx.h>
> > +#include "iommu-pages.h"
> >
> >   /* define indexes for each operation mapping scenario */
> >   #define OMI_QMAN        0x00
> > @@ -828,7 +829,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
> >               (PAGE_SIZE << get_order(OMT_SIZE));
> >       order = get_order(mem_size);
> >
> > -     p = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
> > +     p = __iommu_alloc_pages(GFP_KERNEL, order);
> >       if (!p) {
> >               dev_err(dev, "unable to allocate PAACT/SPAACT/OMT block\n");
> >               ret = -ENOMEM;
> > @@ -916,7 +917,7 @@ static int fsl_pamu_probe(struct platform_device *pdev)
> >               iounmap(guts_regs);
> >
> >       if (ppaact)
> > -             free_pages((unsigned long)ppaact, order);
> > +             iommu_free_pages(ppaact, order);
> >
> >       ppaact = NULL;
> >
Yosry Ahmed Nov. 28, 2023, 11:03 p.m. UTC | #5
On Tue, Nov 28, 2023 at 2:32 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Tue, Nov 28, 2023 at 4:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Nov 28, 2023 at 12:49 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > From: Pasha Tatashin <tatashin@google.com>
> > >
> > > The IOMMU subsystem may hold state measured in gigabytes, the majority
> > > of it in IOMMU page tables. Yet there is currently no way to observe
> > > how much memory is actually used by the IOMMU subsystem.
> > >
> > > This patch series solves this problem by adding both observability of
> > > all pages that are allocated by the IOMMU, and accountability, so that
> > > admins can limit the amount via cgroups.
> > >
> > > The system-wide observability is using /proc/meminfo:
> > > SecPageTables:    438176 kB
> > >
> > > Contains IOMMU and KVM memory.
> > >
> > > Per-node observability:
> > > /sys/devices/system/node/nodeN/meminfo
> > > Node N SecPageTables:    422204 kB
> > >
> > > Contains IOMMU and KVM memory in the given NUMA node.
> > >
> > > Per-node IOMMU only observability:
> > > /sys/devices/system/node/nodeN/vmstat
> > > nr_iommu_pages 105555
> > >
> > > Contains number of pages IOMMU allocated in the given node.
> >
> > Does it make sense to have a KVM-only entry there as well?
> >
> > In that case, if SecPageTables in /proc/meminfo is found to be
> > suspiciously high, it should be easy to tell which component is
> > contributing most usage through vmstat. I understand that users can do
> > the subtraction, but we wouldn't want userspace depending on that, in
> > case a third class of "secondary" page tables emerges that we want to
> > add to SecPageTables. The in-kernel implementation can do the
> > subtraction for now if it makes sense though.
>
> Hi Yosry,
>
> Yes, another counter for KVM could be added. On the other hand KVM
> only can be computed by subtracting one from another as there are only
> two types of secondary page tables, KVM and IOMMU:
>
> /sys/devices/system/node/node0/meminfo
> Node 0 SecPageTables:    422204 kB
>
>  /sys/devices/system/node/nodeN/vmstat
> nr_iommu_pages 105555
>
> KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024
>

Right, but as I mention above, if userspace starts depending on this
equation, we won't be able to add any more classes of "secondary" page
tables to SecPageTables. I'd like to avoid that if possible. We can do
the subtraction in the kernel.
Pasha Tatashin Nov. 28, 2023, 11:06 p.m. UTC | #6
On Tue, Nov 28, 2023 at 5:59 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2023-11-28 10:50 pm, Pasha Tatashin wrote:
> > On Tue, Nov 28, 2023 at 5:34 PM Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> >>> Convert iommu/dma-iommu.c to use the new page allocation functions
> >>> provided in iommu-pages.h.
> >>
> >> These have nothing to do with IOMMU pagetables, they are DMA buffers and
> >> they belong to whoever called the corresponding dma_alloc_* function.
> >
> > Hi Robin,
> >
> > This is true, however, we want to account and observe the pages
> > allocated by IOMMU subsystem for DMA buffers, as they are essentially
> > unmovable locked pages. Should we separate IOMMU memory from KVM
> > memory all together and add another field to /proc/meminfo, something
> > like "iommu -> iommu pagetable and dma memory", or do we want to
> > export DMA memory separately from IOMMU page tables?
>
> These are not allocated by "the IOMMU subsystem", they are allocated by
> the DMA API. Even if you want to claim that a driver pinning memory via
> iommu_dma_ops is somehow different from the same driver pinning the same
> amount of memory via dma-direct when iommu.passthrough=1, it's still
> nonsense because you're failing to account the pages which iommu_dma_ops
> gets from CMA, dma_common_alloc_pages(), dynamic SWIOTLB, the various
> pools, and so on.
>
> Thanks,
> Robin.
>
> > Since, I included DMA memory, I specifically removed mentioning of
> > IOMMU page tables in the most of places, and only report it as IOMMU
> > memory. However, since it is still bundled together with SecPageTables
> > it can be confusing.
> >
> > Pasha
Pasha Tatashin Nov. 28, 2023, 11:08 p.m. UTC | #7
> > This is true, however, we want to account and observe the pages
> > allocated by IOMMU subsystem for DMA buffers, as they are essentially
> > unmovable locked pages. Should we separate IOMMU memory from KVM
> > memory all together and add another field to /proc/meminfo, something
> > like "iommu -> iommu pagetable and dma memory", or do we want to
> > export DMA memory separately from IOMMU page tables?
>
> These are not allocated by "the IOMMU subsystem", they are allocated by
> the DMA API. Even if you want to claim that a driver pinning memory via
> iommu_dma_ops is somehow different from the same driver pinning the same
> amount of memory via dma-direct when iommu.passthrough=1, it's still
> nonsense because you're failing to account the pages which iommu_dma_ops
> gets from CMA, dma_common_alloc_pages(), dynamic SWIOTLB, the various
> pools, and so on.

I see, the IOMMU variants are used only for discontiguous allocations,
and the common ones are defined outside of drivers/iommu. Alright, I
can remove the changes for all non-page-table-related IOMMU
allocations.

Pasha
Jason Gunthorpe Nov. 28, 2023, 11:50 p.m. UTC | #8
On Tue, Nov 28, 2023 at 06:00:13PM -0500, Pasha Tatashin wrote:
> On Tue, Nov 28, 2023 at 5:53 PM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
> > > Convert iommu/fsl_pamu.c to use the new page allocation functions
> > > provided in iommu-pages.h.
> >
> > Again, this is not a pagetable. This thing doesn't even *have* pagetables.
> >
> > Similar to patches #1 and #2 where you're lumping in configuration
> > tables which belong to the IOMMU driver itself, as opposed to pagetables
> > which effectively belong to an IOMMU domain's user. But then there are
> > still drivers where you're *not* accounting similar configuration
> > structures, so I really struggle to see how this metric is useful when
> > it's so completely inconsistent in what it's counting :/
> 
> The whole IOMMU subsystem allocates a significant amount of kernel
> locked memory that we want to at least observe. The new field in
> vmstat does just that: it reports ALL buddy allocator memory that
> IOMMU allocates. However, for accounting purposes, I agree, we need to
> do better, and separate at least iommu pagetables from the rest.
> 
> We can separate the metric into two:
> iommu pagetable only
> iommu everything
> 
> or into three:
> iommu pagetable only
> iommu dma
> iommu everything
> 
> What do you think?

I think I said this at LPC - if you want to have fine-grained
accounting of memory by owner, you need to go talk to the cgroup people
and come up with something generic. Adding ever more open-coded finer
category breakdowns just for iommu doesn't make a lot of sense.

You can make some argument that the pagetable memory should be counted
because kvm counts its shadow memory, but I wouldn't go into further
detail than that with hand-coded counters.

Jason
Jason Gunthorpe Nov. 28, 2023, 11:52 p.m. UTC | #9
On Tue, Nov 28, 2023 at 03:03:30PM -0800, Yosry Ahmed wrote:
> > Yes, another counter for KVM could be added. On the other hand KVM
> > only can be computed by subtracting one from another as there are only
> > two types of secondary page tables, KVM and IOMMU:
> >
> > /sys/devices/system/node/node0/meminfo
> > Node 0 SecPageTables:    422204 kB
> >
> >  /sys/devices/system/node/nodeN/vmstat
> > nr_iommu_pages 105555
> >
> > KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024
> >
> 
> Right, but as I mention above, if userspace starts depending on this
> equation, we won't be able to add any more classes of "secondary" page
> tables to SecPageTables. I'd like to avoid that if possible. We can do
> the subtraction in the kernel.

What Sean had suggested was that SecPageTables was always intended to
account all the non-primary mmu memory used by page tables. If this is
the case we shouldn't be trying to break it apart into finer
counters. These are big picture counters, not detailed allocation by
owner counters.

Jason
Jason Gunthorpe Nov. 28, 2023, 11:52 p.m. UTC | #10
On Tue, Nov 28, 2023 at 08:49:31PM +0000, Pasha Tatashin wrote:
> Convert iommu/iommufd/* files to use the new page allocation functions
> provided in iommu-pages.h.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  drivers/iommu/iommufd/iova_bitmap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

This is a short-term allocation; it should not be counted, which is
why it is already not using GFP_KERNEL_ACCOUNT.

Jason
Yosry Ahmed Nov. 29, 2023, 12:25 a.m. UTC | #11
On Tue, Nov 28, 2023 at 3:52 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Nov 28, 2023 at 03:03:30PM -0800, Yosry Ahmed wrote:
> > > Yes, another counter for KVM could be added. On the other hand KVM
> > > only can be computed by subtracting one from another as there are only
> > > two types of secondary page tables, KVM and IOMMU:
> > >
> > > /sys/devices/system/node/node0/meminfo
> > > Node 0 SecPageTables:    422204 kB
> > >
> > >  /sys/devices/system/node/nodeN/vmstat
> > > nr_iommu_pages 105555
> > >
> > > KVM only = SecPageTables - nr_iommu_pages * PAGE_SIZE / 1024
> > >
> >
> > Right, but as I mention above, if userspace starts depending on this
> > equation, we won't be able to add any more classes of "secondary" page
> > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > the subtraction in the kernel.
>
> What Sean had suggested was that SecPageTables was always intended to
> account all the non-primary mmu memory used by page tables. If this is
> the case we shouldn't be trying to break it apart into finer
> counters. These are big picture counters, not detailed allocation by
> owner counters.

Right, I agree with that, but if SecPageTables includes page tables
from multiple sources, and it is observed to be suspiciously high, the
logical next step is to try to find the culprit, right?

>
> Jason
Jason Gunthorpe Nov. 29, 2023, 12:28 a.m. UTC | #12
On Tue, Nov 28, 2023 at 04:25:03PM -0800, Yosry Ahmed wrote:

> > > Right, but as I mention above, if userspace starts depending on this
> > > equation, we won't be able to add any more classes of "secondary" page
> > > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > > the subtraction in the kernel.
> >
> > What Sean had suggested was that SecPageTables was always intended to
> > account all the non-primary mmu memory used by page tables. If this is
> > the case we shouldn't be trying to break it apart into finer
> > counters. These are big picture counters, not detailed allocation by
> > owner counters.
> 
> Right, I agree with that, but if SecPageTables includes page tables
> from multiple sources, and it is observed to be suspiciously high, the
> logical next step is to try to find the culprit, right?

You can make that case already: if it is high, wouldn't you want to
find the exact VMM process that was making it high?

It is a sign of fire, not a detailed debug tool.

Jason
Yosry Ahmed Nov. 29, 2023, 12:30 a.m. UTC | #13
On Tue, Nov 28, 2023 at 4:28 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Nov 28, 2023 at 04:25:03PM -0800, Yosry Ahmed wrote:
>
> > > > Right, but as I mention above, if userspace starts depending on this
> > > > equation, we won't be able to add any more classes of "secondary" page
> > > > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > > > the subtraction in the kernel.
> > >
> > > What Sean had suggested was that SecPageTables was always intended to
> > > account all the non-primary mmu memory used by page tables. If this is
> > > the case we shouldn't be trying to break it apart into finer
> > > counters. These are big picture counters, not detailed allocation by
> > > owner counters.
> >
> > Right, I agree with that, but if SecPageTables includes page tables
> > from multiple sources, and it is observed to be suspiciously high, the
> > logical next step is to try to find the culprit, right?
>
> You can make that case already, if it is high wouldn't you want to
> find the exact VMM process that was making it high?
>
> It is a sign of fire, not a detailed debug tool.

Fair enough. We can always add separate counters later if needed,
potentially under KVM stats to get more fine-grained details as you
mentioned.

I am only worried about users subtracting the iommu-only counter to
get a KVM counter. We should at least document that SecPageTables may
be expanded to include other sources later to avoid that.

>
> Jason
Jason Gunthorpe Nov. 29, 2023, 12:54 a.m. UTC | #14
On Tue, Nov 28, 2023 at 04:30:27PM -0800, Yosry Ahmed wrote:
> On Tue, Nov 28, 2023 at 4:28 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Nov 28, 2023 at 04:25:03PM -0800, Yosry Ahmed wrote:
> >
> > > > > Right, but as I mention above, if userspace starts depending on this
> > > > > equation, we won't be able to add any more classes of "secondary" page
> > > > > tables to SecPageTables. I'd like to avoid that if possible. We can do
> > > > > the subtraction in the kernel.
> > > >
> > > > What Sean had suggested was that SecPageTables was always intended to
> > > > account all the non-primary mmu memory used by page tables. If this is
> > > > the case we shouldn't be trying to break it apart into finer
> > > > counters. These are big picture counters, not detailed allocation by
> > > > owner counters.
> > >
> > > Right, I agree with that, but if SecPageTables includes page tables
> > > from multiple sources, and it is observed to be suspiciously high, the
> > > logical next step is to try to find the culprit, right?
> >
> > You can make that case already, if it is high wouldn't you want to
> > find the exact VMM process that was making it high?
> >
> > It is a sign of fire, not a detailed debug tool.
> 
> Fair enough. We can always add separate counters later if needed,
> potentially under KVM stats to get more fine-grained details as you
> mentioned.
> 
> I am only worried about users subtracting the iommu-only counter to
> get a KVM counter. We should at least document that  SecPageTables may
> be expanded to include other sources later to avoid that.

Well, we just broke it already; anyone thinking it was only kvm
counters is going to be sad now :) As I understand it, it was already
described to be more general than kvm, so there is probably nothing to
do really.

Jason
Robin Murphy Nov. 29, 2023, 4:48 p.m. UTC | #15
On 28/11/2023 11:50 pm, Jason Gunthorpe wrote:
> On Tue, Nov 28, 2023 at 06:00:13PM -0500, Pasha Tatashin wrote:
>> On Tue, Nov 28, 2023 at 5:53 PM Robin Murphy <robin.murphy@arm.com> wrote:
>>>
>>> On 2023-11-28 8:49 pm, Pasha Tatashin wrote:
>>>> Convert iommu/fsl_pamu.c to use the new page allocation functions
>>>> provided in iommu-pages.h.
>>>
>>> Again, this is not a pagetable. This thing doesn't even *have* pagetables.
>>>
>>> Similar to patches #1 and #2 where you're lumping in configuration
>>> tables which belong to the IOMMU driver itself, as opposed to pagetables
>>> which effectively belong to an IOMMU domain's user. But then there are
>>> still drivers where you're *not* accounting similar configuration
>>> structures, so I really struggle to see how this metric is useful when
>>> it's so completely inconsistent in what it's counting :/
>>
>> The whole IOMMU subsystem allocates a significant amount of kernel
>> locked memory that we want to at least observe. The new field in
>> vmstat does just that: it reports ALL buddy allocator memory that
>> IOMMU allocates. However, for accounting purposes, I agree, we need to
>> do better, and separate at least iommu pagetables from the rest.
>>
>> We can separate the metric into two:
>> iommu pagetable only
>> iommu everything
>>
>> or into three:
>> iommu pagetable only
>> iommu dma
>> iommu everything
>>
>> What do you think?
> 
> I think I said this at LPC - if you want to have fine grained
> accounting of memory by owner you need to go talk to the cgroup people
> and come up with something generic. Adding ever open coded finer
> category breakdowns just for iommu doesn't make alot of sense.
> 
> You can make some argument that the pagetable memory should be counted
> because kvm counts it's shadow memory, but I wouldn't go into further
> detail than that with hand coded counters..

Right, pagetable memory is interesting since it's something that any 
random kernel user can indirectly allocate via iommu_domain_alloc() and 
iommu_map(), and some of those users may even be doing so on behalf of 
userspace. I have no objection to accounting and potentially applying 
limits to *that*.

Beyond that, though, there is nothing special about "the IOMMU 
subsystem". The amount of memory an IOMMU driver needs to allocate for 
itself in order to function is not of interest beyond curiosity, it just 
is what it is; limiting it would only break the IOMMU, and if a user 
thinks it's "too much", the only actionable thing that might help is to 
physically remove devices from the system. Similar for DMA buffers; it 
might be intriguing to account those, but it's not really an actionable 
metric - in the overwhelming majority of cases you can't simply tell a 
driver to allocate less than what it needs. And that is of course 
assuming we were to account *all* DMA buffers, since whether they 
happen to have an IOMMU translation or not is irrelevant (we'd have 
already accounted the pagetables as pagetables if so).

I bet "the networking subsystem" also consumes significant memory on the 
same kind of big systems where IOMMU pagetables would be of any concern. 
I believe some of the "serious" NICs can easily run up 
hundreds of megabytes if not gigabytes worth of queues, SKB pools, etc. 
- would you propose accounting those too?

Thanks,
Robin.
Pasha Tatashin Nov. 29, 2023, 7:45 p.m. UTC | #16
> >> We can separate the metric into two:
> >> iommu pagetable only
> >> iommu everything
> >>
> >> or into three:
> >> iommu pagetable only
> >> iommu dma
> >> iommu everything
> >>
> >> What do you think?
> >
> > I think I said this at LPC - if you want to have fine grained
> > accounting of memory by owner you need to go talk to the cgroup people
> > and come up with something generic. Adding ever open coded finer
> > category breakdowns just for iommu doesn't make alot of sense.
> >
> > You can make some argument that the pagetable memory should be counted
> > because kvm counts it's shadow memory, but I wouldn't go into further
> > detail than that with hand coded counters..
>
> Right, pagetable memory is interesting since it's something that any
> random kernel user can indirectly allocate via iommu_domain_alloc() and
> iommu_map(), and some of those users may even be doing so on behalf of
> userspace. I have no objection to accounting and potentially applying
> limits to *that*.

Yes, in the next version, I will separate pagetable-only memory from
the rest, for the limits.

> Beyond that, though, there is nothing special about "the IOMMU
> subsystem". The amount of memory an IOMMU driver needs to allocate for
> itself in order to function is not of interest beyond curiosity, it just
> is what it is; limiting it would only break the IOMMU, and if a user

I agree about the amount of memory the IOMMU allocates for itself, but
that should be small; if it is not, we have to at least show where the
memory is used.

> thinks it's "too much", the only actionable thing that might help is to
> physically remove devices from the system. Similar for DMA buffers; it
> might be intriguing to account those, but it's not really an actionable
> metric - in the overwhelming majority of cases you can't simply tell a
> driver to allocate less than what it needs. And that is of course
> assuming if we were to account *all* DMA buffers, since whether they
> happen to have an IOMMU translation or not is irrelevant (we'd have
> already accounted the pagetables as pagetables if so).

DMA mappings should be observable (they do not have to be limited). At
the very least, this can help explain kernel memory overhead anomalies
on production systems.

> I bet "the networking subsystem" also consumes significant memory on the

It does, and GPU drivers also may consume a significant amount of memory.

> same kind of big systems where IOMMU pagetables would be of any concern.
> I believe some of the "serious" NICs can easily run up
> hundreds of megabytes if not gigabytes worth of queues, SKB pools, etc.
> - would you propose accounting those too?

Yes. Any kind of kernel memory that is proportional to the workload
should be accountable. Someone is using those resources compared to
the idling system, and that someone should be charged.

Pasha
Jason Gunthorpe Nov. 29, 2023, 8:03 p.m. UTC | #17
On Wed, Nov 29, 2023 at 02:45:03PM -0500, Pasha Tatashin wrote:

> > same kind of big systems where IOMMU pagetables would be of any concern.
> > I believe some of the some of the "serious" NICs can easily run up
> > hundreds of megabytes if not gigabytes worth of queues, SKB pools, etc.
> > - would you propose accounting those too?
> 
> Yes. Any kind of kernel memory that is proportional to the workload
> should be accountable. Someone is using those resources compared to
> the idling system, and that someone should be charged.

There is a difference between charged and accounted.

You should be running around adding GFP_KERNEL_ACCOUNT, yes. I already
did a bunch of that work. Split that out from this series and send it
to the right maintainers.

Adding a counter for allocations and showing it in procfs is a very
different question. IMHO that should not be done at a micro level; the
threshold to add a new counter should be high.

There is definitely room for a generic debugging feature to break down
GFP_KERNEL_ACCOUNT by ownership somehow. Maybe it can already be done
with BPF. IDK

Jason
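
As a concrete illustration of the charged-versus-accounted distinction
above: charging only requires passing GFP_KERNEL_ACCOUNT at the allocation
site, with no new counters involved. A generic sketch with made-up helper
names, not a hunk from this series:

#include <linux/gfp.h>
#include <linux/slab.h>

/*
 * Long-lived, userspace-triggerable allocations are charged to the
 * calling task's memory cgroup when GFP_KERNEL_ACCOUNT is used instead
 * of plain GFP_KERNEL (the helper names below are hypothetical).
 */
static void *alloc_user_controlled_table(size_t size)
{
	return kzalloc(size, GFP_KERNEL_ACCOUNT);
}

static struct page *alloc_user_controlled_pages(int nid, unsigned int order)
{
	return alloc_pages_node(nid, GFP_KERNEL_ACCOUNT | __GFP_ZERO, order);
}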
Pasha Tatashin Nov. 29, 2023, 8:44 p.m. UTC | #18
On Wed, Nov 29, 2023 at 3:03 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Nov 29, 2023 at 02:45:03PM -0500, Pasha Tatashin wrote:
>
> > > same kind of big systems where IOMMU pagetables would be of any concern.
> > > I believe some of the some of the "serious" NICs can easily run up
> > > hundreds of megabytes if not gigabytes worth of queues, SKB pools, etc.
> > > - would you propose accounting those too?
> >
> > Yes. Any kind of kernel memory that is proportional to the workload
> > should be accountable. Someone is using those resources compared to
> > the idling system, and that someone should be charged.
>
> There is a difference between charged and accounted
>
> You should be running around adding GFP_KERNEL_ACCOUNT, yes. I already
> did a bunch of that work. Split that out from this series and send it
> to the right maintainers.

I will do that.

>
> Adding a counter for allocations and showing in procfs is a very
> different question. IMHO that should not be done in micro, the
> threshold to add a new counter should be high.

I agree that /proc/meminfo should not include everything; however,
overall network consumption that includes memory allocated by network
drivers would be useful to have. Maybe it should be exported by device
drivers and added to the protocol memory. We already have network
protocol memory consumption in procfs:

# awk '{printf "%-10s %s\n", $1, $4}' /proc/net/protocols | grep  -v '\-1'
protocol   memory
UDPv6      22673
TCPv6      16961

> There is definately room for a generic debugging feature to break down
> GFP_KERNEL_ACCOUNT by owernship somehow. Maybe it can already be done
> with BPF. IDK
Pasha Tatashin Nov. 29, 2023, 9:59 p.m. UTC | #19
On Tue, Nov 28, 2023 at 6:52 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Nov 28, 2023 at 08:49:31PM +0000, Pasha Tatashin wrote:
> > Convert iommu/iommufd/* files to use the new page allocation functions
> > provided in iommu-pages.h.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  drivers/iommu/iommufd/iova_bitmap.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
>
> This is a short term allocation, it should not be counted, that is why
> it is already not using GFP_KERNEL_ACCOUNT.

I made this change for completeness. I changed all calls to
get_free_page/alloc_page etc. under drivers/iommu to use the
iommu_alloc_* variants; this also helps future developers in this area
use the right allocation functions.
The accounting is implemented using cheap per-cpu counters, so it
should not affect performance; I think it is OK to keep them here.
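
A minimal sketch of what one of these wrappers and its free counterpart
could look like (the __iommu_alloc_pages()/iommu_free_pages() names appear
in the fsl_pamu hunk quoted earlier in the thread; the bodies, the implicit
__GFP_ZERO, and the NR_IOMMU_PAGES node_stat_item name are assumptions,
not code taken from the series):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/mmzone.h>

/*
 * Sketch only: allocate 2^order zeroed pages for IOMMU use and update
 * the per-node counter behind nr_iommu_pages via the (per-cpu batched)
 * vmstat machinery.
 */
static inline struct page *__iommu_alloc_pages(gfp_t gfp, int order)
{
	struct page *page = alloc_pages(gfp | __GFP_ZERO, order);

	if (page)
		mod_node_page_state(page_pgdat(page), NR_IOMMU_PAGES,
				    1L << order);
	return page;
}

static inline void iommu_free_pages(void *virt, int order)
{
	struct page *page;

	if (!virt)
		return;
	page = virt_to_page(virt);
	mod_node_page_state(page_pgdat(page), NR_IOMMU_PAGES,
			    -(1L << order));
	__free_pages(page, order);
}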
Jason Gunthorpe Nov. 30, 2023, 12:02 a.m. UTC | #20
On Wed, Nov 29, 2023 at 04:59:43PM -0500, Pasha Tatashin wrote:
> On Tue, Nov 28, 2023 at 6:52 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Nov 28, 2023 at 08:49:31PM +0000, Pasha Tatashin wrote:
> > > Convert iommu/iommufd/* files to use the new page allocation functions
> > > provided in iommu-pages.h.
> > >
> > > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > > ---
> > >  drivers/iommu/iommufd/iova_bitmap.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > This is a short term allocation, it should not be counted, that is why
> > it is already not using GFP_KERNEL_ACCOUNT.
> 
> I made this change for completeness. I changed all calls to
> get_free_page/alloc_page etc under driver/iommu to use the
> iommu_alloc_* variants, this also helps future developers in this area
> to use the right allocation functions.
> The accounting is implemented using cheap per-cpu counters, so should
> not affect the performance, I think it is OK to keep them here.

Except it is a misuse of an API that should only be used for page
table memory :(

Jason