diff mbox series

[RFC] drm/panfrost: Add support for mapping BOs on GPU page faults

Message ID 20190610170440.28525-1-robh@kernel.org
State New
Headers show
Series [RFC] drm/panfrost: Add support for mapping BOs on GPU page faults | expand

Commit Message

Rob Herring June 10, 2019, 5:04 p.m. UTC
The midgard/bifrost GPUs need to allocate GPU memory which is allocated
on GPU page faults and not pinned in memory. The vendor driver calls
this functionality GROW_ON_GPF.

This implementation assumes that BOs allocated with the
PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
actually work, but I'm unsure if there's some interaction there. It
would cause the whole object to be pinned in memory which would defeat
the point of this.

Issues/questions/thoughts:

What's the difference between i_mapping and f_mapping?

What kind of clean-up on close is needed? Based on vgem faults, there
doesn't seem to be any refcounting. Assume userspace is responsible for
not freeing the BO while a page fault can occur?

What about evictions? Need to call mapping_set_unevictable()? Maybe we
want these pages to be swappable, but then we need some notification to
unmap them.

No need for runtime PM, because faults should only occur during job
submit?

Need to fault in 2MB sections when possible. It seems this has to be
done by getting an array of struct pages and then converting to a
scatterlist. Not sure if there is a better way to do that. There's also
some errata on T604 needing to fault in at least 8 pages at a time which
isn't handled.

Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Cc: Tomeu Vizoso <tomeu@tomeuvizoso.net>
Signed-off-by: Rob Herring <robh@kernel.org>
---

 drivers/gpu/drm/panfrost/TODO           |  2 -
 drivers/gpu/drm/panfrost/panfrost_drv.c | 22 +++++--
 drivers/gpu/drm/panfrost/panfrost_gem.c |  3 +-
 drivers/gpu/drm/panfrost/panfrost_gem.h |  8 +++
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 76 +++++++++++++++++++++++--
 include/uapi/drm/panfrost_drm.h         |  2 +
 6 files changed, 100 insertions(+), 13 deletions(-)

Comments

Tomeu Vizoso June 12, 2019, 12:54 p.m. UTC | #1
On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@kernel.org> wrote:
>
> The midgard/bifrost GPUs need to allocate GPU memory which is allocated
> on GPU page faults and not pinned in memory. The vendor driver calls
> this functionality GROW_ON_GPF.
>
> This implementation assumes that BOs allocated with the
> PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
> actually work, but I'm unsure if there's some interaction there. It
> would cause the whole object to be pinned in memory which would defeat
> the point of this.
>
> Issues/questions/thoughts:
>
> What's the difference between i_mapping and f_mapping?
>
> What kind of clean-up on close is needed? Based on vgem faults, there
> doesn't seem to be any refcounting. Assume userspace is responsible for
> not freeing the BO while a page fault can occur?

Aren't we taking a reference on all BOs that a job relates to and
unreferencing them once the job is done? I would think that that's
enough, or am I missing something?

> What about evictions? Need to call mapping_set_unevictable()? Maybe we
> want these pages to be swappable, but then we need some notification to
> unmap them.

I'm not sure there's much point in swapping out pages with lifetimes
of a few milliseconds.

> No need for runtime PM, because faults should only occur during job
> submit?

Makes sense to me.

> Need to fault in 2MB sections when possible. It seems this has to be
> done by getting an array of struct pages and then converting to a
> scatterlist. Not sure if there is a better way to do that. There's also
> some errata on T604 needing to fault in at least 8 pages at a time which
> isn't handled.

The old GPUs will be the most interesting to support...

Will give this patch some testing tomorrow.

Thanks,

Tomeu

> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Steven Price <steven.price@arm.com>
> Cc: Alyssa Rosenzweig <alyssa@rosenzweig.io>
> Cc: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> Signed-off-by: Rob Herring <robh@kernel.org>
> ---
>
>  drivers/gpu/drm/panfrost/TODO           |  2 -
>  drivers/gpu/drm/panfrost/panfrost_drv.c | 22 +++++--
>  drivers/gpu/drm/panfrost/panfrost_gem.c |  3 +-
>  drivers/gpu/drm/panfrost/panfrost_gem.h |  8 +++
>  drivers/gpu/drm/panfrost/panfrost_mmu.c | 76 +++++++++++++++++++++++--
>  include/uapi/drm/panfrost_drm.h         |  2 +
>  6 files changed, 100 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/panfrost/TODO b/drivers/gpu/drm/panfrost/TODO
> index c2e44add37d8..64129bf73933 100644
> --- a/drivers/gpu/drm/panfrost/TODO
> +++ b/drivers/gpu/drm/panfrost/TODO
> @@ -14,8 +14,6 @@
>    The hard part is handling when more address spaces are needed than what
>    the h/w provides.
>
> -- Support pinning pages on demand (GPU page faults).
> -
>  - Support userspace controlled GPU virtual addresses. Needed for Vulkan. (Tomeu)
>
>  - Support for madvise and a shrinker.
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index d11e2281dde6..f08d41d5186a 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -44,9 +44,11 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
>  {
>         int ret;
>         struct drm_gem_shmem_object *shmem;
> +       struct panfrost_gem_object *bo;
>         struct drm_panfrost_create_bo *args = data;
>
> -       if (!args->size || args->flags || args->pad)
> +       if (!args->size || args->pad ||
> +           (args->flags & ~PANFROST_BO_NOMAP))
>                 return -EINVAL;
>
>         shmem = drm_gem_shmem_create_with_handle(file, dev, args->size,
> @@ -54,11 +56,16 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
>         if (IS_ERR(shmem))
>                 return PTR_ERR(shmem);
>
> -       ret = panfrost_mmu_map(to_panfrost_bo(&shmem->base));
> -       if (ret)
> -               goto err_free;
> +       bo = to_panfrost_bo(&shmem->base);
>
> -       args->offset = to_panfrost_bo(&shmem->base)->node.start << PAGE_SHIFT;
> +       if (!(args->flags & PANFROST_BO_NOMAP)) {
> +               ret = panfrost_mmu_map(bo);
> +               if (ret)
> +                       goto err_free;
> +       } else
> +               bo->no_map = 1;
> +
> +       args->offset = bo->node.start << PAGE_SHIFT;
>
>         return 0;
>
> @@ -269,6 +276,11 @@ static int panfrost_ioctl_mmap_bo(struct drm_device *dev, void *data,
>                 return -ENOENT;
>         }
>
> +       if (to_panfrost_bo(gem_obj)->no_map) {
> +               DRM_DEBUG("Can't mmap 'no_map' GEM BO %d\n", args->handle);
> +               return -EINVAL;
> +       }
> +
>         ret = drm_gem_create_mmap_offset(gem_obj);
>         if (ret == 0)
>                 args->offset = drm_vma_node_offset_addr(&gem_obj->vma_node);
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c
> index 886875ae31d3..ed0b4f79610a 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
> @@ -19,7 +19,8 @@ static void panfrost_gem_free_object(struct drm_gem_object *obj)
>         struct panfrost_gem_object *bo = to_panfrost_bo(obj);
>         struct panfrost_device *pfdev = obj->dev->dev_private;
>
> -       panfrost_mmu_unmap(bo);
> +       if (!bo->no_map)
> +               panfrost_mmu_unmap(bo);
>
>         spin_lock(&pfdev->mm_lock);
>         drm_mm_remove_node(&bo->node);
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h
> index 045000eb5fcf..aa210d7cbf3f 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
> @@ -11,6 +11,7 @@ struct panfrost_gem_object {
>         struct drm_gem_shmem_object base;
>
>         struct drm_mm_node node;
> +       unsigned no_map: 1;
>  };
>
>  static inline
> @@ -19,6 +20,13 @@ struct  panfrost_gem_object *to_panfrost_bo(struct drm_gem_object *obj)
>         return container_of(to_drm_gem_shmem_obj(obj), struct panfrost_gem_object, base);
>  }
>
> +static inline
> +struct  panfrost_gem_object *drm_mm_node_to_panfrost_bo(struct drm_mm_node *node)
> +{
> +       return container_of(node, struct panfrost_gem_object, node);
> +}
> +
> +
>  struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t size);
>
>  struct drm_gem_object *
> diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> index 41d184fab07f..5a36495a38f3 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> @@ -3,6 +3,7 @@
>  /* Copyright (C) 2019 Arm Ltd. */
>  #include <linux/bitfield.h>
>  #include <linux/delay.h>
> +#include <linux/dma-mapping.h>
>  #include <linux/interrupt.h>
>  #include <linux/io.h>
>  #include <linux/iopoll.h>
> @@ -10,6 +11,7 @@
>  #include <linux/iommu.h>
>  #include <linux/platform_device.h>
>  #include <linux/pm_runtime.h>
> +#include <linux/shmem_fs.h>
>  #include <linux/sizes.h>
>
>  #include "panfrost_device.h"
> @@ -277,6 +279,60 @@ static const struct iommu_gather_ops mmu_tlb_ops = {
>         .tlb_sync       = mmu_tlb_sync_context,
>  };
>
> +static struct drm_mm_node *addr_to_drm_mm_node(struct panfrost_device *pfdev, int as, u64 addr)
> +{
> +       struct drm_mm_node *node;
> +       u64 offset = addr >> PAGE_SHIFT;
> +
> +       drm_mm_for_each_node(node, &pfdev->mm) {
> +               if (offset >= node->start && offset < (node->start + node->size))
> +                       return node;
> +       }
> +       return NULL;
> +}
> +
> +int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as, u64 addr)
> +{
> +       struct page *page;
> +       dma_addr_t dma_addr;
> +       struct drm_mm_node *node;
> +       struct panfrost_gem_object *bo;
> +       struct io_pgtable_ops *ops = pfdev->mmu->pgtbl_ops;
> +       pgoff_t page_offset = addr >> PAGE_SHIFT;
> +
> +       addr &= PAGE_MASK;
> +
> +       node = addr_to_drm_mm_node(pfdev, as, addr);
> +       if (!node)
> +               return -ENOENT;
> +
> +       page_offset -= node->start;
> +       bo = drm_mm_node_to_panfrost_bo(node);
> +       if (WARN_ON(!bo->no_map))
> +               return -EINVAL;
> +
> +       dev_info(pfdev->dev, "mapping page fault @ %llx", addr);
> +
> +       page = shmem_read_mapping_page(bo->base.base.filp->f_mapping,
> +                                      page_offset);
> +       if (IS_ERR(page))
> +               return PTR_ERR(page);
> +
> +       dma_addr = dma_map_page(pfdev->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
> +
> +       mutex_lock(&pfdev->mmu->lock);
> +
> +       ops->map(ops, addr, dma_addr, PAGE_SIZE, IOMMU_WRITE | IOMMU_READ);
> +
> +       mmu_write(pfdev, MMU_INT_CLEAR, BIT(as));
> +
> +       mmu_hw_do_operation(pfdev, as, addr, PAGE_SIZE, AS_COMMAND_FLUSH_PT);
> +
> +       mutex_unlock(&pfdev->mmu->lock);
> +
> +       return 0;
> +}
> +
>  static const char *access_type_name(struct panfrost_device *pfdev,
>                 u32 fault_status)
>  {
> @@ -302,13 +358,11 @@ static irqreturn_t panfrost_mmu_irq_handler(int irq, void *data)
>  {
>         struct panfrost_device *pfdev = data;
>         u32 status = mmu_read(pfdev, MMU_INT_STAT);
> -       int i;
> +       int i, ret;
>
>         if (!status)
>                 return IRQ_NONE;
>
> -       dev_err(pfdev->dev, "mmu irq status=%x\n", status);
> -
>         for (i = 0; status; i++) {
>                 u32 mask = BIT(i) | BIT(i + 16);
>                 u64 addr;
> @@ -329,6 +383,17 @@ static irqreturn_t panfrost_mmu_irq_handler(int irq, void *data)
>                 access_type = (fault_status >> 8) & 0x3;
>                 source_id = (fault_status >> 16);
>
> +               /* Page fault only */
> +               if ((status & mask) == BIT(i)) {
> +                       WARN_ON(exception_type < 0xC1 || exception_type > 0xC4);
> +
> +                       ret = panfrost_mmu_map_fault_addr(pfdev, i, addr);
> +                       if (!ret) {
> +                               status &= ~mask;
> +                               continue;
> +                       }
> +               }
> +
>                 /* terminal fault, print info about the fault */
>                 dev_err(pfdev->dev,
>                         "Unhandled Page fault in AS%d at VA 0x%016llX\n"
> @@ -370,8 +435,9 @@ int panfrost_mmu_init(struct panfrost_device *pfdev)
>         if (irq <= 0)
>                 return -ENODEV;
>
> -       err = devm_request_irq(pfdev->dev, irq, panfrost_mmu_irq_handler,
> -                              IRQF_SHARED, "mmu", pfdev);
> +       err = devm_request_threaded_irq(pfdev->dev, irq, NULL,
> +                                       panfrost_mmu_irq_handler,
> +                                       IRQF_ONESHOT, "mmu", pfdev);
>
>         if (err) {
>                 dev_err(pfdev->dev, "failed to request mmu irq");
> diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
> index a52e0283b90d..6d83baa0e4c1 100644
> --- a/include/uapi/drm/panfrost_drm.h
> +++ b/include/uapi/drm/panfrost_drm.h
> @@ -71,6 +71,8 @@ struct drm_panfrost_wait_bo {
>         __s64 timeout_ns;       /* absolute */
>  };
>
> +#define PANFROST_BO_NOMAP      1
> +
>  /**
>   * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs.
>   *
> --
> 2.20.1
>
Daniel Vetter June 12, 2019, 3 p.m. UTC | #2
On Wed, Jun 12, 2019 at 02:54:56PM +0200, Tomeu Vizoso wrote:
> On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@kernel.org> wrote:
> >
> > The midgard/bifrost GPUs need to allocate GPU memory which is allocated
> > on GPU page faults and not pinned in memory. The vendor driver calls
> > this functionality GROW_ON_GPF.
> >
> > This implementation assumes that BOs allocated with the
> > PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
> > actually work, but I'm unsure if there's some interaction there. It
> > would cause the whole object to be pinned in memory which would defeat
> > the point of this.
> >
> > Issues/questions/thoughts:
> >
> > What's the difference between i_mapping and f_mapping?
> >
> > What kind of clean-up on close is needed? Based on vgem faults, there
> > doesn't seem to be any refcounting. Assume userspace is responsible for
> > not freeing the BO while a page fault can occur?

This really shouldn't ever be possible, "rely on userspace to not oops the
kernel" is how dri1 worked :-)
> 
> Aren't we taking a reference on all BOs that a job relates to and
> unreferencing them once the job is done? I would think that that's
> enough, or am I missing something?

Yup, this is how this is supposed to work.
-Daniel

> 
> > What about evictions? Need to call mapping_set_unevictable()? Maybe we
> > want these pages to be swappable, but then we need some notification to
> > unmap them.
> 
> I'm not sure there's much point in swapping out pages with lifetimes
> of a few milliseconds.
> 
> > No need for runtime PM, because faults should only occur during job
> > submit?
> 
> Makes sense to me.
> 
> > Need to fault in 2MB sections when possible. It seems this has to be
> > done by getting an array of struct pages and then converting to a
> > scatterlist. Not sure if there is a better way to do that. There's also
> > some errata on T604 needing to fault in at least 8 pages at a time which
> > isn't handled.
> 
> The old GPUs will be the most interesting to support...
> 
> Will give this patch some testing tomorrow.
> 
> Thanks,
> 
> Tomeu
> 
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Steven Price <steven.price@arm.com>
> > Cc: Alyssa Rosenzweig <alyssa@rosenzweig.io>
> > Cc: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> > Signed-off-by: Rob Herring <robh@kernel.org>
> > ---
> >
> >  drivers/gpu/drm/panfrost/TODO           |  2 -
> >  drivers/gpu/drm/panfrost/panfrost_drv.c | 22 +++++--
> >  drivers/gpu/drm/panfrost/panfrost_gem.c |  3 +-
> >  drivers/gpu/drm/panfrost/panfrost_gem.h |  8 +++
> >  drivers/gpu/drm/panfrost/panfrost_mmu.c | 76 +++++++++++++++++++++++--
> >  include/uapi/drm/panfrost_drm.h         |  2 +
> >  6 files changed, 100 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panfrost/TODO b/drivers/gpu/drm/panfrost/TODO
> > index c2e44add37d8..64129bf73933 100644
> > --- a/drivers/gpu/drm/panfrost/TODO
> > +++ b/drivers/gpu/drm/panfrost/TODO
> > @@ -14,8 +14,6 @@
> >    The hard part is handling when more address spaces are needed than what
> >    the h/w provides.
> >
> > -- Support pinning pages on demand (GPU page faults).
> > -
> >  - Support userspace controlled GPU virtual addresses. Needed for Vulkan. (Tomeu)
> >
> >  - Support for madvise and a shrinker.
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
> > index d11e2281dde6..f08d41d5186a 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> > @@ -44,9 +44,11 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
> >  {
> >         int ret;
> >         struct drm_gem_shmem_object *shmem;
> > +       struct panfrost_gem_object *bo;
> >         struct drm_panfrost_create_bo *args = data;
> >
> > -       if (!args->size || args->flags || args->pad)
> > +       if (!args->size || args->pad ||
> > +           (args->flags & ~PANFROST_BO_NOMAP))
> >                 return -EINVAL;
> >
> >         shmem = drm_gem_shmem_create_with_handle(file, dev, args->size,
> > @@ -54,11 +56,16 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
> >         if (IS_ERR(shmem))
> >                 return PTR_ERR(shmem);
> >
> > -       ret = panfrost_mmu_map(to_panfrost_bo(&shmem->base));
> > -       if (ret)
> > -               goto err_free;
> > +       bo = to_panfrost_bo(&shmem->base);
> >
> > -       args->offset = to_panfrost_bo(&shmem->base)->node.start << PAGE_SHIFT;
> > +       if (!(args->flags & PANFROST_BO_NOMAP)) {
> > +               ret = panfrost_mmu_map(bo);
> > +               if (ret)
> > +                       goto err_free;
> > +       } else
> > +               bo->no_map = 1;
> > +
> > +       args->offset = bo->node.start << PAGE_SHIFT;
> >
> >         return 0;
> >
> > @@ -269,6 +276,11 @@ static int panfrost_ioctl_mmap_bo(struct drm_device *dev, void *data,
> >                 return -ENOENT;
> >         }
> >
> > +       if (to_panfrost_bo(gem_obj)->no_map) {
> > +               DRM_DEBUG("Can't mmap 'no_map' GEM BO %d\n", args->handle);
> > +               return -EINVAL;
> > +       }
> > +
> >         ret = drm_gem_create_mmap_offset(gem_obj);
> >         if (ret == 0)
> >                 args->offset = drm_vma_node_offset_addr(&gem_obj->vma_node);
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c
> > index 886875ae31d3..ed0b4f79610a 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_gem.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
> > @@ -19,7 +19,8 @@ static void panfrost_gem_free_object(struct drm_gem_object *obj)
> >         struct panfrost_gem_object *bo = to_panfrost_bo(obj);
> >         struct panfrost_device *pfdev = obj->dev->dev_private;
> >
> > -       panfrost_mmu_unmap(bo);
> > +       if (!bo->no_map)
> > +               panfrost_mmu_unmap(bo);
> >
> >         spin_lock(&pfdev->mm_lock);
> >         drm_mm_remove_node(&bo->node);
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h
> > index 045000eb5fcf..aa210d7cbf3f 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_gem.h
> > +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
> > @@ -11,6 +11,7 @@ struct panfrost_gem_object {
> >         struct drm_gem_shmem_object base;
> >
> >         struct drm_mm_node node;
> > +       unsigned no_map: 1;
> >  };
> >
> >  static inline
> > @@ -19,6 +20,13 @@ struct  panfrost_gem_object *to_panfrost_bo(struct drm_gem_object *obj)
> >         return container_of(to_drm_gem_shmem_obj(obj), struct panfrost_gem_object, base);
> >  }
> >
> > +static inline
> > +struct  panfrost_gem_object *drm_mm_node_to_panfrost_bo(struct drm_mm_node *node)
> > +{
> > +       return container_of(node, struct panfrost_gem_object, node);
> > +}
> > +
> > +
> >  struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t size);
> >
> >  struct drm_gem_object *
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> > index 41d184fab07f..5a36495a38f3 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> > @@ -3,6 +3,7 @@
> >  /* Copyright (C) 2019 Arm Ltd. */
> >  #include <linux/bitfield.h>
> >  #include <linux/delay.h>
> > +#include <linux/dma-mapping.h>
> >  #include <linux/interrupt.h>
> >  #include <linux/io.h>
> >  #include <linux/iopoll.h>
> > @@ -10,6 +11,7 @@
> >  #include <linux/iommu.h>
> >  #include <linux/platform_device.h>
> >  #include <linux/pm_runtime.h>
> > +#include <linux/shmem_fs.h>
> >  #include <linux/sizes.h>
> >
> >  #include "panfrost_device.h"
> > @@ -277,6 +279,60 @@ static const struct iommu_gather_ops mmu_tlb_ops = {
> >         .tlb_sync       = mmu_tlb_sync_context,
> >  };
> >
> > +static struct drm_mm_node *addr_to_drm_mm_node(struct panfrost_device *pfdev, int as, u64 addr)
> > +{
> > +       struct drm_mm_node *node;
> > +       u64 offset = addr >> PAGE_SHIFT;
> > +
> > +       drm_mm_for_each_node(node, &pfdev->mm) {
> > +               if (offset >= node->start && offset < (node->start + node->size))
> > +                       return node;
> > +       }
> > +       return NULL;
> > +}
> > +
> > +int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as, u64 addr)
> > +{
> > +       struct page *page;
> > +       dma_addr_t dma_addr;
> > +       struct drm_mm_node *node;
> > +       struct panfrost_gem_object *bo;
> > +       struct io_pgtable_ops *ops = pfdev->mmu->pgtbl_ops;
> > +       pgoff_t page_offset = addr >> PAGE_SHIFT;
> > +
> > +       addr &= PAGE_MASK;
> > +
> > +       node = addr_to_drm_mm_node(pfdev, as, addr);
> > +       if (!node)
> > +               return -ENOENT;
> > +
> > +       page_offset -= node->start;
> > +       bo = drm_mm_node_to_panfrost_bo(node);
> > +       if (WARN_ON(!bo->no_map))
> > +               return -EINVAL;
> > +
> > +       dev_info(pfdev->dev, "mapping page fault @ %llx", addr);
> > +
> > +       page = shmem_read_mapping_page(bo->base.base.filp->f_mapping,
> > +                                      page_offset);
> > +       if (IS_ERR(page))
> > +               return PTR_ERR(page);
> > +
> > +       dma_addr = dma_map_page(pfdev->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
> > +
> > +       mutex_lock(&pfdev->mmu->lock);
> > +
> > +       ops->map(ops, addr, dma_addr, PAGE_SIZE, IOMMU_WRITE | IOMMU_READ);
> > +
> > +       mmu_write(pfdev, MMU_INT_CLEAR, BIT(as));
> > +
> > +       mmu_hw_do_operation(pfdev, as, addr, PAGE_SIZE, AS_COMMAND_FLUSH_PT);
> > +
> > +       mutex_unlock(&pfdev->mmu->lock);
> > +
> > +       return 0;
> > +}
> > +
> >  static const char *access_type_name(struct panfrost_device *pfdev,
> >                 u32 fault_status)
> >  {
> > @@ -302,13 +358,11 @@ static irqreturn_t panfrost_mmu_irq_handler(int irq, void *data)
> >  {
> >         struct panfrost_device *pfdev = data;
> >         u32 status = mmu_read(pfdev, MMU_INT_STAT);
> > -       int i;
> > +       int i, ret;
> >
> >         if (!status)
> >                 return IRQ_NONE;
> >
> > -       dev_err(pfdev->dev, "mmu irq status=%x\n", status);
> > -
> >         for (i = 0; status; i++) {
> >                 u32 mask = BIT(i) | BIT(i + 16);
> >                 u64 addr;
> > @@ -329,6 +383,17 @@ static irqreturn_t panfrost_mmu_irq_handler(int irq, void *data)
> >                 access_type = (fault_status >> 8) & 0x3;
> >                 source_id = (fault_status >> 16);
> >
> > +               /* Page fault only */
> > +               if ((status & mask) == BIT(i)) {
> > +                       WARN_ON(exception_type < 0xC1 || exception_type > 0xC4);
> > +
> > +                       ret = panfrost_mmu_map_fault_addr(pfdev, i, addr);
> > +                       if (!ret) {
> > +                               status &= ~mask;
> > +                               continue;
> > +                       }
> > +               }
> > +
> >                 /* terminal fault, print info about the fault */
> >                 dev_err(pfdev->dev,
> >                         "Unhandled Page fault in AS%d at VA 0x%016llX\n"
> > @@ -370,8 +435,9 @@ int panfrost_mmu_init(struct panfrost_device *pfdev)
> >         if (irq <= 0)
> >                 return -ENODEV;
> >
> > -       err = devm_request_irq(pfdev->dev, irq, panfrost_mmu_irq_handler,
> > -                              IRQF_SHARED, "mmu", pfdev);
> > +       err = devm_request_threaded_irq(pfdev->dev, irq, NULL,
> > +                                       panfrost_mmu_irq_handler,
> > +                                       IRQF_ONESHOT, "mmu", pfdev);
> >
> >         if (err) {
> >                 dev_err(pfdev->dev, "failed to request mmu irq");
> > diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
> > index a52e0283b90d..6d83baa0e4c1 100644
> > --- a/include/uapi/drm/panfrost_drm.h
> > +++ b/include/uapi/drm/panfrost_drm.h
> > @@ -71,6 +71,8 @@ struct drm_panfrost_wait_bo {
> >         __s64 timeout_ns;       /* absolute */
> >  };
> >
> > +#define PANFROST_BO_NOMAP      1
> > +
> >  /**
> >   * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs.
> >   *
> > --
> > 2.20.1
> >
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Chris Wilson June 12, 2019, 3:03 p.m. UTC | #3
Quoting Rob Herring (2019-06-10 18:04:40)
> The midgard/bifrost GPUs need to allocate GPU memory which is allocated
> on GPU page faults and not pinned in memory. The vendor driver calls
> this functionality GROW_ON_GPF.
> 
> This implementation assumes that BOs allocated with the
> PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
> actually work, but I'm unsure if there's some interaction there. It
> would cause the whole object to be pinned in memory which would defeat
> the point of this.
> 
> Issues/questions/thoughts:
> 
> What's the difference between i_mapping and f_mapping?
> 
> What kind of clean-up on close is needed? Based on vgem faults, there
> doesn't seem to be any refcounting. Assume userspace is responsible for
> not freeing the BO while a page fault can occur?

See drm_gem_vm_open and drm_gem_vm_close.
-Chris
Rob Herring June 14, 2019, 9:22 p.m. UTC | #4
On Wed, Jun 12, 2019 at 6:55 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
>
> On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@kernel.org> wrote:
> >
> > The midgard/bifrost GPUs need to allocate GPU memory which is allocated
> > on GPU page faults and not pinned in memory. The vendor driver calls
> > this functionality GROW_ON_GPF.
> >
> > This implementation assumes that BOs allocated with the
> > PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
> > actually work, but I'm unsure if there's some interaction there. It
> > would cause the whole object to be pinned in memory which would defeat
> > the point of this.
> >
> > Issues/questions/thoughts:
> >
> > What's the difference between i_mapping and f_mapping?
> >
> > What kind of clean-up on close is needed? Based on vgem faults, there
> > doesn't seem to be any refcounting. Assume userspace is responsible for
> > not freeing the BO while a page fault can occur?
>
> Aren't we taking a reference on all BOs that a job relates to and
> unreferencing them once the job is done? I would think that that's
> enough, or am I missing something?

No, I think we're fine.

> > What about evictions? Need to call mapping_set_unevictable()? Maybe we
> > want these pages to be swappable, but then we need some notification to
> > unmap them.
>
> I'm not sure there's much point in swapping out pages with lifetimes
> of a few milliseconds.

The lifetime is *forever* though. If we don't allow swapping, then the
heap is grow only until the FD is closed. IIRC, the maximum size is on
the order of 1GB. Seems like you'd want to shrink it with some
trigger.

Rob
Tomeu Vizoso June 17, 2019, 5:15 a.m. UTC | #5
On Fri, 14 Jun 2019 at 23:22, Rob Herring <robh@kernel.org> wrote:
>
> On Wed, Jun 12, 2019 at 6:55 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> >
> > On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@kernel.org> wrote:
> > >
> > > The midgard/bifrost GPUs need to allocate GPU memory which is allocated
> > > on GPU page faults and not pinned in memory. The vendor driver calls
> > > this functionality GROW_ON_GPF.
> > >
> > > This implementation assumes that BOs allocated with the
> > > PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
> > > actually work, but I'm unsure if there's some interaction there. It
> > > would cause the whole object to be pinned in memory which would defeat
> > > the point of this.
> > >
> > > Issues/questions/thoughts:
> > >
> > > What's the difference between i_mapping and f_mapping?
> > >
> > > What kind of clean-up on close is needed? Based on vgem faults, there
> > > doesn't seem to be any refcounting. Assume userspace is responsible for
> > > not freeing the BO while a page fault can occur?
> >
> > Aren't we taking a reference on all BOs that a job relates to and
> > unreferencing them once the job is done? I would think that that's
> > enough, or am I missing something?
>
> No, I think we're fine.
>
> > > What about evictions? Need to call mapping_set_unevictable()? Maybe we
> > > want these pages to be swappable, but then we need some notification to
> > > unmap them.
> >
> > I'm not sure there's much point in swapping out pages with lifetimes
> > of a few milliseconds.
>
> The lifetime is *forever* though. If we don't allow swapping, then the
> heap is grow only until the FD is closed. IIRC, the maximum size is on
> the order of 1GB. Seems like you'd want to shrink it with some
> trigger.

I thought that the lifetime of the *contents* of the heap was that of
the job chain that wrote them? Otherwise, only the GPU would know what
can be discarded.

Cheers,

Tomeu
Rob Herring June 17, 2019, 2:56 p.m. UTC | #6
On Sun, Jun 16, 2019 at 11:15 PM Tomeu Vizoso
<tomeu.vizoso@collabora.com> wrote:
>
> On Fri, 14 Jun 2019 at 23:22, Rob Herring <robh@kernel.org> wrote:
> >
> > On Wed, Jun 12, 2019 at 6:55 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> > >
> > > On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@kernel.org> wrote:
> > > >
> > > > The midgard/bifrost GPUs need to allocate GPU memory which is allocated
> > > > on GPU page faults and not pinned in memory. The vendor driver calls
> > > > this functionality GROW_ON_GPF.
> > > >
> > > > This implementation assumes that BOs allocated with the
> > > > PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
> > > > actually work, but I'm unsure if there's some interaction there. It
> > > > would cause the whole object to be pinned in memory which would defeat
> > > > the point of this.
> > > >
> > > > Issues/questions/thoughts:
> > > >
> > > > What's the difference between i_mapping and f_mapping?
> > > >
> > > > What kind of clean-up on close is needed? Based on vgem faults, there
> > > > doesn't seem to be any refcounting. Assume userspace is responsible for
> > > > not freeing the BO while a page fault can occur?
> > >
> > > Aren't we taking a reference on all BOs that a job relates to and
> > > unreferencing them once the job is done? I would think that that's
> > > enough, or am I missing something?
> >
> > No, I think we're fine.
> >
> > > > What about evictions? Need to call mapping_set_unevictable()? Maybe we
> > > > want these pages to be swappable, but then we need some notification to
> > > > unmap them.
> > >
> > > I'm not sure there's much point in swapping out pages with lifetimes
> > > of a few milliseconds.
> >
> > The lifetime is *forever* though. If we don't allow swapping, then the
> > heap is grow only until the FD is closed. IIRC, the maximum size is on
> > the order of 1GB. Seems like you'd want to shrink it with some
> > trigger.
>
> I thought that the lifetime of the *contents* of the heap was that of
> the job chain that wrote them? Otherwise, only the GPU would know what
> can be discarded.

Yes, that's probably true. To take that to the extreme, we could
allocate and free the heap BO on each job chain. But we don't do that
because of the overhead. So mapping and unmapping is a similar trade
off of frequency vs. overhead. The question is when do we allow pages
to be swapped out (as that is unhandled in the current patch)? Our
choices are:

- at any time. This is what the patch currently does as we don't
prevent eviction. Though we'd need some mechanism to be notified when
a page is evicted which is currently missing.
- when a job finishes. We'd have to iterate thru BO's and mark pages
evict-able on NOMAP BOs. Not sure where we do that in the driver.
- when the BO is freed. This is the easiest to implement...

Rob
Tomeu Vizoso June 20, 2019, 5:50 a.m. UTC | #7
On Mon, 17 Jun 2019 at 16:56, Rob Herring <robh@kernel.org> wrote:
>
> On Sun, Jun 16, 2019 at 11:15 PM Tomeu Vizoso
> <tomeu.vizoso@collabora.com> wrote:
> >
> > On Fri, 14 Jun 2019 at 23:22, Rob Herring <robh@kernel.org> wrote:
> > >
> > > On Wed, Jun 12, 2019 at 6:55 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> > > >
> > > > On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@kernel.org> wrote:
> > > > >
> > > > > The midgard/bifrost GPUs need to allocate GPU memory which is allocated
> > > > > on GPU page faults and not pinned in memory. The vendor driver calls
> > > > > this functionality GROW_ON_GPF.
> > > > >
> > > > > This implementation assumes that BOs allocated with the
> > > > > PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
> > > > > actually work, but I'm unsure if there's some interaction there. It
> > > > > would cause the whole object to be pinned in memory which would defeat
> > > > > the point of this.
> > > > >
> > > > > Issues/questions/thoughts:
> > > > >
> > > > > What's the difference between i_mapping and f_mapping?
> > > > >
> > > > > What kind of clean-up on close is needed? Based on vgem faults, there
> > > > > doesn't seem to be any refcounting. Assume userspace is responsible for
> > > > > not freeing the BO while a page fault can occur?
> > > >
> > > > Aren't we taking a reference on all BOs that a job relates to and
> > > > unreferencing them once the job is done? I would think that that's
> > > > enough, or am I missing something?
> > >
> > > No, I think we're fine.
> > >
> > > > > What about evictions? Need to call mapping_set_unevictable()? Maybe we
> > > > > want these pages to be swappable, but then we need some notification to
> > > > > unmap them.
> > > >
> > > > I'm not sure there's much point in swapping out pages with lifetimes
> > > > of a few milliseconds.
> > >
> > > The lifetime is *forever* though. If we don't allow swapping, then the
> > > heap is grow only until the FD is closed. IIRC, the maximum size is on
> > > the order of 1GB. Seems like you'd want to shrink it with some
> > > trigger.
> >
> > I thought that the lifetime of the *contents* of the heap was that of
> > the job chain that wrote them? Otherwise, only the GPU would know what
> > can be discarded.
>
> Yes, that's probably true. To take that to the extreme, we could
> allocate and free the heap BO on each job chain. But we don't do that
> because of the overhead. So mapping and unmapping is a similar trade
> off of frequency vs. overhead. The question is when do we allow pages
> to be swapped out (as that is unhandled in the current patch)? Our
> choices are:
>
> - at any time. This is what the patch currently does as we don't
> prevent eviction. Though we'd need some mechanism to be notified when
> a page is evicted which is currently missing.
> - when a job finishes. We'd have to iterate thru BO's and mark pages
> evict-able on NOMAP BOs. Not sure where we do that in the driver.

My understanding is that any contents of NOMAP memory aren't expected
to persist across jobs. So, when a job finishes we can unmap all pages
that the job faulted on.

If we do that, then we won't have a strong need to allow NOMAP pages
to be swapped out because jobs aren't expected to take that long.

I could very well be missing something that is needed by Arm's blob
and not by Panfrost atm, but I don't see in kbase any mechanism for
the kernel to know when the GPU is done with a page, other than the
job that mapped it having finished.

Btw, I tested this patch locally and things seemed to work just fine.

Thanks,

Tomeu

> - when the BO is freed. This is the easiest to implement...
>
> Rob
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Steven Price June 27, 2019, 10:57 a.m. UTC | #8
Sorry for the slow response, I've been on holiday for a few weeks.

On 20/06/2019 06:50, Tomeu Vizoso wrote:
> On Mon, 17 Jun 2019 at 16:56, Rob Herring <robh@kernel.org> wrote:
>>
>> On Sun, Jun 16, 2019 at 11:15 PM Tomeu Vizoso
>> <tomeu.vizoso@collabora.com> wrote:
>>>
>>> On Fri, 14 Jun 2019 at 23:22, Rob Herring <robh@kernel.org> wrote:
>>>>
>>>> On Wed, Jun 12, 2019 at 6:55 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
>>>>>
>>>>> On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@kernel.org> wrote:
>>>>>>
>>>>>> The midgard/bifrost GPUs need to allocate GPU memory which is allocated
>>>>>> on GPU page faults and not pinned in memory. The vendor driver calls
>>>>>> this functionality GROW_ON_GPF.
>>>>>>
>>>>>> This implementation assumes that BOs allocated with the
>>>>>> PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
>>>>>> actually work, but I'm unsure if there's some interaction there. It
>>>>>> would cause the whole object to be pinned in memory which would defeat
>>>>>> the point of this.

Although in normal usage user space will never care about the contents
of growable memory it can be useful to be able to access it for
debugging (although not critical to have it working immediately). In
particular it allow submitting the jobs in a job chain separately.
Exporting I can't see a use-case for.

So personally I'd prefer not using a "NOMAP" flag to mean "grow on fault".

>>>>>> Issues/questions/thoughts:
>>>>>>
>>>>>> What's the difference between i_mapping and f_mapping?
>>>>>>
>>>>>> What kind of clean-up on close is needed? Based on vgem faults, there
>>>>>> doesn't seem to be any refcounting. Assume userspace is responsible for
>>>>>> not freeing the BO while a page fault can occur?
>>>>>
>>>>> Aren't we taking a reference on all BOs that a job relates to and
>>>>> unreferencing them once the job is done? I would think that that's
>>>>> enough, or am I missing something?
>>>>
>>>> No, I think we're fine.
>>>>
>>>>>> What about evictions? Need to call mapping_set_unevictable()? Maybe we
>>>>>> want these pages to be swappable, but then we need some notification to
>>>>>> unmap them.
>>>>>
>>>>> I'm not sure there's much point in swapping out pages with lifetimes
>>>>> of a few milliseconds.
>>>>
>>>> The lifetime is *forever* though. If we don't allow swapping, then the
>>>> heap is grow only until the FD is closed. IIRC, the maximum size is on
>>>> the order of 1GB. Seems like you'd want to shrink it with some
>>>> trigger.
>>>
>>> I thought that the lifetime of the *contents* of the heap was that of
>>> the job chain that wrote them? Otherwise, only the GPU would know what
>>> can be discarded.
>>
>> Yes, that's probably true. To take that to the extreme, we could
>> allocate and free the heap BO on each job chain. But we don't do that
>> because of the overhead. So mapping and unmapping is a similar trade
>> off of frequency vs. overhead. The question is when do we allow pages
>> to be swapped out (as that is unhandled in the current patch)? Our
>> choices are:
>>
>> - at any time. This is what the patch currently does as we don't
>> prevent eviction. Though we'd need some mechanism to be notified when
>> a page is evicted which is currently missing.
>> - when a job finishes. We'd have to iterate thru BO's and mark pages
>> evict-able on NOMAP BOs. Not sure where we do that in the driver.
> 
> My understanding is that any contents of NOMAP memory aren't expected
> to persist across jobs. So, when a job finishes we can unmap all pages
> that the job faulted on.
> 
> If we do that, then we won't have a strong need to allow NOMAP pages
> to be swapped out because jobs aren't expected to take that long.

It would certainly seem reasonable that the contents of NOMAP memory can
be thrown away when the job chain has been completed. But, there is a
potential performance improvement by not immediately unmapping/freeing
the memory but leaving it in the assumption a similar job will be
submitted later requiring roughly the same amount of memory.

Arm's blob/kernel have various mechanisms for freeing memory either
after a period of being idle (in the blob) or when a shrinker is called
(in kbase). The idea is that the heap memory is grown once to whatever
the content needs and then the same buffer (or small set of buffers) is
reused repeatedly. kbase has a mechanism called "ephemeral memory" (or
evictable) which is memory which normally remains mapped on the GPU, but
under memory pressure it can be freed (and later faulted in with empty
pages if accessed again). A pinning mechanism is used to ensure that
this doesn't happen in the middle of a job chain which uses the buffer.
This mechanism is referred to as "JIT" (Just In Time allocation) in places.

> I could very well be missing something that is needed by Arm's blob
> and not by Panfrost atm, but I don't see in kbase any mechanism for
> the kernel to know when the GPU is done with a page, other than the
> job that mapped it having finished.

Much of the memory management is done by the user space blob. The kernel
driver usually doesn't actually know what memory a job will access.
There are exceptions though, in particular: ephemeral memory (through
JIT) and imported memory.

Steve

> Btw, I tested this patch locally and things seemed to work just fine.
> 
> Thanks,
> 
> Tomeu
> 
>> - when the BO is freed. This is the easiest to implement...
>>
>> Rob
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
Steven Price June 27, 2019, 11:19 a.m. UTC | #9
On 10/06/2019 18:04, Rob Herring wrote:
> The midgard/bifrost GPUs need to allocate GPU memory which is allocated
> on GPU page faults and not pinned in memory. The vendor driver calls
> this functionality GROW_ON_GPF.
> 
> This implementation assumes that BOs allocated with the
> PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
> actually work, but I'm unsure if there's some interaction there. It
> would cause the whole object to be pinned in memory which would defeat
> the point of this.
> 
> Issues/questions/thoughts:
> 
> What's the difference between i_mapping and f_mapping?
> 
> What kind of clean-up on close is needed? Based on vgem faults, there
> doesn't seem to be any refcounting. Assume userspace is responsible for
> not freeing the BO while a page fault can occur?
> 
> What about evictions? Need to call mapping_set_unevictable()? Maybe we
> want these pages to be swappable, but then we need some notification to
> unmap them.
> 
> No need for runtime PM, because faults should only occur during job
> submit?

Although normally faults should only occur when a job is running I
believe in some situations it is possible for the job to have completed
while there is a page fault outstanding (I think this can occur when the
job fails at the same time as the page fault happens). In this situation
the GPU might be considered idle when the MMU irq is being handled.

> Need to fault in 2MB sections when possible. It seems this has to be
> done by getting an array of struct pages and then converting to a
> scatterlist. Not sure if there is a better way to do that. There's also
> some errata on T604 needing to fault in at least 8 pages at a time which
> isn't handled.

kbase simply doesn't support a page-by-page mapping of memory. There is
a region at the beginning of a buffer which is mapped and after that the
buffer is unmapped. It is then quite simple to round up the region to
map to the next 2MB section/8-page boundary. It's also a good
performance improvement because page faults are relatively expensive and
you might need to grow the buffer by a large size for the job to
complete (e.g. on the first few frames). There aren't many use-cases for
having a buffer with multiple holes in - although Vulkan does have a
"sparse" mechanism which might require something similar (but unlikely
to require faulting in).

> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Steven Price <steven.price@arm.com>
> Cc: Alyssa Rosenzweig <alyssa@rosenzweig.io>
> Cc: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> Signed-off-by: Rob Herring <robh@kernel.org>
> ---
> 
>  drivers/gpu/drm/panfrost/TODO           |  2 -
>  drivers/gpu/drm/panfrost/panfrost_drv.c | 22 +++++--
>  drivers/gpu/drm/panfrost/panfrost_gem.c |  3 +-
>  drivers/gpu/drm/panfrost/panfrost_gem.h |  8 +++
>  drivers/gpu/drm/panfrost/panfrost_mmu.c | 76 +++++++++++++++++++++++--
>  include/uapi/drm/panfrost_drm.h         |  2 +
>  6 files changed, 100 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/TODO b/drivers/gpu/drm/panfrost/TODO
> index c2e44add37d8..64129bf73933 100644
> --- a/drivers/gpu/drm/panfrost/TODO
> +++ b/drivers/gpu/drm/panfrost/TODO
> @@ -14,8 +14,6 @@
>    The hard part is handling when more address spaces are needed than what
>    the h/w provides.
>  
> -- Support pinning pages on demand (GPU page faults).
> -
>  - Support userspace controlled GPU virtual addresses. Needed for Vulkan. (Tomeu)
>  
>  - Support for madvise and a shrinker.
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index d11e2281dde6..f08d41d5186a 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -44,9 +44,11 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
>  {
>  	int ret;
>  	struct drm_gem_shmem_object *shmem;
> +	struct panfrost_gem_object *bo;
>  	struct drm_panfrost_create_bo *args = data;
>  
> -	if (!args->size || args->flags || args->pad)
> +	if (!args->size || args->pad ||
> +	    (args->flags & ~PANFROST_BO_NOMAP))
>  		return -EINVAL;
>  
>  	shmem = drm_gem_shmem_create_with_handle(file, dev, args->size,
> @@ -54,11 +56,16 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
>  	if (IS_ERR(shmem))
>  		return PTR_ERR(shmem);
>  
> -	ret = panfrost_mmu_map(to_panfrost_bo(&shmem->base));
> -	if (ret)
> -		goto err_free;
> +	bo = to_panfrost_bo(&shmem->base);
>  
> -	args->offset = to_panfrost_bo(&shmem->base)->node.start << PAGE_SHIFT;
> +	if (!(args->flags & PANFROST_BO_NOMAP)) {
> +		ret = panfrost_mmu_map(bo);
> +		if (ret)
> +			goto err_free;
> +	} else
> +		bo->no_map = 1;
> +
> +	args->offset = bo->node.start << PAGE_SHIFT;
>  
>  	return 0;
>  
> @@ -269,6 +276,11 @@ static int panfrost_ioctl_mmap_bo(struct drm_device *dev, void *data,
>  		return -ENOENT;
>  	}
>  
> +	if (to_panfrost_bo(gem_obj)->no_map) {
> +		DRM_DEBUG("Can't mmap 'no_map' GEM BO %d\n", args->handle);
> +		return -EINVAL;
> +	}
> +
>  	ret = drm_gem_create_mmap_offset(gem_obj);
>  	if (ret == 0)
>  		args->offset = drm_vma_node_offset_addr(&gem_obj->vma_node);
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c
> index 886875ae31d3..ed0b4f79610a 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
> @@ -19,7 +19,8 @@ static void panfrost_gem_free_object(struct drm_gem_object *obj)
>  	struct panfrost_gem_object *bo = to_panfrost_bo(obj);
>  	struct panfrost_device *pfdev = obj->dev->dev_private;
>  
> -	panfrost_mmu_unmap(bo);
> +	if (!bo->no_map)
> +		panfrost_mmu_unmap(bo);
>  
>  	spin_lock(&pfdev->mm_lock);
>  	drm_mm_remove_node(&bo->node);
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h
> index 045000eb5fcf..aa210d7cbf3f 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
> @@ -11,6 +11,7 @@ struct panfrost_gem_object {
>  	struct drm_gem_shmem_object base;
>  
>  	struct drm_mm_node node;
> +	unsigned no_map: 1;
>  };
>  
>  static inline
> @@ -19,6 +20,13 @@ struct  panfrost_gem_object *to_panfrost_bo(struct drm_gem_object *obj)
>  	return container_of(to_drm_gem_shmem_obj(obj), struct panfrost_gem_object, base);
>  }
>  
> +static inline
> +struct  panfrost_gem_object *drm_mm_node_to_panfrost_bo(struct drm_mm_node *node)
> +{
> +	return container_of(node, struct panfrost_gem_object, node);
> +}
> +
> +
>  struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t size);
>  
>  struct drm_gem_object *
> diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> index 41d184fab07f..5a36495a38f3 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> @@ -3,6 +3,7 @@
>  /* Copyright (C) 2019 Arm Ltd. */
>  #include <linux/bitfield.h>
>  #include <linux/delay.h>
> +#include <linux/dma-mapping.h>
>  #include <linux/interrupt.h>
>  #include <linux/io.h>
>  #include <linux/iopoll.h>
> @@ -10,6 +11,7 @@
>  #include <linux/iommu.h>
>  #include <linux/platform_device.h>
>  #include <linux/pm_runtime.h>
> +#include <linux/shmem_fs.h>
>  #include <linux/sizes.h>
>  
>  #include "panfrost_device.h"
> @@ -277,6 +279,60 @@ static const struct iommu_gather_ops mmu_tlb_ops = {
>  	.tlb_sync	= mmu_tlb_sync_context,
>  };
>  
> +static struct drm_mm_node *addr_to_drm_mm_node(struct panfrost_device *pfdev, int as, u64 addr)
> +{
> +	struct drm_mm_node *node;
> +	u64 offset = addr >> PAGE_SHIFT;
> +
> +	drm_mm_for_each_node(node, &pfdev->mm) {
> +		if (offset >= node->start && offset < (node->start + node->size))
> +			return node;
> +	}
> +	return NULL;
> +}
> +
> +int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as, u64 addr)
> +{
> +	struct page *page;
> +	dma_addr_t dma_addr;
> +	struct drm_mm_node *node;
> +	struct panfrost_gem_object *bo;
> +	struct io_pgtable_ops *ops = pfdev->mmu->pgtbl_ops;
> +	pgoff_t page_offset = addr >> PAGE_SHIFT;
> +
> +	addr &= PAGE_MASK;
> +
> +	node = addr_to_drm_mm_node(pfdev, as, addr);
> +	if (!node)
> +		return -ENOENT;
> +
> +	page_offset -= node->start;
> +	bo = drm_mm_node_to_panfrost_bo(node);
> +	if (WARN_ON(!bo->no_map))
> +		return -EINVAL;
> +
> +	dev_info(pfdev->dev, "mapping page fault @ %llx", addr);
> +
> +	page = shmem_read_mapping_page(bo->base.base.filp->f_mapping,
> +				       page_offset);
> +	if (IS_ERR(page))
> +		return PTR_ERR(page);
> +
> +	dma_addr = dma_map_page(pfdev->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
> +
> +	mutex_lock(&pfdev->mmu->lock);
> +
> +	ops->map(ops, addr, dma_addr, PAGE_SIZE, IOMMU_WRITE | IOMMU_READ);
> +
> +	mmu_write(pfdev, MMU_INT_CLEAR, BIT(as));
> +
> +	mmu_hw_do_operation(pfdev, as, addr, PAGE_SIZE, AS_COMMAND_FLUSH_PT);
> +
> +	mutex_unlock(&pfdev->mmu->lock);
> +
> +	return 0;
> +}
> +
>  static const char *access_type_name(struct panfrost_device *pfdev,
>  		u32 fault_status)
>  {
> @@ -302,13 +358,11 @@ static irqreturn_t panfrost_mmu_irq_handler(int irq, void *data)
>  {
>  	struct panfrost_device *pfdev = data;
>  	u32 status = mmu_read(pfdev, MMU_INT_STAT);
> -	int i;
> +	int i, ret;
>  
>  	if (!status)
>  		return IRQ_NONE;
>  
> -	dev_err(pfdev->dev, "mmu irq status=%x\n", status);
> -
>  	for (i = 0; status; i++) {
>  		u32 mask = BIT(i) | BIT(i + 16);
>  		u64 addr;
> @@ -329,6 +383,17 @@ static irqreturn_t panfrost_mmu_irq_handler(int irq, void *data)
>  		access_type = (fault_status >> 8) & 0x3;
>  		source_id = (fault_status >> 16);
>  
> +		/* Page fault only */
> +		if ((status & mask) == BIT(i)) {
> +			WARN_ON(exception_type < 0xC1 || exception_type > 0xC4);
> +
> +			ret = panfrost_mmu_map_fault_addr(pfdev, i, addr);
> +			if (!ret) {
> +				status &= ~mask;
> +				continue;
> +			}
> +		}
> +
>  		/* terminal fault, print info about the fault */
>  		dev_err(pfdev->dev,
>  			"Unhandled Page fault in AS%d at VA 0x%016llX\n"
> @@ -370,8 +435,9 @@ int panfrost_mmu_init(struct panfrost_device *pfdev)
>  	if (irq <= 0)
>  		return -ENODEV;
>  
> -	err = devm_request_irq(pfdev->dev, irq, panfrost_mmu_irq_handler,
> -			       IRQF_SHARED, "mmu", pfdev);
> +	err = devm_request_threaded_irq(pfdev->dev, irq, NULL,
> +					panfrost_mmu_irq_handler,
> +					IRQF_ONESHOT, "mmu", pfdev);

Is it intentional to drop the SHARED flag? I have to admit I can't think
of a (real) platform which does share the IRQ, but kbase has always
supported sharing the GPU interrupts.

Steve

>  
>  	if (err) {
>  		dev_err(pfdev->dev, "failed to request mmu irq");
> diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
> index a52e0283b90d..6d83baa0e4c1 100644
> --- a/include/uapi/drm/panfrost_drm.h
> +++ b/include/uapi/drm/panfrost_drm.h
> @@ -71,6 +71,8 @@ struct drm_panfrost_wait_bo {
>  	__s64 timeout_ns;	/* absolute */
>  };
>  
> +#define PANFROST_BO_NOMAP	1
> +
>  /**
>   * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs.
>   *
>
Rob Herring June 27, 2019, 4:38 p.m. UTC | #10
On Thu, Jun 27, 2019 at 4:57 AM Steven Price <steven.price@arm.com> wrote:
>
> Sorry for the slow response, I've been on holiday for a few weeks.

Welcome back.

>
> On 20/06/2019 06:50, Tomeu Vizoso wrote:
> > On Mon, 17 Jun 2019 at 16:56, Rob Herring <robh@kernel.org> wrote:
> >>
> >> On Sun, Jun 16, 2019 at 11:15 PM Tomeu Vizoso
> >> <tomeu.vizoso@collabora.com> wrote:
> >>>
> >>> On Fri, 14 Jun 2019 at 23:22, Rob Herring <robh@kernel.org> wrote:
> >>>>
> >>>> On Wed, Jun 12, 2019 at 6:55 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> >>>>>
> >>>>> On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@kernel.org> wrote:
> >>>>>>
> >>>>>> The midgard/bifrost GPUs need to allocate GPU memory which is allocated
> >>>>>> on GPU page faults and not pinned in memory. The vendor driver calls
> >>>>>> this functionality GROW_ON_GPF.
> >>>>>>
> >>>>>> This implementation assumes that BOs allocated with the
> >>>>>> PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
> >>>>>> actually work, but I'm unsure if there's some interaction there. It
> >>>>>> would cause the whole object to be pinned in memory which would defeat
> >>>>>> the point of this.
>
> Although in normal usage user space will never care about the contents
> of growable memory it can be useful to be able to access it for
> debugging (although not critical to have it working immediately). In
> particular it allow submitting the jobs in a job chain separately.
> Exporting I can't see a use-case for.
>
> So personally I'd prefer not using a "NOMAP" flag to mean "grow on fault".

NOMAP means 'no gpu map on alloc'. The CPU mapping part is just a
limitation in the implementation which could be handled if needed.

NOPIN? It's not really 'growing' either as the total/max size is
fixed. No sure if that's the same for kbase. Maybe faults happen to be
sequential in addresses and it grows in that sense.

Maybe just saying what the buffer is used for (HEAP) would be best?
Speaking of alloc flags, Alyssa also mentioned we need a way to align
shader buffers. My suggestion there is an executable flag. That way we
can also set pages to XN. Though maybe alignment requirements need to
be explicit?

> >>>>>> Issues/questions/thoughts:
> >>>>>>
> >>>>>> What's the difference between i_mapping and f_mapping?
> >>>>>>
> >>>>>> What kind of clean-up on close is needed? Based on vgem faults, there
> >>>>>> doesn't seem to be any refcounting. Assume userspace is responsible for
> >>>>>> not freeing the BO while a page fault can occur?
> >>>>>
> >>>>> Aren't we taking a reference on all BOs that a job relates to and
> >>>>> unreferencing them once the job is done? I would think that that's
> >>>>> enough, or am I missing something?
> >>>>
> >>>> No, I think we're fine.
> >>>>
> >>>>>> What about evictions? Need to call mapping_set_unevictable()? Maybe we
> >>>>>> want these pages to be swappable, but then we need some notification to
> >>>>>> unmap them.
> >>>>>
> >>>>> I'm not sure there's much point in swapping out pages with lifetimes
> >>>>> of a few milliseconds.
> >>>>
> >>>> The lifetime is *forever* though. If we don't allow swapping, then the
> >>>> heap is grow only until the FD is closed. IIRC, the maximum size is on
> >>>> the order of 1GB. Seems like you'd want to shrink it with some
> >>>> trigger.
> >>>
> >>> I thought that the lifetime of the *contents* of the heap was that of
> >>> the job chain that wrote them? Otherwise, only the GPU would know what
> >>> can be discarded.
> >>
> >> Yes, that's probably true. To take that to the extreme, we could
> >> allocate and free the heap BO on each job chain. But we don't do that
> >> because of the overhead. So mapping and unmapping is a similar trade
> >> off of frequency vs. overhead. The question is when do we allow pages
> >> to be swapped out (as that is unhandled in the current patch)? Our
> >> choices are:
> >>
> >> - at any time. This is what the patch currently does as we don't
> >> prevent eviction. Though we'd need some mechanism to be notified when
> >> a page is evicted which is currently missing.
> >> - when a job finishes. We'd have to iterate thru BO's and mark pages
> >> evict-able on NOMAP BOs. Not sure where we do that in the driver.
> >
> > My understanding is that any contents of NOMAP memory aren't expected
> > to persist across jobs. So, when a job finishes we can unmap all pages
> > that the job faulted on.
> >
> > If we do that, then we won't have a strong need to allow NOMAP pages
> > to be swapped out because jobs aren't expected to take that long.
>
> It would certainly seem reasonable that the contents of NOMAP memory can
> be thrown away when the job chain has been completed. But, there is a
> potential performance improvement by not immediately unmapping/freeing
> the memory but leaving it in the assumption a similar job will be
> submitted later requiring roughly the same amount of memory.
>
> Arm's blob/kernel have various mechanisms for freeing memory either
> after a period of being idle (in the blob) or when a shrinker is called
> (in kbase). The idea is that the heap memory is grown once to whatever
> the content needs and then the same buffer (or small set of buffers) is
> reused repeatedly. kbase has a mechanism called "ephemeral memory" (or
> evictable) which is memory which normally remains mapped on the GPU, but
> under memory pressure it can be freed (and later faulted in with empty
> pages if accessed again). A pinning mechanism is used to ensure that
> this doesn't happen in the middle of a job chain which uses the buffer.
> This mechanism is referred to as "JIT" (Just In Time allocation) in places.

That's a bit simpler than what I assumed JIT was.

So there's 2 different cases of memory not pinned on alloc. The first
is the heap memory which is just faulted on demand (i.e during jobs)
and the 2nd is the JIT which is pinned some time between alloc and a
job submit. Is that correct? Is that 2 different allocation flags or 1
flag with 2 different ways to get pages pinned?


> > I could very well be missing something that is needed by Arm's blob
> > and not by Panfrost atm, but I don't see in kbase any mechanism for
> > the kernel to know when the GPU is done with a page, other than the
> > job that mapped it having finished.
>
> Much of the memory management is done by the user space blob. The kernel
> driver usually doesn't actually know what memory a job will access.
> There are exceptions though, in particular: ephemeral memory (through
> JIT) and imported memory.

Presumably that's a difference. We have a complete list of BOs for each job.

Rob
Steven Price June 28, 2019, 10:34 a.m. UTC | #11
On 27/06/2019 17:38, Rob Herring wrote:
> On Thu, Jun 27, 2019 at 4:57 AM Steven Price <steven.price@arm.com> wrote:
>>
>> Sorry for the slow response, I've been on holiday for a few weeks.
> 
> Welcome back.

Thanks!

>>
>> On 20/06/2019 06:50, Tomeu Vizoso wrote:
>>> On Mon, 17 Jun 2019 at 16:56, Rob Herring <robh@kernel.org> wrote:
>>>>
>>>> On Sun, Jun 16, 2019 at 11:15 PM Tomeu Vizoso
>>>> <tomeu.vizoso@collabora.com> wrote:
>>>>>
>>>>> On Fri, 14 Jun 2019 at 23:22, Rob Herring <robh@kernel.org> wrote:
>>>>>>
>>>>>> On Wed, Jun 12, 2019 at 6:55 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
>>>>>>>
>>>>>>> On Mon, 10 Jun 2019 at 19:06, Rob Herring <robh@kernel.org> wrote:
>>>>>>>>
>>>>>>>> The midgard/bifrost GPUs need to allocate GPU memory which is allocated
>>>>>>>> on GPU page faults and not pinned in memory. The vendor driver calls
>>>>>>>> this functionality GROW_ON_GPF.
>>>>>>>>
>>>>>>>> This implementation assumes that BOs allocated with the
>>>>>>>> PANFROST_BO_NOMAP flag are never mmapped or exported. Both of those may
>>>>>>>> actually work, but I'm unsure if there's some interaction there. It
>>>>>>>> would cause the whole object to be pinned in memory which would defeat
>>>>>>>> the point of this.
>>
>> Although in normal usage user space will never care about the contents
>> of growable memory it can be useful to be able to access it for
>> debugging (although not critical to have it working immediately). In
>> particular it allow submitting the jobs in a job chain separately.
>> Exporting I can't see a use-case for.
>>
>> So personally I'd prefer not using a "NOMAP" flag to mean "grow on fault".
> 
> NOMAP means 'no gpu map on alloc'. The CPU mapping part is just a
> limitation in the implementation which could be handled if needed.

Ah, well my confusion might be another indication it's not a great name ;)

> NOPIN? It's not really 'growing' either as the total/max size is
> fixed. No sure if that's the same for kbase. Maybe faults happen to be
> sequential in addresses and it grows in that sense.

It depends what you understand by pinning. To me pinning means that the
memory cannot be swapped out - which isn't the API level feature (e.g.
we might introduce support for swapping when the GPU isn't using the
memory). In kbase we call it "growing" because the amount of memory
allocated can increase - and indeed it grows in a similar way to a stack
on a CPU.

> Maybe just saying what the buffer is used for (HEAP) would be best?

That seems like a good name to me. User space doesn't really care how
the kernel manages the memory - it just wants to communicate that this
is temporary heap memory for the GPU to use.

> Speaking of alloc flags, Alyssa also mentioned we need a way to align
> shader buffers. My suggestion there is an executable flag. That way we
> can also set pages to XN. Though maybe alignment requirements need to
> be explicit?

kbase mostly handles this with a executable flag, so yes that seems a
reasonable way of handling it. Note, however, that there are a bunch of
wacky optimisation ideas that have been considered that require
particular alignment constraints. In particular kbase ended up with
BASE_MEM_TILER_ALIGN_TOP[1] which is somewhat of a hack to specify the
odd alignment requirement without adding extra fields to the ioctl.

[1]
https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/product/kernel/drivers/gpu/arm/midgard/mali_base_kernel.h#L197

One other thing that I don't think is well supported in panfrost at the
moment is that some units don't actually store the full VA address. The
most notable one is the PC - this is either 32 bit or 24 bit depending
on the GPU (although kbase always assumes 24 bit). This means that the
shader code must be aligned to not cross a 24 bit boundary. kbase also
has BASE_MEM_GPU_VA_SAME_4GB_PAGE for the same idea but restricted to a
32 bit size.

There's also a nasty limitation for executable memory - it can't start
(or end) on a 4GB boundary, see the code here which avoids picking those
addresses:

https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/product/kernel/drivers/gpu/arm/midgard/mali_kbase_mem.c#L279

Finally kbase has kbase_ioctl_mem_alias which allows creating aliases of
existing mappings with appropriate strides between them. This is an
optimisation for rending to multiple render targets efficiently and is
only needed for some GPUs. But I think we can leave that one for now.

[...]
>> It would certainly seem reasonable that the contents of NOMAP memory can
>> be thrown away when the job chain has been completed. But, there is a
>> potential performance improvement by not immediately unmapping/freeing
>> the memory but leaving it in the assumption a similar job will be
>> submitted later requiring roughly the same amount of memory.
>>
>> Arm's blob/kernel have various mechanisms for freeing memory either
>> after a period of being idle (in the blob) or when a shrinker is called
>> (in kbase). The idea is that the heap memory is grown once to whatever
>> the content needs and then the same buffer (or small set of buffers) is
>> reused repeatedly. kbase has a mechanism called "ephemeral memory" (or
>> evictable) which is memory which normally remains mapped on the GPU, but
>> under memory pressure it can be freed (and later faulted in with empty
>> pages if accessed again). A pinning mechanism is used to ensure that
>> this doesn't happen in the middle of a job chain which uses the buffer.
>> This mechanism is referred to as "JIT" (Just In Time allocation) in places.
> 
> That's a bit simpler than what I assumed JIT was.

Ah, well I have simplified it a bit in that description :)

There are effectively two features. Ephemeral memory is the DONT_NEED
flag which enables the memory to be freed under memory pressure when
it's not in use. JIT is then built on top of that and provides a
mechanism for the kernel to allocated ephemeral memory regions "just in
time" immediately before the jobs are sent to the GPU. This offloads the
decision about how many memory regions are needed to the kernel in the
hope that the kernel can dynamically choose the trade-off between
allocating lots of buffers (gives maximum flexibility in terms of job
scheduling) or saving memory by immediately running the fragment job so
the heap buffers can be recycled.

All I can say is that it's a locking nightmare (shrinkers can be called
some very annoying contexts). It's also not clear that the kernel is in
a better position to make the memory/performance trade-off decision than
user space.

> So there's 2 different cases of memory not pinned on alloc. The first
> is the heap memory which is just faulted on demand (i.e during jobs)
> and the 2nd is the JIT which is pinned some time between alloc and a
> job submit. Is that correct? Is that 2 different allocation flags or 1
> flag with 2 different ways to get pages pinned?

Yes that's correct. "Heap memory"[2] is just GROW_ON_GPF memory
allocated by user space. JIT memory is allocated by a 'soft-job'
(BASE_JD_REQ_SOFT_JIT_ALLOC) that user space inserts before the real GPU
jobs. This soft-job is responsible for allocating (or reusing) a buffer
(which is internally marked as GROW_ON_GPF) and ensuring it's pinned
(removing any DONT_NEED flag). After the GPU jobs have run there's
another soft-job (BASE_JD_REQ_SOFT_JIT_FREE) which will return the
buffer to a pool and set the DONT_NEED flag on it.

[2] We don't really have a term for this internally, it's just "growable
memory".

So both types are "grow on fault", the difference is that the
user-allocated "heap memory" well not be discarded or automatically
reused by the kernel, whereas JIT memory will be under the control of
the kernel after the soft-job frees it and so can be recycled/freed at
any time.

>>> I could very well be missing something that is needed by Arm's blob
>>> and not by Panfrost atm, but I don't see in kbase any mechanism for
>>> the kernel to know when the GPU is done with a page, other than the
>>> job that mapped it having finished.
>>
>> Much of the memory management is done by the user space blob. The kernel
>> driver usually doesn't actually know what memory a job will access.
>> There are exceptions though, in particular: ephemeral memory (through
>> JIT) and imported memory.
> 
> Presumably that's a difference. We have a complete list of BOs for each job.

Yes - that's something that I've repeatedly wished the blob driver had.
However it was an early design decision that the driver wouldn't need to
track what memory regions would be used. This meant for the exceptions
there has to be explicit tracking of the regions, which unfortunately
means imported memory ends up being quite 'special'.

Steve

> Rob
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/panfrost/TODO b/drivers/gpu/drm/panfrost/TODO
index c2e44add37d8..64129bf73933 100644
--- a/drivers/gpu/drm/panfrost/TODO
+++ b/drivers/gpu/drm/panfrost/TODO
@@ -14,8 +14,6 @@ 
   The hard part is handling when more address spaces are needed than what
   the h/w provides.
 
-- Support pinning pages on demand (GPU page faults).
-
 - Support userspace controlled GPU virtual addresses. Needed for Vulkan. (Tomeu)
 
 - Support for madvise and a shrinker.
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index d11e2281dde6..f08d41d5186a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -44,9 +44,11 @@  static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
 {
 	int ret;
 	struct drm_gem_shmem_object *shmem;
+	struct panfrost_gem_object *bo;
 	struct drm_panfrost_create_bo *args = data;
 
-	if (!args->size || args->flags || args->pad)
+	if (!args->size || args->pad ||
+	    (args->flags & ~PANFROST_BO_NOMAP))
 		return -EINVAL;
 
 	shmem = drm_gem_shmem_create_with_handle(file, dev, args->size,
@@ -54,11 +56,16 @@  static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
 	if (IS_ERR(shmem))
 		return PTR_ERR(shmem);
 
-	ret = panfrost_mmu_map(to_panfrost_bo(&shmem->base));
-	if (ret)
-		goto err_free;
+	bo = to_panfrost_bo(&shmem->base);
 
-	args->offset = to_panfrost_bo(&shmem->base)->node.start << PAGE_SHIFT;
+	if (!(args->flags & PANFROST_BO_NOMAP)) {
+		ret = panfrost_mmu_map(bo);
+		if (ret)
+			goto err_free;
+	} else
+		bo->no_map = 1;
+
+	args->offset = bo->node.start << PAGE_SHIFT;
 
 	return 0;
 
@@ -269,6 +276,11 @@  static int panfrost_ioctl_mmap_bo(struct drm_device *dev, void *data,
 		return -ENOENT;
 	}
 
+	if (to_panfrost_bo(gem_obj)->no_map) {
+		DRM_DEBUG("Can't mmap 'no_map' GEM BO %d\n", args->handle);
+		return -EINVAL;
+	}
+
 	ret = drm_gem_create_mmap_offset(gem_obj);
 	if (ret == 0)
 		args->offset = drm_vma_node_offset_addr(&gem_obj->vma_node);
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c
index 886875ae31d3..ed0b4f79610a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
@@ -19,7 +19,8 @@  static void panfrost_gem_free_object(struct drm_gem_object *obj)
 	struct panfrost_gem_object *bo = to_panfrost_bo(obj);
 	struct panfrost_device *pfdev = obj->dev->dev_private;
 
-	panfrost_mmu_unmap(bo);
+	if (!bo->no_map)
+		panfrost_mmu_unmap(bo);
 
 	spin_lock(&pfdev->mm_lock);
 	drm_mm_remove_node(&bo->node);
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h
index 045000eb5fcf..aa210d7cbf3f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.h
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
@@ -11,6 +11,7 @@  struct panfrost_gem_object {
 	struct drm_gem_shmem_object base;
 
 	struct drm_mm_node node;
+	unsigned no_map: 1;
 };
 
 static inline
@@ -19,6 +20,13 @@  struct  panfrost_gem_object *to_panfrost_bo(struct drm_gem_object *obj)
 	return container_of(to_drm_gem_shmem_obj(obj), struct panfrost_gem_object, base);
 }
 
+static inline
+struct  panfrost_gem_object *drm_mm_node_to_panfrost_bo(struct drm_mm_node *node)
+{
+	return container_of(node, struct panfrost_gem_object, node);
+}
+
+
 struct drm_gem_object *panfrost_gem_create_object(struct drm_device *dev, size_t size);
 
 struct drm_gem_object *
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 41d184fab07f..5a36495a38f3 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -3,6 +3,7 @@ 
 /* Copyright (C) 2019 Arm Ltd. */
 #include <linux/bitfield.h>
 #include <linux/delay.h>
+#include <linux/dma-mapping.h>
 #include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/iopoll.h>
@@ -10,6 +11,7 @@ 
 #include <linux/iommu.h>
 #include <linux/platform_device.h>
 #include <linux/pm_runtime.h>
+#include <linux/shmem_fs.h>
 #include <linux/sizes.h>
 
 #include "panfrost_device.h"
@@ -277,6 +279,60 @@  static const struct iommu_gather_ops mmu_tlb_ops = {
 	.tlb_sync	= mmu_tlb_sync_context,
 };
 
+static struct drm_mm_node *addr_to_drm_mm_node(struct panfrost_device *pfdev, int as, u64 addr)
+{
+	struct drm_mm_node *node;
+	u64 offset = addr >> PAGE_SHIFT;
+
+	drm_mm_for_each_node(node, &pfdev->mm) {
+		if (offset >= node->start && offset < (node->start + node->size))
+			return node;
+	}
+	return NULL;
+}
+
+int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as, u64 addr)
+{
+	struct page *page;
+	dma_addr_t dma_addr;
+	struct drm_mm_node *node;
+	struct panfrost_gem_object *bo;
+	struct io_pgtable_ops *ops = pfdev->mmu->pgtbl_ops;
+	pgoff_t page_offset = addr >> PAGE_SHIFT;
+
+	addr &= PAGE_MASK;
+
+	node = addr_to_drm_mm_node(pfdev, as, addr);
+	if (!node)
+		return -ENOENT;
+
+	page_offset -= node->start;
+	bo = drm_mm_node_to_panfrost_bo(node);
+	if (WARN_ON(!bo->no_map))
+		return -EINVAL;
+
+	dev_info(pfdev->dev, "mapping page fault @ %llx", addr);
+
+	page = shmem_read_mapping_page(bo->base.base.filp->f_mapping,
+				       page_offset);
+	if (IS_ERR(page))
+		return PTR_ERR(page);
+
+	dma_addr = dma_map_page(pfdev->dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
+
+	mutex_lock(&pfdev->mmu->lock);
+
+	ops->map(ops, addr, dma_addr, PAGE_SIZE, IOMMU_WRITE | IOMMU_READ);
+
+	mmu_write(pfdev, MMU_INT_CLEAR, BIT(as));
+
+	mmu_hw_do_operation(pfdev, as, addr, PAGE_SIZE, AS_COMMAND_FLUSH_PT);
+
+	mutex_unlock(&pfdev->mmu->lock);
+
+	return 0;
+}
+
 static const char *access_type_name(struct panfrost_device *pfdev,
 		u32 fault_status)
 {
@@ -302,13 +358,11 @@  static irqreturn_t panfrost_mmu_irq_handler(int irq, void *data)
 {
 	struct panfrost_device *pfdev = data;
 	u32 status = mmu_read(pfdev, MMU_INT_STAT);
-	int i;
+	int i, ret;
 
 	if (!status)
 		return IRQ_NONE;
 
-	dev_err(pfdev->dev, "mmu irq status=%x\n", status);
-
 	for (i = 0; status; i++) {
 		u32 mask = BIT(i) | BIT(i + 16);
 		u64 addr;
@@ -329,6 +383,17 @@  static irqreturn_t panfrost_mmu_irq_handler(int irq, void *data)
 		access_type = (fault_status >> 8) & 0x3;
 		source_id = (fault_status >> 16);
 
+		/* Page fault only */
+		if ((status & mask) == BIT(i)) {
+			WARN_ON(exception_type < 0xC1 || exception_type > 0xC4);
+
+			ret = panfrost_mmu_map_fault_addr(pfdev, i, addr);
+			if (!ret) {
+				status &= ~mask;
+				continue;
+			}
+		}
+
 		/* terminal fault, print info about the fault */
 		dev_err(pfdev->dev,
 			"Unhandled Page fault in AS%d at VA 0x%016llX\n"
@@ -370,8 +435,9 @@  int panfrost_mmu_init(struct panfrost_device *pfdev)
 	if (irq <= 0)
 		return -ENODEV;
 
-	err = devm_request_irq(pfdev->dev, irq, panfrost_mmu_irq_handler,
-			       IRQF_SHARED, "mmu", pfdev);
+	err = devm_request_threaded_irq(pfdev->dev, irq, NULL,
+					panfrost_mmu_irq_handler,
+					IRQF_ONESHOT, "mmu", pfdev);
 
 	if (err) {
 		dev_err(pfdev->dev, "failed to request mmu irq");
diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index a52e0283b90d..6d83baa0e4c1 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -71,6 +71,8 @@  struct drm_panfrost_wait_bo {
 	__s64 timeout_ns;	/* absolute */
 };
 
+#define PANFROST_BO_NOMAP	1
+
 /**
  * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs.
  *