[v2,00/23] drm/msm+PM+icc: Make job_run() reclaim-safe

Message ID 20230320144356.803762-1-robdclark@gmail.com

Message

Rob Clark March 20, 2023, 2:43 p.m. UTC
From: Rob Clark <robdclark@chromium.org>

Inspired by https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/
it seemed like a good idea to get rid of memory allocation in the job_run()
fence signaling path, and use lockdep annotations to yell at us about
anything that could deadlock against shrinker/reclaim.  Anything that
can trigger reclaim, or block on any other thread that has triggered
reclaim, can block the GPU shrinker from releasing memory if it is
waiting on the job to complete, causing deadlock.
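
For reference, the annotation pattern at the heart of this looks roughly
like the below (a minimal sketch of dma_fence_begin_signalling() usage,
not code from this series; example_job_run() is a made-up name):

    #include <linux/dma-fence.h>
    #include <linux/slab.h>

    static void example_job_run(void)
    {
            bool cookie;
            void *buf;

            /* Tell lockdep we are on the fence signaling critical path: */
            cookie = dma_fence_begin_signalling();

            /*
             * A GFP_KERNEL allocation can recurse into reclaim, and
             * reclaim can block waiting on fences, so lockdep reports
             * the dma_fence_map vs fs_reclaim dependency as a
             * potential deadlock:
             */
            buf = kmalloc(64, GFP_KERNEL);  /* BAD: lockdep will splat */
            kfree(buf);

            dma_fence_end_signalling(cookie);
    }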

The first patch pre-allocates the hw_fence, splitting allocation and
initialization, to avoid allocation in the job_run() path.  The next
eight decouple the obj lock from job_run(), as the obj lock is required
to pin/unpin backing pages (i.e. holding an obj lock in job_run() could
deadlock the shrinker by blocking forward progress towards pinned
buffers becoming idle).  They are followed by two patches that use
idr_preload() to avoid memory allocations under locks indirectly
connected to the shrinker path; a sketch of that pattern is below.
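
The idr_preload() pattern in question looks roughly like this (a
generic sketch with made-up names, not the exact msm code):

    #include <linux/idr.h>
    #include <linux/spinlock.h>

    static DEFINE_IDR(example_idr);
    static DEFINE_SPINLOCK(example_lock);

    static int example_alloc_id(void *ptr)
    {
            int id;

            /* Preallocate idr nodes while allocation/reclaim is allowed: */
            idr_preload(GFP_KERNEL);
            spin_lock(&example_lock);

            /* GFP_NOWAIT: no reclaim possible under the spinlock */
            id = idr_alloc(&example_idr, ptr, 1, 0, GFP_NOWAIT);

            spin_unlock(&example_lock);
            idr_preload_end();

            return id;
    }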

Next are three patches to decouple initialization (where allocations
are needed) from GPU runpm and devfreq, to avoid allocations in the
fence signaling path, followed by various PM devfreq/QoS and
interconnect locking fixes to decouple initialization (allocation) from
runtime; the common shape of these changes is sketched below.
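
That shape is to hoist allocation into an init path and leave runtime
paths allocation-free, along these lines (purely illustrative, all
names made up):

    #include <linux/slab.h>
    #include <linux/sizes.h>
    #include <linux/string.h>

    struct example_gpu {
            void *scratch;
    };

    /* probe/init time: allocation (and thus reclaim) is fine here */
    static int example_init(struct example_gpu *gpu)
    {
            gpu->scratch = kzalloc(SZ_4K, GFP_KERNEL);
            return gpu->scratch ? 0 : -ENOMEM;
    }

    /*
     * runpm/devfreq/fence-signaling path: no allocations, only use
     * what example_init() already set up
     */
    static void example_resume(struct example_gpu *gpu)
    {
            memset(gpu->scratch, 0, SZ_4K);
    }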

And finally, the last patch is a modified version of danvet's patch to
add lockdep annotations to the gpu scheduler, done conditionally so
that drivers can opt in.

v2: Switch from embedding hw_fence in submit/job object to preallocating
    the hw_fence.  Rework "fenced unpin" locking to drop the obj lock
    from the fence signaling path (i.e. the part that was still WIP in
    the first iteration of the patchset).  Add the final patch to enable
    fence signaling annotations now that job_run() and job_free() are
    safe.  The PM devfreq/QoS and interconnect patches are unchanged.

Rob Clark (23):
  drm/msm: Pre-allocate hw_fence
  drm/msm: Move submit bo flags update from obj lock
  drm/msm/gem: Tidy up VMA API
  drm/msm: Decouple vma tracking from obj lock
  drm/msm/gem: Simplify vmap vs LRU tracking
  drm/gem: Export drm_gem_lru_move_tail_locked()
  drm/msm/gem: Move update_lru()
  drm/msm/gem: Protect pin_count/madv by LRU lock
  drm/msm/gem: Avoid obj lock in job_run()
  drm/msm: Switch idr_lock to spinlock
  drm/msm: Use idr_preload()
  drm/msm/gpu: Move fw loading out of hw_init() path
  drm/msm/gpu: Move BO allocation out of hw_init
  drm/msm/a6xx: Move ioremap out of hw_init path
  PM / devfreq: Drop unneeded locking to appease lockdep
  PM / devfreq: Teach lockdep about locking order
  PM / QoS: Fix constraints alloc vs reclaim locking
  PM / QoS: Decouple request alloc from dev_pm_qos_mtx
  PM / QoS: Teach lockdep about dev_pm_qos_mtx locking order
  soc: qcom: smd-rpm: Use GFP_ATOMIC in write path
  interconnect: Fix locking for runpm vs reclaim
  interconnect: Teach lockdep about icc_bw_lock order
  drm/sched: Add (optional) fence signaling annotation

 drivers/base/power/qos.c                   |  83 +++++++++---
 drivers/devfreq/devfreq.c                  |  52 ++++----
 drivers/gpu/drm/drm_gem.c                  |  11 +-
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c      |  48 ++++---
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c      |  18 ++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c      |  46 ++++---
 drivers/gpu/drm/msm/adreno/adreno_device.c |   6 +
 drivers/gpu/drm/msm/adreno/adreno_gpu.c    |   9 +-
 drivers/gpu/drm/msm/msm_drv.c              |   6 +-
 drivers/gpu/drm/msm/msm_fence.c            |  12 +-
 drivers/gpu/drm/msm/msm_fence.h            |   3 +-
 drivers/gpu/drm/msm/msm_gem.c              | 145 ++++++++++++++-------
 drivers/gpu/drm/msm/msm_gem.h              |  29 +++--
 drivers/gpu/drm/msm/msm_gem_submit.c       |  27 ++--
 drivers/gpu/drm/msm/msm_gem_vma.c          |  91 ++++++++++---
 drivers/gpu/drm/msm/msm_gpu.h              |   8 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |   9 +-
 drivers/gpu/drm/msm/msm_submitqueue.c      |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     |   9 ++
 drivers/interconnect/core.c                |  18 ++-
 drivers/soc/qcom/smd-rpm.c                 |   2 +-
 include/drm/drm_gem.h                      |   1 +
 include/drm/gpu_scheduler.h                |   2 +
 23 files changed, 416 insertions(+), 221 deletions(-)

Comments

Luben Tuikov March 21, 2023, 2:55 a.m. UTC | #1
This is good. Hopefully it helps us catch lockdep bugs.

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

Regards,
Luben

On 2023-03-20 10:43, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Based on
> https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/
> but made to be optional.
> 
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/msm_ringbuffer.c   | 1 +
>  drivers/gpu/drm/scheduler/sched_main.c | 9 +++++++++
>  include/drm/gpu_scheduler.h            | 2 ++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
> index b60199184409..7e42baf16cd0 100644
> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
> @@ -93,6 +93,7 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
>  	 /* currently managing hangcheck ourselves: */
>  	sched_timeout = MAX_SCHEDULE_TIMEOUT;
>  
> +	ring->sched.fence_signaling = true;
>  	ret = drm_sched_init(&ring->sched, &msm_sched_ops,
>  			num_hw_submissions, 0, sched_timeout,
>  			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 4e6ad6e122bc..c2ee44d6224b 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -978,10 +978,15 @@ static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
>  static int drm_sched_main(void *param)
>  {
>  	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
> +	const bool fence_signaling = sched->fence_signaling;
> +	bool fence_cookie;
>  	int r;
>  
>  	sched_set_fifo_low(current);
>  
> +	if (fence_signaling)
> +		fence_cookie = dma_fence_begin_signalling();
> +
>  	while (!kthread_should_stop()) {
>  		struct drm_sched_entity *entity = NULL;
>  		struct drm_sched_fence *s_fence;
> @@ -1039,6 +1044,10 @@ static int drm_sched_main(void *param)
>  
>  		wake_up(&sched->job_scheduled);
>  	}
> +
> +	if (fence_signaling)
> +		dma_fence_end_signalling(fence_cookie);
> +
>  	return 0;
>  }
>  
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 9db9e5e504ee..8f23ea522e22 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -483,6 +483,7 @@ struct drm_sched_backend_ops {
>   * @ready: marks if the underlying HW is ready to work
>   * @free_guilty: A hit to time out handler to free the guilty job.
>   * @dev: system &struct device
> + * @fence_signaling: Opt in to fence signaling annotations
>   *
>   * One scheduler is implemented for each hardware ring.
>   */
> @@ -507,6 +508,7 @@ struct drm_gpu_scheduler {
>  	bool				ready;
>  	bool				free_guilty;
>  	struct device			*dev;
> +	bool 				fence_signaling;
>  };
>  
>  int drm_sched_init(struct drm_gpu_scheduler *sched,
Rafael J. Wysocki March 27, 2023, 5:53 p.m. UTC | #2
On Mon, Mar 20, 2023 at 3:45 PM Rob Clark <robdclark@gmail.com> wrote:
>
> From: Rob Clark <robdclark@chromium.org>
>
> In the process of adding lockdep annotation for drm GPU scheduler's
> job_run() to detect potential deadlock against shrinker/reclaim, I hit
> this lockdep splat:
>
>    ======================================================
>    WARNING: possible circular locking dependency detected
>    6.2.0-rc8-debug+ #558 Tainted: G        W
>    ------------------------------------------------------
>    ring0/125 is trying to acquire lock:
>    ffffffd6d6ce0f28 (dev_pm_qos_mtx){+.+.}-{3:3}, at: dev_pm_qos_update_request+0x38/0x68
>
>    but task is already holding lock:
>    ffffff8087239208 (&gpu->active_lock){+.+.}-{3:3}, at: msm_gpu_submit+0xec/0x178
>
>    which lock already depends on the new lock.
>
>    the existing dependency chain (in reverse order) is:
>
>    -> #4 (&gpu->active_lock){+.+.}-{3:3}:
>           __mutex_lock+0xcc/0x3c8
>           mutex_lock_nested+0x30/0x44
>           msm_gpu_submit+0xec/0x178
>           msm_job_run+0x78/0x150
>           drm_sched_main+0x290/0x370
>           kthread+0xf0/0x100
>           ret_from_fork+0x10/0x20
>
>    -> #3 (dma_fence_map){++++}-{0:0}:
>           __dma_fence_might_wait+0x74/0xc0
>           dma_resv_lockdep+0x1f4/0x2f4
>           do_one_initcall+0x104/0x2bc
>           kernel_init_freeable+0x344/0x34c
>           kernel_init+0x30/0x134
>           ret_from_fork+0x10/0x20
>
>    -> #2 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
>           fs_reclaim_acquire+0x80/0xa8
>           slab_pre_alloc_hook.constprop.0+0x40/0x25c
>           __kmem_cache_alloc_node+0x60/0x1cc
>           __kmalloc+0xd8/0x100
>           topology_parse_cpu_capacity+0x8c/0x178
>           get_cpu_for_node+0x88/0xc4
>           parse_cluster+0x1b0/0x28c
>           parse_cluster+0x8c/0x28c
>           init_cpu_topology+0x168/0x188
>           smp_prepare_cpus+0x24/0xf8
>           kernel_init_freeable+0x18c/0x34c
>           kernel_init+0x30/0x134
>           ret_from_fork+0x10/0x20
>
>    -> #1 (fs_reclaim){+.+.}-{0:0}:
>           __fs_reclaim_acquire+0x3c/0x48
>           fs_reclaim_acquire+0x54/0xa8
>           slab_pre_alloc_hook.constprop.0+0x40/0x25c
>           __kmem_cache_alloc_node+0x60/0x1cc
>           kmalloc_trace+0x50/0xa8
>           dev_pm_qos_constraints_allocate+0x38/0x100
>           __dev_pm_qos_add_request+0xb0/0x1e8
>           dev_pm_qos_add_request+0x58/0x80
>           dev_pm_qos_expose_latency_limit+0x60/0x13c
>           register_cpu+0x12c/0x130
>           topology_init+0xac/0xbc
>           do_one_initcall+0x104/0x2bc
>           kernel_init_freeable+0x344/0x34c
>           kernel_init+0x30/0x134
>           ret_from_fork+0x10/0x20
>
>    -> #0 (dev_pm_qos_mtx){+.+.}-{3:3}:
>           __lock_acquire+0xe00/0x1060
>           lock_acquire+0x1e0/0x2f8
>           __mutex_lock+0xcc/0x3c8
>           mutex_lock_nested+0x30/0x44
>           dev_pm_qos_update_request+0x38/0x68
>           msm_devfreq_boost+0x40/0x70
>           msm_devfreq_active+0xc0/0xf0
>           msm_gpu_submit+0x10c/0x178
>           msm_job_run+0x78/0x150
>           drm_sched_main+0x290/0x370
>           kthread+0xf0/0x100
>           ret_from_fork+0x10/0x20
>
>    other info that might help us debug this:
>
>    Chain exists of:
>      dev_pm_qos_mtx --> dma_fence_map --> &gpu->active_lock
>
>     Possible unsafe locking scenario:
>
>           CPU0                    CPU1
>           ----                    ----
>      lock(&gpu->active_lock);
>                                   lock(dma_fence_map);
>                                   lock(&gpu->active_lock);
>      lock(dev_pm_qos_mtx);
>
>     *** DEADLOCK ***
>
>    3 locks held by ring0/123:
>     #0: ffffff8087251170 (&gpu->lock){+.+.}-{3:3}, at: msm_job_run+0x64/0x150
>     #1: ffffffd00b0e57e8 (dma_fence_map){++++}-{0:0}, at: msm_job_run+0x68/0x150
>     #2: ffffff8087251208 (&gpu->active_lock){+.+.}-{3:3}, at: msm_gpu_submit+0xec/0x178
>
>    stack backtrace:
>    CPU: 6 PID: 123 Comm: ring0 Not tainted 6.2.0-rc8-debug+ #559
>    Hardware name: Google Lazor (rev1 - 2) with LTE (DT)
>    Call trace:
>     dump_backtrace.part.0+0xb4/0xf8
>     show_stack+0x20/0x38
>     dump_stack_lvl+0x9c/0xd0
>     dump_stack+0x18/0x34
>     print_circular_bug+0x1b4/0x1f0
>     check_noncircular+0x78/0xac
>     __lock_acquire+0xe00/0x1060
>     lock_acquire+0x1e0/0x2f8
>     __mutex_lock+0xcc/0x3c8
>     mutex_lock_nested+0x30/0x44
>     dev_pm_qos_update_request+0x38/0x68
>     msm_devfreq_boost+0x40/0x70
>     msm_devfreq_active+0xc0/0xf0
>     msm_gpu_submit+0x10c/0x178
>     msm_job_run+0x78/0x150
>     drm_sched_main+0x290/0x370
>     kthread+0xf0/0x100
>     ret_from_fork+0x10/0x20
>
> The issue is that dev_pm_qos_mtx is held in the runpm suspend/resume (or
> freq change) path, but it is also held across allocations that could
> recurse into shrinker.
>
> Solve this by changing dev_pm_qos_constraints_allocate() into a function
> that can be called unconditionally before the device qos object is
> needed and before acquiring dev_pm_qos_mtx.  This way the allocations can
> be done without holding the mutex.  In the case that we raced with
> another thread to allocate the qos object, detect this *after* acquiring
> the dev_pm_qos_mtx and simply free the redundant allocations.

Honestly, I don't like this approach.

In particular, dropping a lock just in order to grab it again right
away is really confusing (and I'm not even sure it is correct ATM).

Let me think about how to possibly address the issue at hand in a different way.

> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/base/power/qos.c | 60 +++++++++++++++++++++++++++-------------
>  1 file changed, 41 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/base/power/qos.c b/drivers/base/power/qos.c
> index 8e93167f1783..f3e0c6b65635 100644
> --- a/drivers/base/power/qos.c
> +++ b/drivers/base/power/qos.c
> @@ -185,18 +185,24 @@ static int apply_constraint(struct dev_pm_qos_request *req,
>  }
>
>  /*
> - * dev_pm_qos_constraints_allocate
> + * dev_pm_qos_constraints_ensure_allocated
>   * @dev: device to allocate data for
>   *
> - * Called at the first call to add_request, for constraint data allocation
> - * Must be called with the dev_pm_qos_mtx mutex held
> + * Called to ensure that devices qos is allocated, before acquiring
> + * dev_pm_qos_mtx.
>   */
> -static int dev_pm_qos_constraints_allocate(struct device *dev)
> +static int dev_pm_qos_constraints_ensure_allocated(struct device *dev)
>  {
>         struct dev_pm_qos *qos;
>         struct pm_qos_constraints *c;
>         struct blocking_notifier_head *n;
>
> +       if (!dev)
> +               return -ENODEV;
> +
> +       if (!IS_ERR_OR_NULL(dev->power.qos))
> +               return 0;
> +
>         qos = kzalloc(sizeof(*qos), GFP_KERNEL);
>         if (!qos)
>                 return -ENOMEM;
> @@ -227,10 +233,26 @@ static int dev_pm_qos_constraints_allocate(struct device *dev)
>
>         INIT_LIST_HEAD(&qos->flags.list);
>
> +       mutex_lock(&dev_pm_qos_mtx);
> +
> +       if (!IS_ERR_OR_NULL(dev->power.qos)) {
> +               /*
> +                * We have raced with another task to create the qos.
> +                * No biggie, just free the resources we've allocated
> +                * outside of dev_pm_qos_mtx and move on with life.
> +                */
> +               kfree(n);
> +               kfree(qos);
> +               goto unlock;
> +       }
> +
>         spin_lock_irq(&dev->power.lock);
>         dev->power.qos = qos;
>         spin_unlock_irq(&dev->power.lock);
>
> +unlock:
> +       mutex_unlock(&dev_pm_qos_mtx);
> +
>         return 0;
>  }
>
> @@ -331,17 +353,15 @@ static int __dev_pm_qos_add_request(struct device *dev,
>  {
>         int ret = 0;
>
> -       if (!dev || !req || dev_pm_qos_invalid_req_type(dev, type))
> +       if (!req || dev_pm_qos_invalid_req_type(dev, type))
>                 return -EINVAL;
>
>         if (WARN(dev_pm_qos_request_active(req),
>                  "%s() called for already added request\n", __func__))
>                 return -EINVAL;
>
> -       if (IS_ERR(dev->power.qos))
> +       if (IS_ERR_OR_NULL(dev->power.qos))
>                 ret = -ENODEV;
> -       else if (!dev->power.qos)
> -               ret = dev_pm_qos_constraints_allocate(dev);
>
>         trace_dev_pm_qos_add_request(dev_name(dev), type, value);
>         if (ret)
> @@ -390,6 +410,10 @@ int dev_pm_qos_add_request(struct device *dev, struct dev_pm_qos_request *req,
>  {
>         int ret;
>
> +       ret = dev_pm_qos_constraints_ensure_allocated(dev);
> +       if (ret)
> +               return ret;
> +
>         mutex_lock(&dev_pm_qos_mtx);
>         ret = __dev_pm_qos_add_request(dev, req, type, value);
>         mutex_unlock(&dev_pm_qos_mtx);
> @@ -537,15 +561,11 @@ int dev_pm_qos_add_notifier(struct device *dev, struct notifier_block *notifier,
>  {
>         int ret = 0;
>
> -       mutex_lock(&dev_pm_qos_mtx);
> -
> -       if (IS_ERR(dev->power.qos))
> -               ret = -ENODEV;
> -       else if (!dev->power.qos)
> -               ret = dev_pm_qos_constraints_allocate(dev);
> -
> +       ret = dev_pm_qos_constraints_ensure_allocated(dev);
>         if (ret)
> -               goto unlock;
> +               return ret;
> +
> +       mutex_lock(&dev_pm_qos_mtx);
>
>         switch (type) {
>         case DEV_PM_QOS_RESUME_LATENCY:
> @@ -565,7 +585,6 @@ int dev_pm_qos_add_notifier(struct device *dev, struct notifier_block *notifier,
>                 ret = -EINVAL;
>         }
>
> -unlock:
>         mutex_unlock(&dev_pm_qos_mtx);
>         return ret;
>  }
> @@ -905,10 +924,13 @@ int dev_pm_qos_update_user_latency_tolerance(struct device *dev, s32 val)
>  {
>         int ret;
>
> +       ret = dev_pm_qos_constraints_ensure_allocated(dev);
> +       if (ret)
> +               return ret;
> +
>         mutex_lock(&dev_pm_qos_mtx);
>
> -       if (IS_ERR_OR_NULL(dev->power.qos)
> -           || !dev->power.qos->latency_tolerance_req) {
> +       if (!dev->power.qos->latency_tolerance_req) {
>                 struct dev_pm_qos_request *req;
>
>                 if (val < 0) {
> --
Rob Clark March 27, 2023, 7:52 p.m. UTC | #3
On Mon, Mar 27, 2023 at 10:53 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Mar 20, 2023 at 3:45 PM Rob Clark <robdclark@gmail.com> wrote:
> >
> > From: Rob Clark <robdclark@chromium.org>
> >
> > In the process of adding lockdep annotation for drm GPU scheduler's
> > job_run() to detect potential deadlock against shrinker/reclaim, I hit
> > this lockdep splat:
> >
> >    [...]
> >
> > The issue is that dev_pm_qos_mtx is held in the runpm suspend/resume (or
> > freq change) path, but it is also held across allocations that could
> > recurse into shrinker.
> >
> > Solve this by changing dev_pm_qos_constraints_allocate() into a function
> > that can be called unconditionally before the device qos object is
> > needed and before acquiring dev_pm_qos_mtx.  This way the allocations can
> > be done without holding the mutex.  In the case that we raced with
> > another thread to allocate the qos object, detect this *after* acquiring
> > the dev_pm_qos_mtx and simply free the redundant allocations.
>
> Honestly, I don't like this approach.
>
> In particular, dropping a lock just in order to grab it again right
> away is really confusing (and I'm not even sure it is correct ATM).

This patch isn't actually doing that.  (And you are right: if it were,
that would be a thing to be suspicious of.)

It is just moving the allocations ahead of the locking.
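
i.e. the usual allocate-outside-the-lock pattern, roughly like this
(illustrative names only, not the actual qos.c code):

    #include <linux/mutex.h>
    #include <linux/slab.h>

    struct foo { int val; };
    struct bar { struct foo *foo; };

    static DEFINE_MUTEX(big_mtx);

    static int ensure_foo_allocated(struct bar *obj)
    {
            /* allocate before taking the mutex, where reclaim is fine: */
            struct foo *spare = kzalloc(sizeof(*spare), GFP_KERNEL);

            if (!spare)
                    return -ENOMEM;

            mutex_lock(&big_mtx);
            if (!obj->foo)
                    obj->foo = spare;   /* we won the race */
            else
                    kfree(spare);       /* lost the race, drop the spare */
            mutex_unlock(&big_mtx);

            return 0;
    }

The mutex is taken once, after the allocation; it is never dropped and
re-grabbed.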

> Let me think about how to possibly address the issue at hand in a different way.

Per-device locking would make this easier.  But I suppose that gets
into needing ww_mutex when you have things like device_link?

BR,
-R

Bjorn Andersson April 7, 2023, 5:41 p.m. UTC | #4
On Mon, 20 Mar 2023 07:43:22 -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Inspired by https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/
> it seemed like a good idea to get rid of memory allocation in the job_run()
> fence signaling path, and use lockdep annotations to yell at us about
> anything that could deadlock against shrinker/reclaim.  Anything that
> can trigger reclaim, or block on any other thread that has triggered
> reclaim, can block the GPU shrinker from releasing memory if it is
> waiting on the job to complete, causing deadlock.
> 
> [...]

Applied, thanks!

[20/23] soc: qcom: smd-rpm: Use GFP_ATOMIC in write path
        commit: 5808c532ca0a983d643319caca44f2bcb148298f

Best regards,