Message ID: 20250514170118.40555-5-robdclark@gmail.com
State:      New
Series:     drm/msm: sparse / "VM_BIND" support
Hello,

On Wed, 2025-05-14 at 09:59 -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> Similar to the existing credit limit mechanism, but applying to jobs
> enqueued to the scheduler but not yet run.
>
> The use case is to put an upper bound on preallocated, and potentially
> unneeded, pgtable pages. When this limit is exceeded, pushing new jobs
> will block until the count drops below the limit.

The commit message doesn't make clear why that's needed within the
scheduler.
On Thu, May 15, 2025 at 12:15 PM Rob Clark <robdclark@chromium.org> wrote:
>
> On Thu, May 15, 2025 at 2:28 AM Philipp Stanner <phasta@mailbox.org> wrote:
> >
> > Hello,
> >
> > On Wed, 2025-05-14 at 09:59 -0700, Rob Clark wrote:
> > > From: Rob Clark <robdclark@chromium.org>
> > >
> > > Similar to the existing credit limit mechanism, but applying to jobs
> > > enqueued to the scheduler but not yet run.
> > >
> > > The use case is to put an upper bound on preallocated, and potentially
> > > unneeded, pgtable pages. When this limit is exceeded, pushing new jobs
> > > will block until the count drops below the limit.
> >
> > The commit message doesn't make clear why that's needed within the
> > scheduler.
> >
> > From what I understand from the cover letter, this is a (rare?) Vulkan
> > feature. And as important as Vulkan is, it's the drivers that implement
> > support for it. I don't see why the scheduler is a blocker.
>
> Maybe not rare, or at least it comes up with a group of deqp-vk tests ;-)
>
> Basically it is a way to throttle userspace to prevent it from OoM'ing
> itself. (I suppose userspace could throttle itself, but it doesn't
> really know how much pre-allocation will need to be done for pgtable
> updates.)

For some context, other drivers have the concept of a "synchronous"
VM_BIND ioctl which completes immediately, and drivers implement it by
waiting for the whole thing to finish before returning. But this doesn't
work for native context, where everything has to be asynchronous, so
we're trying a new approach where we instead submit an asynchronous bind
for "normal" (non-sparse/driver internal) allocations and only attach
its out-fence to the in-fence of subsequent submits to other queues.
Once you do this then you need a limit like this to prevent memory usage
from pending page table updates from getting out of control. Other
drivers haven't needed this yet, but they will when they get native
context support.

Connor

> > All the knowledge about when to stop pushing into the entity is in the
> > driver, and the scheduler obtains all the knowledge about that from the
> > driver anyways.
> >
> > So you could do
> >
> >         if (my_vulkan_condition())
> >                 drm_sched_entity_push_job();
> >
> > couldn't you?
>
> It would need to reach in and use the sched's job_scheduled
> wait_queue_head_t... if that isn't too ugly, maybe the rest could be
> implemented on top of sched. But it seemed like a reasonable thing for
> the scheduler to support directly.
>
> > > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > > ---
> > >  drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++++--
> > >  drivers/gpu/drm/scheduler/sched_main.c   |  3 +++
> > >  include/drm/gpu_scheduler.h              | 13 ++++++++++++-
> > >  3 files changed, 29 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index dc0e60d2c14b..c5f688362a34 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -580,11 +580,21 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> > >   * under common lock for the struct drm_sched_entity that was set up for
> > >   * @sched_job in drm_sched_job_init().
> > >   */
> > > -void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > +int drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> >
> > Return code would need to be documented in the docstring, too. If we'd
> > go for that solution.
> >
> > >  {
> > >  	struct drm_sched_entity *entity = sched_job->entity;
> > > +	struct drm_gpu_scheduler *sched = sched_job->sched;
> > >  	bool first;
> > >  	ktime_t submit_ts;
> > > +	int ret;
> > > +
> > > +	ret = wait_event_interruptible(
> > > +			sched->job_scheduled,
> > > +			atomic_read(&sched->enqueue_credit_count) <=
> > > +			sched->enqueue_credit_limit);
> >
> > This very significantly changes the function's semantics. This function
> > is used in a great many drivers, and here it would be transformed into
> > a function that can block.
> >
> > From what I see below those credits are to be optional. But even if, it
> > needs to be clearly documented when a function can block.
>
> Sure. The behavior changes only for drivers that use the
> enqueue_credit_limit, so other drivers should be unaffected.
>
> I can improve the docs.
>
> (Maybe push_credit or something else would be a better name than
> enqueue_credit?)
>
> > > +	if (ret)
> > > +		return ret;
> > > +	atomic_add(sched_job->enqueue_credits, &sched->enqueue_credit_count);
> > >
> > >  	trace_drm_sched_job(sched_job, entity);
> > >  	atomic_inc(entity->rq->sched->score);
> > > @@ -609,7 +619,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > >  			spin_unlock(&entity->lock);
> > >
> > >  			DRM_ERROR("Trying to push to a killed entity\n");
> > > -			return;
> > > +			return -EINVAL;
> > >  		}
> > >
> > >  		rq = entity->rq;
> > > @@ -626,5 +636,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > >
> > >  		drm_sched_wakeup(sched);
> > >  	}
> > > +
> > > +	return 0;
> > >  }
> > >  EXPORT_SYMBOL(drm_sched_entity_push_job);
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 9412bffa8c74..1102cca69cb4 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1217,6 +1217,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
> > >
> > >  	trace_drm_run_job(sched_job, entity);
> > >  	fence = sched->ops->run_job(sched_job);
> > > +	atomic_sub(sched_job->enqueue_credits, &sched->enqueue_credit_count);
> > >  	complete_all(&entity->entity_idle);
> > >  	drm_sched_fence_scheduled(s_fence, fence);
> > >
> > > @@ -1253,6 +1254,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
> > >
> > >  	sched->ops = args->ops;
> > >  	sched->credit_limit = args->credit_limit;
> > > +	sched->enqueue_credit_limit = args->enqueue_credit_limit;
> > >  	sched->name = args->name;
> > >  	sched->timeout = args->timeout;
> > >  	sched->hang_limit = args->hang_limit;
> > > @@ -1308,6 +1310,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
> > >  	INIT_LIST_HEAD(&sched->pending_list);
> > >  	spin_lock_init(&sched->job_list_lock);
> > >  	atomic_set(&sched->credit_count, 0);
> > > +	atomic_set(&sched->enqueue_credit_count, 0);
> > >  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > >  	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > >  	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index da64232c989d..d830ffe083f1 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -329,6 +329,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
> > >   * @s_fence: contains the fences for the scheduling of job.
> > >   * @finish_cb: the callback for the finished fence.
> > >   * @credits: the number of credits this job contributes to the scheduler
> > > + * @enqueue_credits: the number of enqueue credits this job contributes
> > >   * @work: Helper to reschedule job kill to different context.
> > >   * @id: a unique id assigned to each job scheduled on the scheduler.
> > >   * @karma: increment on every hang caused by this job. If this exceeds the hang
> > > @@ -366,6 +367,7 @@ struct drm_sched_job {
> > >
> > >  	enum drm_sched_priority	s_priority;
> > >  	u32			credits;
> > > +	u32			enqueue_credits;
> >
> > What's the policy of setting this?
> >
> > drm_sched_job_init() and drm_sched_job_arm() are responsible for
> > initializing jobs.
>
> It should be set before drm_sched_entity_push_job(). I wouldn't
> really expect drivers to know the value at drm_sched_job_init() time.
> But they would by the time drm_sched_entity_push_job() is called.
>
> > >  	/** @last_dependency: tracks @dependencies as they signal */
> > >  	unsigned int		last_dependency;
> > >  	atomic_t		karma;
> > > @@ -485,6 +487,10 @@ struct drm_sched_backend_ops {
> > >   * @ops: backend operations provided by the driver.
> > >   * @credit_limit: the credit limit of this scheduler
> > >   * @credit_count: the current credit count of this scheduler
> > > + * @enqueue_credit_limit: the credit limit of jobs pushed to scheduler and not
> > > + *	yet run
> > > + * @enqueue_credit_count: the current credit count of jobs pushed to scheduler
> > > + *	but not yet run
> > >   * @timeout: the time after which a job is removed from the scheduler.
> > >   * @name: name of the ring for which this scheduler is being used.
> > >   * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
> > > @@ -518,6 +524,8 @@ struct drm_gpu_scheduler {
> > >  	const struct drm_sched_backend_ops	*ops;
> > >  	u32				credit_limit;
> > >  	atomic_t			credit_count;
> > > +	u32				enqueue_credit_limit;
> > > +	atomic_t			enqueue_credit_count;
> > >  	long				timeout;
> > >  	const char			*name;
> > >  	u32				num_rqs;
> > > @@ -550,6 +558,8 @@ struct drm_gpu_scheduler {
> > >   * @num_rqs: Number of run-queues. This may be at most DRM_SCHED_PRIORITY_COUNT,
> > >   *	as there's usually one run-queue per priority, but may be less.
> > >   * @credit_limit: the number of credits this scheduler can hold from all jobs
> > > + * @enqueue_credit_limit: the number of credits that can be enqueued before
> > > + *	drm_sched_entity_push_job() blocks
> >
> > Is it optional or not? Can it be deactivated?
> >
> > It seems to me that it is optional, and so far only used in msm. If
> > there are no other parties in need for that mechanism, the right place
> > to have this feature probably is msm, which has all the knowledge about
> > when to block already.
>
> As with the existing credit_limit, it is optional. Although I think
> it would also be useful for other drivers that use drm sched for
> VM_BIND queues, for the same reason.
>
> BR,
> -R
>
> > Regards
> > P.
> >
> > >   * @hang_limit: number of times to allow a job to hang before dropping it.
> > >   *	This mechanism is DEPRECATED. Set it to 0.
> > >   * @timeout: timeout value in jiffies for submitted jobs.
> > > @@ -564,6 +574,7 @@ struct drm_sched_init_args {
> > >  	struct workqueue_struct *timeout_wq;
> > >  	u32 num_rqs;
> > >  	u32 credit_limit;
> > > +	u32 enqueue_credit_limit;
> > >  	unsigned int hang_limit;
> > >  	long timeout;
> > >  	atomic_t *score;
> > > @@ -600,7 +611,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
> > >  		       struct drm_sched_entity *entity,
> > >  		       u32 credits, void *owner);
> > >  void drm_sched_job_arm(struct drm_sched_job *job);
> > > -void drm_sched_entity_push_job(struct drm_sched_job *sched_job);
> > > +int drm_sched_entity_push_job(struct drm_sched_job *sched_job);
> > >  int drm_sched_job_add_dependency(struct drm_sched_job *job,
> > >  				 struct dma_fence *fence);
> > >  int drm_sched_job_add_syncobj_dependency(struct drm_sched_job *job,
> >
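To illustrate the alternative debated above, an open-coded driver-side
throttle would have to look roughly like the following sketch (all names
other than job_scheduled and drm_sched_entity_push_job() are hypothetical,
not actual msm code); it reaches into sched->job_scheduled, the wait queue
the scheduler wakes after running a job, which is the ugliness Rob refers to:

/*
 * Hypothetical driver-side throttle, sketching the alternative Philipp
 * suggests. It must peek at the scheduler's job_scheduled wait queue to
 * learn when queued work has drained.
 */
static int my_throttled_push_job(struct drm_sched_job *job,
                                 atomic_t *prealloc_count,
                                 u32 prealloc_limit, u32 prealloc_pages)
{
        struct drm_gpu_scheduler *sched = job->sched;
        int ret;

        /* Sleep until enough pending preallocations have drained. */
        ret = wait_event_interruptible(sched->job_scheduled,
                        atomic_read(prealloc_count) <= prealloc_limit);
        if (ret)
                return ret;

        /* Charge this job; the driver's run_job() would subtract it. */
        atomic_add(prealloc_pages, prealloc_count);
        drm_sched_entity_push_job(job);

        return 0;
}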
(Cc: Boris)

On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> For some context, other drivers have the concept of a "synchronous"
> VM_BIND ioctl which completes immediately, and drivers implement it by
> waiting for the whole thing to finish before returning.

Nouveau implements sync by issuing a normal async VM_BIND and
subsequently waiting for the out-fence synchronously.

> But this doesn't work for native context, where everything has to be
> asynchronous, so we're trying a new approach where we instead submit
> an asynchronous bind for "normal" (non-sparse/driver internal)
> allocations and only attach its out-fence to the in-fence of
> subsequent submits to other queues.

This is what nouveau does, and I think other drivers like Xe and panthor
do this as well.

> Once you do this then you need a limit like this to prevent memory
> usage from pending page table updates from getting out of control.
> Other drivers haven't needed this yet, but they will when they get
> native context support.

What are the cases where you did run into this, i.e. which application
in userspace hit this? Was it the CTS, some game, something else?
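The sync-as-async pattern described here can be pictured roughly as
follows (a sketch with made-up function and argument names, not nouveau's
actual API): a "synchronous" VM_BIND is simply an asynchronous bind whose
out-fence the kernel waits on before returning to userspace.

/*
 * Illustrative only: implement sync VM_BIND on top of the async path by
 * waiting on the bind job's out-fence before returning.
 */
static int vm_bind_sync(struct drm_device *dev, struct vm_bind_args *args)
{
        struct dma_fence *out_fence;
        long lret;
        int ret;

        /* Queue the bind job exactly as the async ioctl path would. */
        ret = vm_bind_async(dev, args, &out_fence);
        if (ret)
                return ret;

        /* Block (interruptibly) until the bind job has executed. */
        lret = dma_fence_wait(out_fence, true);
        dma_fence_put(out_fence);

        return lret < 0 ? lret : 0;
}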
On Thu, May 15, 2025 at 10:23 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 09:15:08AM -0700, Rob Clark wrote:
> > Basically it is a way to throttle userspace to prevent it from OoM'ing
> > itself. (I suppose userspace could throttle itself, but it doesn't
> > really know how much pre-allocation will need to be done for pgtable
> > updates.)
>
> I assume you mean prevent a single process from OOM'ing itself by
> queuing up VM_BIND requests much faster than they can be completed and
> hence pre-allocations for page tables get out of control?

Yes
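To make the accounting concrete, a driver using this mechanism would
charge each bind job for its worst-case preallocation before pushing it.
A hypothetical sketch (msm_vm_bind_job and prealloc.count are illustrative
names, not actual msm code):

/*
 * Charge each VM_BIND job for the page-table pages preallocated on its
 * behalf, so pending-but-not-yet-run jobs cannot pin unbounded memory.
 */
static int msm_vm_bind_job_push(struct msm_vm_bind_job *job)
{
        /* Worst-case pgtable pages reserved for this bind. */
        job->base.enqueue_credits = job->prealloc.count;

        /*
         * With this series, push may block (interruptibly) until
         * enqueue_credit_count drops back below enqueue_credit_limit.
         */
        return drm_sched_entity_push_job(&job->base);
}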
On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <dakr@kernel.org> wrote:
> >
> > (Cc: Boris)
> >
> > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > For some context, other drivers have the concept of a "synchronous"
> > > VM_BIND ioctl which completes immediately, and drivers implement it
> > > by waiting for the whole thing to finish before returning.
> >
> > Nouveau implements sync by issuing a normal async VM_BIND and
> > subsequently waiting for the out-fence synchronously.
>
> As Connor mentioned, we'd prefer it to be async rather than blocking in
> normal cases; otherwise, with drm native context for using a native UMD
> in the guest VM, you'd be blocking the single host/VMM virglrenderer
> thread.
>
> The key is we want to keep it async in the normal cases, and not have
> weird edge case CTS tests blow up from being _too_ async ;-)

I really wonder why they don't blow up in nouveau, which also supports
full asynchronous VM_BIND. Mind sharing which tests blow up? :)

> > > But this doesn't work for native context, where everything has to be
> > > asynchronous, so we're trying a new approach where we instead submit
> > > an asynchronous bind for "normal" (non-sparse/driver internal)
> > > allocations and only attach its out-fence to the in-fence of
> > > subsequent submits to other queues.
> >
> > This is what nouveau does, and I think other drivers like Xe and
> > panthor do this as well.
>
> No one has added native context support for these drivers yet.

Huh? What exactly do you mean with "native context" then?

> > > Once you do this then you need a limit like this to prevent memory
> > > usage from pending page table updates from getting out of control.
> > > Other drivers haven't needed this yet, but they will when they get
> > > native context support.
> >
> > What are the cases where you did run into this, i.e. which application
> > in userspace hit this? Was it the CTS, some game, something else?
>
> CTS tests that do weird things with a massive # of small bind/unbinds.
> I wouldn't expect to hit the blocking case in the real world.

As mentioned above, can you please share them? I'd like to play around a
bit. :)

- Danilo
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index dc0e60d2c14b..c5f688362a34 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -580,11 +580,21 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
  * under common lock for the struct drm_sched_entity that was set up for
  * @sched_job in drm_sched_job_init().
  */
-void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
+int drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 {
 	struct drm_sched_entity *entity = sched_job->entity;
+	struct drm_gpu_scheduler *sched = sched_job->sched;
 	bool first;
 	ktime_t submit_ts;
+	int ret;
+
+	ret = wait_event_interruptible(
+			sched->job_scheduled,
+			atomic_read(&sched->enqueue_credit_count) <=
+			sched->enqueue_credit_limit);
+	if (ret)
+		return ret;
+	atomic_add(sched_job->enqueue_credits, &sched->enqueue_credit_count);
 
 	trace_drm_sched_job(sched_job, entity);
 	atomic_inc(entity->rq->sched->score);
@@ -609,7 +619,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 			spin_unlock(&entity->lock);
 
 			DRM_ERROR("Trying to push to a killed entity\n");
-			return;
+			return -EINVAL;
 		}
 
 		rq = entity->rq;
@@ -626,5 +636,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 
 		drm_sched_wakeup(sched);
 	}
+
+	return 0;
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9412bffa8c74..1102cca69cb4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1217,6 +1217,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
 
 	trace_drm_run_job(sched_job, entity);
 	fence = sched->ops->run_job(sched_job);
+	atomic_sub(sched_job->enqueue_credits, &sched->enqueue_credit_count);
 	complete_all(&entity->entity_idle);
 	drm_sched_fence_scheduled(s_fence, fence);
 
@@ -1253,6 +1254,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 
 	sched->ops = args->ops;
 	sched->credit_limit = args->credit_limit;
+	sched->enqueue_credit_limit = args->enqueue_credit_limit;
 	sched->name = args->name;
 	sched->timeout = args->timeout;
 	sched->hang_limit = args->hang_limit;
@@ -1308,6 +1310,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->credit_count, 0);
+	atomic_set(&sched->enqueue_credit_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
 	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
 	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index da64232c989d..d830ffe083f1 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -329,6 +329,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
  * @s_fence: contains the fences for the scheduling of job.
  * @finish_cb: the callback for the finished fence.
  * @credits: the number of credits this job contributes to the scheduler
+ * @enqueue_credits: the number of enqueue credits this job contributes
  * @work: Helper to reschedule job kill to different context.
  * @id: a unique id assigned to each job scheduled on the scheduler.
  * @karma: increment on every hang caused by this job. If this exceeds the hang
@@ -366,6 +367,7 @@ struct drm_sched_job {
 
 	enum drm_sched_priority	s_priority;
 	u32			credits;
+	u32			enqueue_credits;
 	/** @last_dependency: tracks @dependencies as they signal */
 	unsigned int		last_dependency;
 	atomic_t		karma;
@@ -485,6 +487,10 @@ struct drm_sched_backend_ops {
  * @ops: backend operations provided by the driver.
  * @credit_limit: the credit limit of this scheduler
  * @credit_count: the current credit count of this scheduler
+ * @enqueue_credit_limit: the credit limit of jobs pushed to scheduler and not
+ *	yet run
+ * @enqueue_credit_count: the current credit count of jobs pushed to scheduler
+ *	but not yet run
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
@@ -518,6 +524,8 @@ struct drm_gpu_scheduler {
 	const struct drm_sched_backend_ops	*ops;
 	u32				credit_limit;
 	atomic_t			credit_count;
+	u32				enqueue_credit_limit;
+	atomic_t			enqueue_credit_count;
 	long				timeout;
 	const char			*name;
 	u32				num_rqs;
@@ -550,6 +558,8 @@ struct drm_gpu_scheduler {
  * @num_rqs: Number of run-queues. This may be at most DRM_SCHED_PRIORITY_COUNT,
  *	as there's usually one run-queue per priority, but may be less.
  * @credit_limit: the number of credits this scheduler can hold from all jobs
+ * @enqueue_credit_limit: the number of credits that can be enqueued before
+ *	drm_sched_entity_push_job() blocks
  * @hang_limit: number of times to allow a job to hang before dropping it.
  *	This mechanism is DEPRECATED. Set it to 0.
  * @timeout: timeout value in jiffies for submitted jobs.
@@ -564,6 +574,7 @@ struct drm_sched_init_args {
 	struct workqueue_struct *timeout_wq;
 	u32 num_rqs;
 	u32 credit_limit;
+	u32 enqueue_credit_limit;
 	unsigned int hang_limit;
 	long timeout;
 	atomic_t *score;
@@ -600,7 +611,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
 		       struct drm_sched_entity *entity,
 		       u32 credits, void *owner);
 void drm_sched_job_arm(struct drm_sched_job *job);
-void drm_sched_entity_push_job(struct drm_sched_job *sched_job);
+int drm_sched_entity_push_job(struct drm_sched_job *sched_job);
 int drm_sched_job_add_dependency(struct drm_sched_job *job,
 				 struct dma_fence *fence);
 int drm_sched_job_add_syncobj_dependency(struct drm_sched_job *job,
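For completeness, opting in at init time would look something like the
sketch below (limit values and the my_vm_bind_ops name are illustrative
assumptions, not part of the patch). Jobs that leave enqueue_credits at 0
never contribute to the count, so existing drivers remain unaffected:

/*
 * Illustrative init-time opt-in: pass an enqueue_credit_limit alongside
 * the existing credit_limit when creating the scheduler.
 */
static int my_sched_create(struct drm_gpu_scheduler *sched, struct device *dev)
{
        const struct drm_sched_init_args args = {
                .ops = &my_vm_bind_ops,         /* hypothetical backend ops */
                .num_rqs = DRM_SCHED_PRIORITY_COUNT,
                .credit_limit = 64,             /* running-job credits */
                .enqueue_credit_limit = 1024,   /* queued-but-not-run credits */
                .timeout = MAX_SCHEDULE_TIMEOUT,
                .name = "vm_bind",
                .dev = dev,
        };

        return drm_sched_init(sched, &args);
}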