Message ID: 20210728010632.2633470-13-robdclark@gmail.com
State: New
Series: drm/msm: drm scheduler conversion and cleanups
Hi Rob, On 28/07/2021 02:06, Rob Clark wrote: > From: Rob Clark <robdclark@chromium.org> > > The drm/scheduler provides additional prioritization on top of that > provided by however many number of ringbuffers (each with their own > priority level) is supported on a given generation. Expose the > additional levels of priority to userspace and map the userspace > priority back to ring (first level of priority) and schedular priority > (additional priority levels within the ring). > > Signed-off-by: Rob Clark <robdclark@chromium.org> > Acked-by: Christian König <christian.koenig@amd.com> > --- > drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +- > drivers/gpu/drm/msm/msm_gem_submit.c | 4 +- > drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++- > drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++-------- > include/uapi/drm/msm_drm.h | 14 +++++- > 5 files changed, 88 insertions(+), 27 deletions(-) > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > index bad4809b68ef..748665232d29 100644 > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value) > return ret; > } > return -EINVAL; > - case MSM_PARAM_NR_RINGS: > - *value = gpu->nr_rings; > + case MSM_PARAM_PRIORITIES: > + *value = gpu->nr_rings * NR_SCHED_PRIORITIES; > return 0; > case MSM_PARAM_PP_PGTABLE: > *value = 0; > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c > index 450efe59abb5..c2ecec5b11c4 100644 > --- a/drivers/gpu/drm/msm/msm_gem_submit.c > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c > @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev, > submit->gpu = gpu; > submit->cmd = (void *)&submit->bos[nr_bos]; > submit->queue = queue; > - submit->ring = gpu->rb[queue->prio]; > + submit->ring = gpu->rb[queue->ring_nr]; > submit->fault_dumped = false; > > 
INIT_LIST_HEAD(&submit->node); > @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data, > /* Get a unique identifier for the submission for logging purposes */ > submitid = atomic_inc_return(&ident) - 1; > > - ring = gpu->rb[queue->prio]; > + ring = gpu->rb[queue->ring_nr]; > trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid, > args->nr_bos, args->nr_cmds); > > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h > index b912cacaecc0..0e4b45bff2e6 100644 > --- a/drivers/gpu/drm/msm/msm_gpu.h > +++ b/drivers/gpu/drm/msm/msm_gpu.h > @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr { > const char *name; > }; > > +/* > + * The number of priority levels provided by drm gpu scheduler. The > + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some > + * cases, so we don't use it (no need for kernel generated jobs). > + */ > +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN) > + > +/** > + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority > + * > + * @gpu: the gpu instance > + * @prio: the userspace priority level > + * @ring_nr: [out] the ringbuffer the userspace priority maps to > + * @sched_prio: [out] the gpu scheduler priority level which the userspace > + * priority maps to > + * > + * With drm/scheduler providing it's own level of prioritization, our total > + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES). > + * Each ring is associated with it's own scheduler instance. However, our > + * UABI is that lower numerical values are higher priority. So mapping the > + * single userspace priority level into ring_nr and sched_prio takes some > + * care. 
The userspace provided priority (when a submitqueue is created) > + * is mapped to ring nr and scheduler priority as such: > + * > + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES > + * sched_prio = NR_SCHED_PRIORITIES - > + * (userspace_prio % NR_SCHED_PRIORITIES) - 1 > + * > + * This allows generations without preemption (nr_rings==1) to have some > + * amount of prioritization, and provides more priority levels for gens > + * that do have preemption. I am exploring how different drivers handle priority levels and this caught my eye. Is the implication of the last paragraphs that on hw with nr_rings > 1, ring + 1 preempts ring? If so I am wondering does the "spreading" of user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable levels within every "bucket" or how does that work? Regards, Tvrtko > + */ > +static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio, > + unsigned *ring_nr, enum drm_sched_priority *sched_prio) > +{ > + unsigned rn, sp; > + > + rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp); > + > + /* invert sched priority to map to higher-numeric-is-higher- > + * priority convention > + */ > + sp = NR_SCHED_PRIORITIES - sp - 1; > + > + if (rn >= gpu->nr_rings) > + return -EINVAL; > + > + *ring_nr = rn; > + *sched_prio = sp; > + > + return 0; > +} > + > /** > * A submitqueue is associated with a gl context or vk queue (or equiv) > * in userspace. > @@ -257,7 +310,8 @@ struct msm_gpu_perfcntr { > * @id: userspace id for the submitqueue, unique within the drm_file > * @flags: userspace flags for the submitqueue, specified at creation > * (currently unusued) > - * @prio: the submitqueue priority > + * @ring_nr: the ringbuffer used by this submitqueue, which is determined > + * by the submitqueue's priority > * @faults: the number of GPU hangs associated with this submitqueue > * @ctx: the per-drm_file context associated with the submitqueue (ie. 
> * which set of pgtables do submits jobs associated with the > @@ -272,7 +326,7 @@ struct msm_gpu_perfcntr { > struct msm_gpu_submitqueue { > int id; > u32 flags; > - u32 prio; > + u32 ring_nr; > int faults; > struct msm_file_private *ctx; > struct list_head node; > diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c > index 682ba2a7c0ec..32a55d81b58b 100644 > --- a/drivers/gpu/drm/msm/msm_submitqueue.c > +++ b/drivers/gpu/drm/msm/msm_submitqueue.c > @@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, > struct msm_gpu_submitqueue *queue; > struct msm_ringbuffer *ring; > struct drm_gpu_scheduler *sched; > + enum drm_sched_priority sched_prio; > + unsigned ring_nr; > int ret; > > if (!ctx) > @@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, > if (!priv->gpu) > return -ENODEV; > > - if (prio >= priv->gpu->nr_rings) > - return -EINVAL; > + ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio); > + if (ret) > + return ret; > > queue = kzalloc(sizeof(*queue), GFP_KERNEL); > > @@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, > > kref_init(&queue->ref); > queue->flags = flags; > - queue->prio = prio; > + queue->ring_nr = ring_nr; > > - ring = priv->gpu->rb[prio]; > + ring = priv->gpu->rb[ring_nr]; > sched = &ring->sched; > > - /* > - * TODO we can allow more priorities than we have ringbuffers by > - * mapping: > - * > - * ring = prio / 3; > - * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3); > - * > - * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is > - * treated specially in places. 
> - */ > ret = drm_sched_entity_init(&queue->entity, > - DRM_SCHED_PRIORITY_NORMAL, > - &sched, 1, NULL); > + sched_prio, &sched, 1, NULL); > if (ret) { > kfree(queue); > return ret; > @@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, > int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx) > { > struct msm_drm_private *priv = drm->dev_private; > - int default_prio; > + int default_prio, max_priority; > > if (!priv->gpu) > return -ENODEV; > > + max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1; > + > /* > - * Select priority 2 as the "default priority" unless nr_rings is less > - * than 2 and then pick the lowest priority > + * Pick a medium priority level as default. Lower numeric value is > + * higher priority, so round-up to pick a priority that is not higher > + * than the middle priority level. > */ > - default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1); > + default_prio = DIV_ROUND_UP(max_priority, 2); > > INIT_LIST_HEAD(&ctx->submitqueues); > > diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h > index f075851021c3..6b8fffc28a50 100644 > --- a/include/uapi/drm/msm_drm.h > +++ b/include/uapi/drm/msm_drm.h > @@ -73,11 +73,19 @@ struct drm_msm_timespec { > #define MSM_PARAM_MAX_FREQ 0x04 > #define MSM_PARAM_TIMESTAMP 0x05 > #define MSM_PARAM_GMEM_BASE 0x06 > -#define MSM_PARAM_NR_RINGS 0x07 > +#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */ > #define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */ > #define MSM_PARAM_FAULTS 0x09 > #define MSM_PARAM_SUSPENDS 0x0a > > +/* For backwards compat. The original support for preemption was based on > + * a single ring per priority level so # of priority levels equals the # > + * of rings. With drm/scheduler providing additional levels of priority, > + * the number of priorities is greater than the # of rings. The param is > + * renamed to better reflect this. 
> + */ > +#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES > + > struct drm_msm_param { > __u32 pipe; /* in, MSM_PIPE_x */ > __u32 param; /* in, MSM_PARAM_x */ > @@ -304,6 +312,10 @@ struct drm_msm_gem_madvise { > > #define MSM_SUBMITQUEUE_FLAGS (0) > > +/* > + * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1, > + * a lower numeric value is higher priority. > + */ > struct drm_msm_submitqueue { > __u32 flags; /* in, MSM_SUBMITQUEUE_x */ > __u32 prio; /* in, Priority level */
On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote: > > > Hi Rob, > > On 28/07/2021 02:06, Rob Clark wrote: > > From: Rob Clark <robdclark@chromium.org> > > > > The drm/scheduler provides additional prioritization on top of that > > provided by however many number of ringbuffers (each with their own > > priority level) is supported on a given generation. Expose the > > additional levels of priority to userspace and map the userspace > > priority back to ring (first level of priority) and schedular priority > > (additional priority levels within the ring). > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > Acked-by: Christian König <christian.koenig@amd.com> > > --- > > drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +- > > drivers/gpu/drm/msm/msm_gem_submit.c | 4 +- > > drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++- > > drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++-------- > > include/uapi/drm/msm_drm.h | 14 +++++- > > 5 files changed, 88 insertions(+), 27 deletions(-) > > > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > index bad4809b68ef..748665232d29 100644 > > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > > @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value) > > return ret; > > } > > return -EINVAL; > > - case MSM_PARAM_NR_RINGS: > > - *value = gpu->nr_rings; > > + case MSM_PARAM_PRIORITIES: > > + *value = gpu->nr_rings * NR_SCHED_PRIORITIES; > > return 0; > > case MSM_PARAM_PP_PGTABLE: > > *value = 0; > > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c > > index 450efe59abb5..c2ecec5b11c4 100644 > > --- a/drivers/gpu/drm/msm/msm_gem_submit.c > > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c > > @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev, > > submit->gpu = gpu; > > submit->cmd 
= (void *)&submit->bos[nr_bos]; > > submit->queue = queue; > > - submit->ring = gpu->rb[queue->prio]; > > + submit->ring = gpu->rb[queue->ring_nr]; > > submit->fault_dumped = false; > > > > INIT_LIST_HEAD(&submit->node); > > @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data, > > /* Get a unique identifier for the submission for logging purposes */ > > submitid = atomic_inc_return(&ident) - 1; > > > > - ring = gpu->rb[queue->prio]; > > + ring = gpu->rb[queue->ring_nr]; > > trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid, > > args->nr_bos, args->nr_cmds); > > > > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h > > index b912cacaecc0..0e4b45bff2e6 100644 > > --- a/drivers/gpu/drm/msm/msm_gpu.h > > +++ b/drivers/gpu/drm/msm/msm_gpu.h > > @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr { > > const char *name; > > }; > > > > +/* > > + * The number of priority levels provided by drm gpu scheduler. The > > + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some > > + * cases, so we don't use it (no need for kernel generated jobs). > > + */ > > +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN) > > + > > +/** > > + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority > > + * > > + * @gpu: the gpu instance > > + * @prio: the userspace priority level > > + * @ring_nr: [out] the ringbuffer the userspace priority maps to > > + * @sched_prio: [out] the gpu scheduler priority level which the userspace > > + * priority maps to > > + * > > + * With drm/scheduler providing it's own level of prioritization, our total > > + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES). > > + * Each ring is associated with it's own scheduler instance. However, our > > + * UABI is that lower numerical values are higher priority. So mapping the > > + * single userspace priority level into ring_nr and sched_prio takes some > > + * care. 
The userspace provided priority (when a submitqueue is created) > > + * is mapped to ring nr and scheduler priority as such: > > + * > > + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES > > + * sched_prio = NR_SCHED_PRIORITIES - > > + * (userspace_prio % NR_SCHED_PRIORITIES) - 1 > > + * > > + * This allows generations without preemption (nr_rings==1) to have some > > + * amount of prioritization, and provides more priority levels for gens > > + * that do have preemption. > > I am exploring how different drivers handle priority levels and this > caught my eye. > > Is the implication of the last paragraphs that on hw with nr_rings > 1, > ring + 1 preempts ring? Other way around, at least from the uabi standpoint. Ie. ring[0] preempts ring[1] > If so I am wondering does the "spreading" of > user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable > levels within every "bucket" or how does that work? So, preemption is possible between any priority level before run_job() gets called, which writes the job into the ringbuffer. After that point, you only have "bucket" level preemption, because NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO ringbuffer. ----- btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm trying to add an igt test to stress shrinker/eviction, similar to the existing tests/i915/gem_shrink.c. But we hit an unfortunate combination of circumstances: 1. Pinning memory happens in the synchronous part of the submit ioctl, before enqueuing the job for the kthread to handle. 2. The first run_job() callback incurs a slight delay (~1.5ms) while resuming the GPU 3. Because of that delay, userspace has a chance to queue up enough more jobs to require locking/pinning more than the available system RAM.. I'm not sure if we want a way to prevent userspace from getting *too* far ahead of the kthread. Or maybe at some point the shrinker should sleep on non-idle buffers? 
BR, -R > > Regards, > > Tvrtko > > > + */ > > +static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio, > > + unsigned *ring_nr, enum drm_sched_priority *sched_prio) > > +{ > > + unsigned rn, sp; > > + > > + rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp); > > + > > + /* invert sched priority to map to higher-numeric-is-higher- > > + * priority convention > > + */ > > + sp = NR_SCHED_PRIORITIES - sp - 1; > > + > > + if (rn >= gpu->nr_rings) > > + return -EINVAL; > > + > > + *ring_nr = rn; > > + *sched_prio = sp; > > + > > + return 0; > > +} > > + > > /** > > * A submitqueue is associated with a gl context or vk queue (or equiv) > > * in userspace. > > @@ -257,7 +310,8 @@ struct msm_gpu_perfcntr { > > * @id: userspace id for the submitqueue, unique within the drm_file > > * @flags: userspace flags for the submitqueue, specified at creation > > * (currently unusued) > > - * @prio: the submitqueue priority > > + * @ring_nr: the ringbuffer used by this submitqueue, which is determined > > + * by the submitqueue's priority > > * @faults: the number of GPU hangs associated with this submitqueue > > * @ctx: the per-drm_file context associated with the submitqueue (ie. 
> > * which set of pgtables do submits jobs associated with the > > @@ -272,7 +326,7 @@ struct msm_gpu_perfcntr { > > struct msm_gpu_submitqueue { > > int id; > > u32 flags; > > - u32 prio; > > + u32 ring_nr; > > int faults; > > struct msm_file_private *ctx; > > struct list_head node; > > diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c > > index 682ba2a7c0ec..32a55d81b58b 100644 > > --- a/drivers/gpu/drm/msm/msm_submitqueue.c > > +++ b/drivers/gpu/drm/msm/msm_submitqueue.c > > @@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, > > struct msm_gpu_submitqueue *queue; > > struct msm_ringbuffer *ring; > > struct drm_gpu_scheduler *sched; > > + enum drm_sched_priority sched_prio; > > + unsigned ring_nr; > > int ret; > > > > if (!ctx) > > @@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, > > if (!priv->gpu) > > return -ENODEV; > > > > - if (prio >= priv->gpu->nr_rings) > > - return -EINVAL; > > + ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio); > > + if (ret) > > + return ret; > > > > queue = kzalloc(sizeof(*queue), GFP_KERNEL); > > > > @@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, > > > > kref_init(&queue->ref); > > queue->flags = flags; > > - queue->prio = prio; > > + queue->ring_nr = ring_nr; > > > > - ring = priv->gpu->rb[prio]; > > + ring = priv->gpu->rb[ring_nr]; > > sched = &ring->sched; > > > > - /* > > - * TODO we can allow more priorities than we have ringbuffers by > > - * mapping: > > - * > > - * ring = prio / 3; > > - * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3); > > - * > > - * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is > > - * treated specially in places. 
> > - */ > > ret = drm_sched_entity_init(&queue->entity, > > - DRM_SCHED_PRIORITY_NORMAL, > > - &sched, 1, NULL); > > + sched_prio, &sched, 1, NULL); > > if (ret) { > > kfree(queue); > > return ret; > > @@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, > > int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx) > > { > > struct msm_drm_private *priv = drm->dev_private; > > - int default_prio; > > + int default_prio, max_priority; > > > > if (!priv->gpu) > > return -ENODEV; > > > > + max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1; > > + > > /* > > - * Select priority 2 as the "default priority" unless nr_rings is less > > - * than 2 and then pick the lowest priority > > + * Pick a medium priority level as default. Lower numeric value is > > + * higher priority, so round-up to pick a priority that is not higher > > + * than the middle priority level. > > */ > > - default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1); > > + default_prio = DIV_ROUND_UP(max_priority, 2); > > > > INIT_LIST_HEAD(&ctx->submitqueues); > > > > diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h > > index f075851021c3..6b8fffc28a50 100644 > > --- a/include/uapi/drm/msm_drm.h > > +++ b/include/uapi/drm/msm_drm.h > > @@ -73,11 +73,19 @@ struct drm_msm_timespec { > > #define MSM_PARAM_MAX_FREQ 0x04 > > #define MSM_PARAM_TIMESTAMP 0x05 > > #define MSM_PARAM_GMEM_BASE 0x06 > > -#define MSM_PARAM_NR_RINGS 0x07 > > +#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */ > > #define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */ > > #define MSM_PARAM_FAULTS 0x09 > > #define MSM_PARAM_SUSPENDS 0x0a > > > > +/* For backwards compat. The original support for preemption was based on > > + * a single ring per priority level so # of priority levels equals the # > > + * of rings. 
With drm/scheduler providing additional levels of priority, > > + * the number of priorities is greater than the # of rings. The param is > > + * renamed to better reflect this. > > + */ > > +#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES > > + > > struct drm_msm_param { > > __u32 pipe; /* in, MSM_PIPE_x */ > > __u32 param; /* in, MSM_PARAM_x */ > > @@ -304,6 +312,10 @@ struct drm_msm_gem_madvise { > > > > #define MSM_SUBMITQUEUE_FLAGS (0) > > > > +/* > > + * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1, > > + * a lower numeric value is higher priority. > > + */ > > struct drm_msm_submitqueue { > > __u32 flags; /* in, MSM_SUBMITQUEUE_x */ > > __u32 prio; /* in, Priority level */
On 23/05/2022 23:53, Rob Clark wrote: > On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin > <tvrtko.ursulin@linux.intel.com> wrote: >> >> >> Hi Rob, >> >> On 28/07/2021 02:06, Rob Clark wrote: >>> From: Rob Clark <robdclark@chromium.org> >>> >>> The drm/scheduler provides additional prioritization on top of that >>> provided by however many number of ringbuffers (each with their own >>> priority level) is supported on a given generation. Expose the >>> additional levels of priority to userspace and map the userspace >>> priority back to ring (first level of priority) and schedular priority >>> (additional priority levels within the ring). >>> >>> Signed-off-by: Rob Clark <robdclark@chromium.org> >>> Acked-by: Christian König <christian.koenig@amd.com> >>> --- >>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +- >>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +- >>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++- >>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++-------- >>> include/uapi/drm/msm_drm.h | 14 +++++- >>> 5 files changed, 88 insertions(+), 27 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c >>> index bad4809b68ef..748665232d29 100644 >>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c >>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c >>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value) >>> return ret; >>> } >>> return -EINVAL; >>> - case MSM_PARAM_NR_RINGS: >>> - *value = gpu->nr_rings; >>> + case MSM_PARAM_PRIORITIES: >>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES; >>> return 0; >>> case MSM_PARAM_PP_PGTABLE: >>> *value = 0; >>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c >>> index 450efe59abb5..c2ecec5b11c4 100644 >>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c >>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c >>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct 
drm_device *dev, >>> submit->gpu = gpu; >>> submit->cmd = (void *)&submit->bos[nr_bos]; >>> submit->queue = queue; >>> - submit->ring = gpu->rb[queue->prio]; >>> + submit->ring = gpu->rb[queue->ring_nr]; >>> submit->fault_dumped = false; >>> >>> INIT_LIST_HEAD(&submit->node); >>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data, >>> /* Get a unique identifier for the submission for logging purposes */ >>> submitid = atomic_inc_return(&ident) - 1; >>> >>> - ring = gpu->rb[queue->prio]; >>> + ring = gpu->rb[queue->ring_nr]; >>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid, >>> args->nr_bos, args->nr_cmds); >>> >>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h >>> index b912cacaecc0..0e4b45bff2e6 100644 >>> --- a/drivers/gpu/drm/msm/msm_gpu.h >>> +++ b/drivers/gpu/drm/msm/msm_gpu.h >>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr { >>> const char *name; >>> }; >>> >>> +/* >>> + * The number of priority levels provided by drm gpu scheduler. The >>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some >>> + * cases, so we don't use it (no need for kernel generated jobs). >>> + */ >>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN) >>> + >>> +/** >>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority >>> + * >>> + * @gpu: the gpu instance >>> + * @prio: the userspace priority level >>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to >>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace >>> + * priority maps to >>> + * >>> + * With drm/scheduler providing it's own level of prioritization, our total >>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES). >>> + * Each ring is associated with it's own scheduler instance. However, our >>> + * UABI is that lower numerical values are higher priority. 
So mapping the >>> + * single userspace priority level into ring_nr and sched_prio takes some >>> + * care. The userspace provided priority (when a submitqueue is created) >>> + * is mapped to ring nr and scheduler priority as such: >>> + * >>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES >>> + * sched_prio = NR_SCHED_PRIORITIES - >>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1 >>> + * >>> + * This allows generations without preemption (nr_rings==1) to have some >>> + * amount of prioritization, and provides more priority levels for gens >>> + * that do have preemption. >> >> I am exploring how different drivers handle priority levels and this >> caught my eye. >> >> Is the implication of the last paragraphs that on hw with nr_rings > 1, >> ring + 1 preempts ring? > > Other way around, at least from the uabi standpoint. Ie. ring[0] > preempts ring[1] Ah yes, I figure it out from the comments but then confused myself when writing the email. >> If so I am wondering does the "spreading" of >> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable >> levels within every "bucket" or how does that work? > > So, preemption is possible between any priority level before run_job() > gets called, which writes the job into the ringbuffer. After that Hmm how? Before run_job() the jobs are not runnable, sitting in the scheduler queues, right? > point, you only have "bucket" level preemption, because > NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO > ringbuffer. Right, and you have one GPU with four rings, which means you expose 12 priority levels to userspace, did I get that right? If so how do you convey in the ABI that not all there priority levels are equal? Like userspace can submit at prio 4 and expect prio 3 to preempt, as would prio 2 preempt prio 3. While actual behaviour will not match - 3 will not preempt 4. Also, does your userspace stack (EGL/Vulkan) use the priorities? 
I had a quick peek in Mesa but did not spot it - although I am not really at home there yet so maybe I missed it. > ----- > > btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm > trying to add an igt test to stress shrinker/eviction, similar to the > existing tests/i915/gem_shrink.c. But we hit an unfortunate > combination of circumstances: > 1. Pinning memory happens in the synchronous part of the submit ioctl, > before enqueuing the job for the kthread to handle. > 2. The first run_job() callback incurs a slight delay (~1.5ms) while > resuming the GPU > 3. Because of that delay, userspace has a chance to queue up enough > more jobs to require locking/pinning more than the available system > RAM.. Is that one or multiple threads submitting jobs? > I'm not sure if we want a way to prevent userspace from getting *too* > far ahead of the kthread. Or maybe at some point the shrinker should > sleep on non-idle buffers? On the direct reclaim path when invoked from the submit ioctl? In i915 we only shrink idle objects on direct reclaim and leave active ones for the swapper. It depends on how your locking looks like whether you could do them, whether there would be coupling of locks and fs-reclaim context. Regards, Tvrtko > BR, > -R > >> >> Regards, >> >> Tvrtko >> >>> + */ >>> +static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio, >>> + unsigned *ring_nr, enum drm_sched_priority *sched_prio) >>> +{ >>> + unsigned rn, sp; >>> + >>> + rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp); >>> + >>> + /* invert sched priority to map to higher-numeric-is-higher- >>> + * priority convention >>> + */ >>> + sp = NR_SCHED_PRIORITIES - sp - 1; >>> + >>> + if (rn >= gpu->nr_rings) >>> + return -EINVAL; >>> + >>> + *ring_nr = rn; >>> + *sched_prio = sp; >>> + >>> + return 0; >>> +} >>> + >>> /** >>> * A submitqueue is associated with a gl context or vk queue (or equiv) >>> * in userspace. 
>>> @@ -257,7 +310,8 @@ struct msm_gpu_perfcntr { >>> * @id: userspace id for the submitqueue, unique within the drm_file >>> * @flags: userspace flags for the submitqueue, specified at creation >>> * (currently unusued) >>> - * @prio: the submitqueue priority >>> + * @ring_nr: the ringbuffer used by this submitqueue, which is determined >>> + * by the submitqueue's priority >>> * @faults: the number of GPU hangs associated with this submitqueue >>> * @ctx: the per-drm_file context associated with the submitqueue (ie. >>> * which set of pgtables do submits jobs associated with the >>> @@ -272,7 +326,7 @@ struct msm_gpu_perfcntr { >>> struct msm_gpu_submitqueue { >>> int id; >>> u32 flags; >>> - u32 prio; >>> + u32 ring_nr; >>> int faults; >>> struct msm_file_private *ctx; >>> struct list_head node; >>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c >>> index 682ba2a7c0ec..32a55d81b58b 100644 >>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c >>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c >>> @@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, >>> struct msm_gpu_submitqueue *queue; >>> struct msm_ringbuffer *ring; >>> struct drm_gpu_scheduler *sched; >>> + enum drm_sched_priority sched_prio; >>> + unsigned ring_nr; >>> int ret; >>> >>> if (!ctx) >>> @@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, >>> if (!priv->gpu) >>> return -ENODEV; >>> >>> - if (prio >= priv->gpu->nr_rings) >>> - return -EINVAL; >>> + ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio); >>> + if (ret) >>> + return ret; >>> >>> queue = kzalloc(sizeof(*queue), GFP_KERNEL); >>> >>> @@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, >>> >>> kref_init(&queue->ref); >>> queue->flags = flags; >>> - queue->prio = prio; >>> + queue->ring_nr = ring_nr; >>> >>> - ring = priv->gpu->rb[prio]; >>> + ring 
= priv->gpu->rb[ring_nr]; >>> sched = &ring->sched; >>> >>> - /* >>> - * TODO we can allow more priorities than we have ringbuffers by >>> - * mapping: >>> - * >>> - * ring = prio / 3; >>> - * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3); >>> - * >>> - * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is >>> - * treated specially in places. >>> - */ >>> ret = drm_sched_entity_init(&queue->entity, >>> - DRM_SCHED_PRIORITY_NORMAL, >>> - &sched, 1, NULL); >>> + sched_prio, &sched, 1, NULL); >>> if (ret) { >>> kfree(queue); >>> return ret; >>> @@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, >>> int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx) >>> { >>> struct msm_drm_private *priv = drm->dev_private; >>> - int default_prio; >>> + int default_prio, max_priority; >>> >>> if (!priv->gpu) >>> return -ENODEV; >>> >>> + max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1; >>> + >>> /* >>> - * Select priority 2 as the "default priority" unless nr_rings is less >>> - * than 2 and then pick the lowest priority >>> + * Pick a medium priority level as default. Lower numeric value is >>> + * higher priority, so round-up to pick a priority that is not higher >>> + * than the middle priority level. 
>>> */ >>> - default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1); >>> + default_prio = DIV_ROUND_UP(max_priority, 2); >>> >>> INIT_LIST_HEAD(&ctx->submitqueues); >>> >>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h >>> index f075851021c3..6b8fffc28a50 100644 >>> --- a/include/uapi/drm/msm_drm.h >>> +++ b/include/uapi/drm/msm_drm.h >>> @@ -73,11 +73,19 @@ struct drm_msm_timespec { >>> #define MSM_PARAM_MAX_FREQ 0x04 >>> #define MSM_PARAM_TIMESTAMP 0x05 >>> #define MSM_PARAM_GMEM_BASE 0x06 >>> -#define MSM_PARAM_NR_RINGS 0x07 >>> +#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */ >>> #define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */ >>> #define MSM_PARAM_FAULTS 0x09 >>> #define MSM_PARAM_SUSPENDS 0x0a >>> >>> +/* For backwards compat. The original support for preemption was based on >>> + * a single ring per priority level so # of priority levels equals the # >>> + * of rings. With drm/scheduler providing additional levels of priority, >>> + * the number of priorities is greater than the # of rings. The param is >>> + * renamed to better reflect this. >>> + */ >>> +#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES >>> + >>> struct drm_msm_param { >>> __u32 pipe; /* in, MSM_PIPE_x */ >>> __u32 param; /* in, MSM_PARAM_x */ >>> @@ -304,6 +312,10 @@ struct drm_msm_gem_madvise { >>> >>> #define MSM_SUBMITQUEUE_FLAGS (0) >>> >>> +/* >>> + * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1, >>> + * a lower numeric value is higher priority. >>> + */ >>> struct drm_msm_submitqueue { >>> __u32 flags; /* in, MSM_SUBMITQUEUE_x */ >>> __u32 prio; /* in, Priority level */
On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote: > > > On 23/05/2022 23:53, Rob Clark wrote: > > On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin > > <tvrtko.ursulin@linux.intel.com> wrote: > >> > >> > >> Hi Rob, > >> > >> On 28/07/2021 02:06, Rob Clark wrote: > >>> From: Rob Clark <robdclark@chromium.org> > >>> > >>> The drm/scheduler provides additional prioritization on top of that > >>> provided by however many number of ringbuffers (each with their own > >>> priority level) is supported on a given generation. Expose the > >>> additional levels of priority to userspace and map the userspace > >>> priority back to ring (first level of priority) and schedular priority > >>> (additional priority levels within the ring). > >>> > >>> Signed-off-by: Rob Clark <robdclark@chromium.org> > >>> Acked-by: Christian König <christian.koenig@amd.com> > >>> --- > >>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +- > >>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +- > >>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++- > >>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++-------- > >>> include/uapi/drm/msm_drm.h | 14 +++++- > >>> 5 files changed, 88 insertions(+), 27 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > >>> index bad4809b68ef..748665232d29 100644 > >>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > >>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > >>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value) > >>> return ret; > >>> } > >>> return -EINVAL; > >>> - case MSM_PARAM_NR_RINGS: > >>> - *value = gpu->nr_rings; > >>> + case MSM_PARAM_PRIORITIES: > >>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES; > >>> return 0; > >>> case MSM_PARAM_PP_PGTABLE: > >>> *value = 0; > >>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c > >>> index 450efe59abb5..c2ecec5b11c4 
100644 > >>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c > >>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c > >>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev, > >>> submit->gpu = gpu; > >>> submit->cmd = (void *)&submit->bos[nr_bos]; > >>> submit->queue = queue; > >>> - submit->ring = gpu->rb[queue->prio]; > >>> + submit->ring = gpu->rb[queue->ring_nr]; > >>> submit->fault_dumped = false; > >>> > >>> INIT_LIST_HEAD(&submit->node); > >>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data, > >>> /* Get a unique identifier for the submission for logging purposes */ > >>> submitid = atomic_inc_return(&ident) - 1; > >>> > >>> - ring = gpu->rb[queue->prio]; > >>> + ring = gpu->rb[queue->ring_nr]; > >>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid, > >>> args->nr_bos, args->nr_cmds); > >>> > >>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h > >>> index b912cacaecc0..0e4b45bff2e6 100644 > >>> --- a/drivers/gpu/drm/msm/msm_gpu.h > >>> +++ b/drivers/gpu/drm/msm/msm_gpu.h > >>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr { > >>> const char *name; > >>> }; > >>> > >>> +/* > >>> + * The number of priority levels provided by drm gpu scheduler. The > >>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some > >>> + * cases, so we don't use it (no need for kernel generated jobs). 
> >>> + */ > >>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN) > >>> + > >>> +/** > >>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority > >>> + * > >>> + * @gpu: the gpu instance > >>> + * @prio: the userspace priority level > >>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to > >>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace > >>> + * priority maps to > >>> + * > >>> + * With drm/scheduler providing it's own level of prioritization, our total > >>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES). > >>> + * Each ring is associated with it's own scheduler instance. However, our > >>> + * UABI is that lower numerical values are higher priority. So mapping the > >>> + * single userspace priority level into ring_nr and sched_prio takes some > >>> + * care. The userspace provided priority (when a submitqueue is created) > >>> + * is mapped to ring nr and scheduler priority as such: > >>> + * > >>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES > >>> + * sched_prio = NR_SCHED_PRIORITIES - > >>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1 > >>> + * > >>> + * This allows generations without preemption (nr_rings==1) to have some > >>> + * amount of prioritization, and provides more priority levels for gens > >>> + * that do have preemption. > >> > >> I am exploring how different drivers handle priority levels and this > >> caught my eye. > >> > >> Is the implication of the last paragraphs that on hw with nr_rings > 1, > >> ring + 1 preempts ring? > > > > Other way around, at least from the uabi standpoint. Ie. ring[0] > > preempts ring[1] > > Ah yes, I figure it out from the comments but then confused myself when > writing the email. > > >> If so I am wondering does the "spreading" of > >> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable > >> levels within every "bucket" or how does that work? 
> > > > So, preemption is possible between any priority level before run_job() > > gets called, which writes the job into the ringbuffer. After that > > Hmm how? Before run_job() the jobs are not runnable, sitting in the > scheduler queues, right? I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on prio[1] could be executed after submit B on prio[2] provided that run_job(submitA) hasn't happened yet. So I guess it isn't "really" preemption because the submit hasn't started running on the GPU yet. But rather just scheduling according to priority. > > point, you only have "bucket" level preemption, because > > NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO > > ringbuffer. > > Right, and you have one GPU with four rings, which means you expose 12 > priority levels to userspace, did I get that right? Correct > If so how do you convey in the ABI that not all there priority levels > are equal? Like userspace can submit at prio 4 and expect prio 3 to > preempt, as would prio 2 preempt prio 3. While actual behaviour will not > match - 3 will not preempt 4. It isn't really exposed to userspace, but perhaps it should be.. Userspace just knows that, to the extent possible, the kernel will try to execute prio 3 before prio 4. > Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a > quick peek in Mesa but did not spot it - although I am not really at > home there yet so maybe I missed it. Yes, there is an EGL extension: https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt It is pretty limited, it only exposes three priority levels. BR, -R > > ----- > > > > btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm > > trying to add an igt test to stress shrinker/eviction, similar to the > > existing tests/i915/gem_shrink.c. But we hit an unfortunate > > combination of circumstances: > > 1. 
Pinning memory happens in the synchronous part of the submit ioctl, > > before enqueuing the job for the kthread to handle. > > 2. The first run_job() callback incurs a slight delay (~1.5ms) while > > resuming the GPU > > 3. Because of that delay, userspace has a chance to queue up enough > > more jobs to require locking/pinning more than the available system > > RAM.. > > Is that one or multiple threads submitting jobs? > > > I'm not sure if we want a way to prevent userspace from getting *too* > > far ahead of the kthread. Or maybe at some point the shrinker should > > sleep on non-idle buffers? > > On the direct reclaim path when invoked from the submit ioctl? In i915 > we only shrink idle objects on direct reclaim and leave active ones for > the swapper. It depends on how your locking looks like whether you could > do them, whether there would be coupling of locks and fs-reclaim context. > > Regards, > > Tvrtko > > > BR, > > -R
On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote: > > On 23/05/2022 23:53, Rob Clark wrote: > > > > btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm > > trying to add an igt test to stress shrinker/eviction, similar to the > > existing tests/i915/gem_shrink.c. But we hit an unfortunate > > combination of circumstances: > > 1. Pinning memory happens in the synchronous part of the submit ioctl, > > before enqueuing the job for the kthread to handle. > > 2. The first run_job() callback incurs a slight delay (~1.5ms) while > > resuming the GPU > > 3. Because of that delay, userspace has a chance to queue up enough > > more jobs to require locking/pinning more than the available system > > RAM.. > > Is that one or multiple threads submitting jobs? In this case multiple.. but I think it could also happen with a single thread (provided it didn't stall on a fence, directly or indirectly, from an earlier submit), because of how resume and actual job submission happens from scheduler kthread. > > I'm not sure if we want a way to prevent userspace from getting *too* > > far ahead of the kthread. Or maybe at some point the shrinker should > > sleep on non-idle buffers? > > On the direct reclaim path when invoked from the submit ioctl? In i915 > we only shrink idle objects on direct reclaim and leave active ones for > the swapper. It depends on how your locking looks like whether you could > do them, whether there would be coupling of locks and fs-reclaim context. I think the locking is more or less ok, although lockdep is unhappy about one thing[1] which is I think a false warning (ie. not recognizing that we'd already successfully acquired the obj lock via trylock). We can already reclaim idle bo's in this path. But the problem with a bunch of submits queued up in the scheduler, is that they are already considered pinned and active. 
So at some point we need to sleep (hopefully interruptibly) until they are no longer active, ie. to throttle userspace trying to shove in more submits until some of the enqueued ones have a chance to run and complete. BR, -R [1] https://gitlab.freedesktop.org/drm/msm/-/issues/14 > Regards, > > Tvrtko
On Tue, May 24, 2022 at 7:57 AM Rob Clark <robdclark@gmail.com> wrote: > > On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin > <tvrtko.ursulin@linux.intel.com> wrote: > > > > On 23/05/2022 23:53, Rob Clark wrote: > > > > > > btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm > > > trying to add an igt test to stress shrinker/eviction, similar to the > > > existing tests/i915/gem_shrink.c. But we hit an unfortunate > > > combination of circumstances: > > > 1. Pinning memory happens in the synchronous part of the submit ioctl, > > > before enqueuing the job for the kthread to handle. > > > 2. The first run_job() callback incurs a slight delay (~1.5ms) while > > > resuming the GPU > > > 3. Because of that delay, userspace has a chance to queue up enough > > > more jobs to require locking/pinning more than the available system > > > RAM.. > > > > Is that one or multiple threads submitting jobs? > > In this case multiple.. but I think it could also happen with a single > thread (provided it didn't stall on a fence, directly or indirectly, > from an earlier submit), because of how resume and actual job > submission happens from scheduler kthread. > > > > I'm not sure if we want a way to prevent userspace from getting *too* > > > far ahead of the kthread. Or maybe at some point the shrinker should > > > sleep on non-idle buffers? > > > > On the direct reclaim path when invoked from the submit ioctl? In i915 > > we only shrink idle objects on direct reclaim and leave active ones for > > the swapper. It depends on how your locking looks like whether you could > > do them, whether there would be coupling of locks and fs-reclaim context. > > I think the locking is more or less ok, although lockdep is unhappy > about one thing[1] which is I think a false warning (ie. not > recognizing that we'd already successfully acquired the obj lock via > trylock). We can already reclaim idle bo's in this path. 
But the > problem with a bunch of submits queued up in the scheduler, is that > they are already considered pinned and active. So at some point we > need to sleep (hopefully interruptabley) until they are no longer > active, ie. to throttle userspace trying to shove in more submits > until some of the enqueued ones have a chance to run and complete. > > BR, > -R > > [1] https://gitlab.freedesktop.org/drm/msm/-/issues/14 > btw, one thing I'm thinking about is __GFP_RETRY_MAYFAIL for gem bo's.. I'd need to think about the various code paths that could trigger us to need to allocate pages, but short-circuiting the out_of_memory() path deep in drm_gem_get_pages() -> shmem_read_mapping_page() -> ... -> __alloc_pages_may_oom() and letting the driver decide itself if there is queued work worth waiting on (and if not, calling out_of_memory() directly itself) seems like a possible solution.. that also untangles the interrupted-syscall case so we don't end up having to block in a non-interruptible way. Seems like it might work? BR, -R
On 24/05/2022 15:50, Rob Clark wrote: > On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin > <tvrtko.ursulin@linux.intel.com> wrote: >> >> On 23/05/2022 23:53, Rob Clark wrote: >>>> I am exploring how different drivers handle priority levels and this >>>> caught my eye. >>>> >>>> Is the implication of the last paragraphs that on hw with nr_rings > 1, >>>> ring + 1 preempts ring? >>> >>> Other way around, at least from the uabi standpoint. Ie. ring[0] >>> preempts ring[1] >> >> Ah yes, I figure it out from the comments but then confused myself when >> writing the email.
>> >>>> If so I am wondering does the "spreading" of >>>> user visible priorities by NR_SCHED_PRIORITIES create non-preemptable >>>> levels within every "bucket" or how does that work? >>> >>> So, preemption is possible between any priority level before run_job() >>> gets called, which writes the job into the ringbuffer. After that >> >> Hmm how? Before run_job() the jobs are not runnable, sitting in the >> scheduler queues, right? > > I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on > prio[1] could be executed after submit B on prio[2] provided that > run_job(submitA) hasn't happened yet. So I guess it isn't "really" > preemption because the submit hasn't started running on the GPU yet. > But rather just scheduling according to priority. > >>> point, you only have "bucket" level preemption, because >>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO >>> ringbuffer. >> >> Right, and you have one GPU with four rings, which means you expose 12 >> priority levels to userspace, did I get that right? > > Correct > >> If so how do you convey in the ABI that not all these priority levels >> are equal? Like userspace can submit at prio 4 and expect prio 3 to >> preempt, as would prio 2 preempt prio 3. While actual behaviour will not >> match - 3 will not preempt 4. > > It isn't really exposed to userspace, but perhaps it should be.. > Userspace just knows that, to the extent possible, the kernel will try > to execute prio 3 before prio 4. > >> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a >> quick peek in Mesa but did not spot it - although I am not really at >> home there yet so maybe I missed it. > > Yes, there is an EGL extension: > > https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt > > It is pretty limited, it only exposes three priority levels. Right, is that wired up on msm?
And if it is, or could be, how do/would you map the three priority levels for GPUs which expose 3 priority levels versus the one which exposes 12? Is it doable properly without leaking the drm/sched internal implementation detail of three priority levels? Or if you went the other way and only exposed up to max 3 levels, then you lose one priority level your hardware supports, which is also not good. It is all quite interesting because your hardware is completely different from ours in this respect. In our case i915 decides when to preempt, hardware has no concept of priority (*). Regards, Tvrtko (*) Almost no concept of priority in hardware - we do have it on new GPUs and only on a subset of engine classes where render and compute share the EUs. But I think it's way different from Adrenos.
On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote: [snip quoted patch and earlier discussion] > Right, is that wired up on msm?
And if it is, or could be, how do/would > you map the three priority levels for GPUs which expose 3 priority > levels versus the one which exposes 12? We don't yet, but probably should, expose a cap to indicate to userspace the # of hw rings vs # of levels of sched priority > Is it doable properly without leaking the fact drm/sched internal > implementation detail of three priority levels? Or if you went the other > way and only exposed up to max 3 levels, then you lose one priority > level your hardware suppose which is also not good. > > It is all quite interesting because your hardware is completely > different from ours in this respect. In our case i915 decides when to > preempt, hardware has no concept of priority (*). It is really pretty much all in firmware.. a6xx is the first gen that could do actual (non-cooperative) preemption (but that isn't implemented yet in upstream driver) BR, -R > Regards, > > Tvrtko > > (*) Almost no concept of priority in hardware - we do have it on new > GPUs and only on a subset of engine classes where render and compute > share the EUs. But I think it's way different from Ardenos.
On 24/05/2022 15:57, Rob Clark wrote: > On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin > <tvrtko.ursulin@linux.intel.com> wrote: >> >> On 23/05/2022 23:53, Rob Clark wrote: >>> >>> btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm >>> trying to add an igt test to stress shrinker/eviction, similar to the >>> existing tests/i915/gem_shrink.c. But we hit an unfortunate >>> combination of circumstances: >>> 1. Pinning memory happens in the synchronous part of the submit ioctl, >>> before enqueuing the job for the kthread to handle. >>> 2. The first run_job() callback incurs a slight delay (~1.5ms) while >>> resuming the GPU >>> 3. Because of that delay, userspace has a chance to queue up enough >>> more jobs to require locking/pinning more than the available system >>> RAM.. >> >> Is that one or multiple threads submitting jobs? > > In this case multiple.. but I think it could also happen with a single > thread (provided it didn't stall on a fence, directly or indirectly, > from an earlier submit), because of how resume and actual job > submission happen from the scheduler kthread. > >>> I'm not sure if we want a way to prevent userspace from getting *too* >>> far ahead of the kthread. Or maybe at some point the shrinker should >>> sleep on non-idle buffers? >> >> On the direct reclaim path when invoked from the submit ioctl? In i915 >> we only shrink idle objects on direct reclaim and leave active ones for >> the swapper. It depends on what your locking looks like, whether you could >> do them, whether there would be coupling of locks and fs-reclaim context. > > I think the locking is more or less ok, although lockdep is unhappy > about one thing[1] which is I think a false warning (ie. not > recognizing that we'd already successfully acquired the obj lock via > trylock). We can already reclaim idle bo's in this path. But the > problem with a bunch of submits queued up in the scheduler, is that > they are already considered pinned and active.
So at some point we > > need to sleep (hopefully interruptibly) until they are no longer > > active, ie. to throttle userspace trying to shove in more submits > > until some of the enqueued ones have a chance to run and complete. Odd I did not think trylock could trigger that. Looking at your code it indeed seems two trylocks. I am pretty sure we use the same trylock trick to avoid it. I am confused.. Otherwise if you can afford to sleep you can of course throttle organically via direct reclaim. Unless I am forgetting some key gotcha - it's been a while I've been active in this area. Regards, Tvrtko > > BR, > -R > > [1] https://gitlab.freedesktop.org/drm/msm/-/issues/14 > >> Regards, >> >> Tvrtko >> >>> BR, >>> -R >>> >>>> >>>> Regards, >>>> >>>> Tvrtko >>>> >>>>> + */ >>>>> +static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio, >>>>> + unsigned *ring_nr, enum drm_sched_priority *sched_prio) >>>>> +{ >>>>> + unsigned rn, sp; >>>>> + >>>>> + rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp); >>>>> + >>>>> + /* invert sched priority to map to higher-numeric-is-higher- >>>>> + * priority convention >>>>> + */ >>>>> + sp = NR_SCHED_PRIORITIES - sp - 1; >>>>> + >>>>> + if (rn >= gpu->nr_rings) >>>>> + return -EINVAL; >>>>> + >>>>> + *ring_nr = rn; >>>>> + *sched_prio = sp; >>>>> + >>>>> + return 0; >>>>> +} >>>>> + >>>>> /** >>>>> * A submitqueue is associated with a gl context or vk queue (or equiv) >>>>> * in userspace. >>>>> @@ -257,7 +310,8 @@ struct msm_gpu_perfcntr { >>>>> * @id: userspace id for the submitqueue, unique within the drm_file >>>>> * @flags: userspace flags for the submitqueue, specified at creation >>>>> * (currently unused) >>>>> - * @prio: the submitqueue priority >>>>> + * @ring_nr: the ringbuffer used by this submitqueue, which is determined >>>>> + * by the submitqueue's priority >>>>> * @faults: the number of GPU hangs associated with this submitqueue >>>>> * @ctx: the per-drm_file context associated with the submitqueue (ie.
>>>>> * which set of pgtables do submits jobs associated with the >>>>> @@ -272,7 +326,7 @@ struct msm_gpu_perfcntr { >>>>> struct msm_gpu_submitqueue { >>>>> int id; >>>>> u32 flags; >>>>> - u32 prio; >>>>> + u32 ring_nr; >>>>> int faults; >>>>> struct msm_file_private *ctx; >>>>> struct list_head node; >>>>> diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c >>>>> index 682ba2a7c0ec..32a55d81b58b 100644 >>>>> --- a/drivers/gpu/drm/msm/msm_submitqueue.c >>>>> +++ b/drivers/gpu/drm/msm/msm_submitqueue.c >>>>> @@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, >>>>> struct msm_gpu_submitqueue *queue; >>>>> struct msm_ringbuffer *ring; >>>>> struct drm_gpu_scheduler *sched; >>>>> + enum drm_sched_priority sched_prio; >>>>> + unsigned ring_nr; >>>>> int ret; >>>>> >>>>> if (!ctx) >>>>> @@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, >>>>> if (!priv->gpu) >>>>> return -ENODEV; >>>>> >>>>> - if (prio >= priv->gpu->nr_rings) >>>>> - return -EINVAL; >>>>> + ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio); >>>>> + if (ret) >>>>> + return ret; >>>>> >>>>> queue = kzalloc(sizeof(*queue), GFP_KERNEL); >>>>> >>>>> @@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, >>>>> >>>>> kref_init(&queue->ref); >>>>> queue->flags = flags; >>>>> - queue->prio = prio; >>>>> + queue->ring_nr = ring_nr; >>>>> >>>>> - ring = priv->gpu->rb[prio]; >>>>> + ring = priv->gpu->rb[ring_nr]; >>>>> sched = &ring->sched; >>>>> >>>>> - /* >>>>> - * TODO we can allow more priorities than we have ringbuffers by >>>>> - * mapping: >>>>> - * >>>>> - * ring = prio / 3; >>>>> - * ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3); >>>>> - * >>>>> - * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is >>>>> - * treated specially in places. 
>>>>> - */ >>>>> ret = drm_sched_entity_init(&queue->entity, >>>>> - DRM_SCHED_PRIORITY_NORMAL, >>>>> - &sched, 1, NULL); >>>>> + sched_prio, &sched, 1, NULL); >>>>> if (ret) { >>>>> kfree(queue); >>>>> return ret; >>>>> @@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx, >>>>> int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx) >>>>> { >>>>> struct msm_drm_private *priv = drm->dev_private; >>>>> - int default_prio; >>>>> + int default_prio, max_priority; >>>>> >>>>> if (!priv->gpu) >>>>> return -ENODEV; >>>>> >>>>> + max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1; >>>>> + >>>>> /* >>>>> - * Select priority 2 as the "default priority" unless nr_rings is less >>>>> - * than 2 and then pick the lowest priority >>>>> + * Pick a medium priority level as default. Lower numeric value is >>>>> + * higher priority, so round-up to pick a priority that is not higher >>>>> + * than the middle priority level. >>>>> */ >>>>> - default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1); >>>>> + default_prio = DIV_ROUND_UP(max_priority, 2); >>>>> >>>>> INIT_LIST_HEAD(&ctx->submitqueues); >>>>> >>>>> diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h >>>>> index f075851021c3..6b8fffc28a50 100644 >>>>> --- a/include/uapi/drm/msm_drm.h >>>>> +++ b/include/uapi/drm/msm_drm.h >>>>> @@ -73,11 +73,19 @@ struct drm_msm_timespec { >>>>> #define MSM_PARAM_MAX_FREQ 0x04 >>>>> #define MSM_PARAM_TIMESTAMP 0x05 >>>>> #define MSM_PARAM_GMEM_BASE 0x06 >>>>> -#define MSM_PARAM_NR_RINGS 0x07 >>>>> +#define MSM_PARAM_PRIORITIES 0x07 /* The # of priority levels */ >>>>> #define MSM_PARAM_PP_PGTABLE 0x08 /* => 1 for per-process pagetables, else 0 */ >>>>> #define MSM_PARAM_FAULTS 0x09 >>>>> #define MSM_PARAM_SUSPENDS 0x0a >>>>> >>>>> +/* For backwards compat. 
The original support for preemption was based on >>>>> + * a single ring per priority level so # of priority levels equals the # >>>>> + * of rings. With drm/scheduler providing additional levels of priority, >>>>> + * the number of priorities is greater than the # of rings. The param is >>>>> + * renamed to better reflect this. >>>>> + */ >>>>> +#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES >>>>> + >>>>> struct drm_msm_param { >>>>> __u32 pipe; /* in, MSM_PIPE_x */ >>>>> __u32 param; /* in, MSM_PARAM_x */ >>>>> @@ -304,6 +312,10 @@ struct drm_msm_gem_madvise { >>>>> >>>>> #define MSM_SUBMITQUEUE_FLAGS (0) >>>>> >>>>> +/* >>>>> + * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1, >>>>> + * a lower numeric value is higher priority. >>>>> + */ >>>>> struct drm_msm_submitqueue { >>>>> __u32 flags; /* in, MSM_SUBMITQUEUE_x */ >>>>> __u32 prio; /* in, Priority level */
On 25/05/2022 14:41, Rob Clark wrote: [snip quoted patch and earlier discussion] >> Right, is that wired up on msm?
And if it is, or could be, how do/would >> you map the three priority levels for GPUs which expose 3 priority >> levels versus the one which exposes 12? > > We don't yet, but probably should, expose a cap to indicate to > userspace the # of hw rings vs # of levels of sched priority What bothers me is the question of whether this setup provides a consistent benefit. Why would userspace use other than "real" (hardware) priority levels on chips where they are available? For instance if you exposed 4 instead of 12 on a respective platform, would that be better or worse? Yes, you could only map three directly to drm/sched and one would have to be "fake". Like: hw prio 0 -> drm/sched 2 hw prio 1 -> drm/sched 1 hw prio 2 -> drm/sched 0 hw prio 3 -> drm/sched 0 Not saying that's nice either. Perhaps the answer is that drm/sched needs more flexibility, for instance, if it wants to be widely used. For instance in i915 uapi we have priority as int -1023 - +1023. And matching implementation on some platforms, until the new ones which are GuC firmware based, where we need to squash that to low/normal/high. So the thinking was that drm/sched happens to align with GuC. But then we have your hw where it doesn't seem to. Regards, Tvrtko >> Is it doable properly without leaking the drm/sched internal >> implementation detail of three priority levels? Or if you went the other >> way and only exposed up to max 3 levels, then you lose one priority >> level your hardware supports, which is also not good. >> >> It is all quite interesting because your hardware is completely >> different from ours in this respect. In our case i915 decides when to >> preempt, hardware has no concept of priority (*). > > It is really pretty much all in firmware..
a6xx is the first gen that > could do actual (non-cooperative) preemption (but that isn't > implemented yet in upstream driver) > > BR, > -R > >> Regards, >> >> Tvrtko >> >> (*) Almost no concept of priority in hardware - we do have it on new >> GPUs and only on a subset of engine classes where render and compute >> share the EUs. But I think it's way different from Ardenos.
On Wed, May 25, 2022 at 9:11 AM Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote: [snip quoted shrinker discussion] > > I think the locking is more or less ok, although lockdep is unhappy > > about one thing[1] which is I think a false warning (ie. not > > recognizing that we'd already successfully acquired the obj lock via > > trylock).
We can already reclaim idle bo's in this path. But the > > problem with a bunch of submits queued up in the scheduler, is that > > they are already considered pinned and active. So at some point we > > need to sleep (hopefully interruptibly) until they are no longer > > active, ie. to throttle userspace trying to shove in more submits > > until some of the enqueued ones have a chance to run and complete. > > Odd I did not think trylock could trigger that. Looking at your code it > indeed seems two trylocks. I am pretty sure we use the same trylock > trick to avoid it. I am confused.. The sequence is, 1. kref_get_unless_zero() 2. trylock, which succeeds 3. attempt to evict or purge (which may or may not have succeeded) 4. unlock ... meanwhile this has raced with submit (aka execbuf) finishing and retiring and dropping the *other* remaining reference to the bo... 5. drm_gem_object_put() which triggers drm_gem_object_free() 6. in our free path we acquire the obj lock again and then drop it. Which arguably is unnecessary and only serves to satisfy some GEM_WARN_ON(!msm_gem_is_locked(obj)) in code paths that are also used elsewhere. Lockdep doesn't realize the previously successful trylock+unlock sequence happened, so it assumes that the code that triggered recursion into the shrinker could be holding the object's lock. > > Otherwise if you can afford to sleep you can of course throttle > organically via direct reclaim. Unless I am forgetting some key gotcha - > it's been a while I've been active in this area. So, one thing that is awkward about sleeping in this path is that there is no way to propagate back -EINTR, so we end up doing an uninterruptible sleep in something that could be called indirectly from a userspace syscall.. i915 seems to deal with this by limiting it to shrinker being called from kswapd. I think in the shrinker we want to know whether it is ok to sleep (ie. not a syscall-triggered codepath), and whether we are under enough memory pressure to justify sleeping.
For the syscall path, I'm playing with something that lets me pass __GFP_RETRY_MAYFAIL | __GFP_NOWARN to shmem_read_mapping_page_gfp(), and then stall after the shrinker has failed, somewhere where we can make it interruptible. Ofc, that doesn't help with all the other random memory allocations which can fail, so not sure if it will turn out to be a good approach or not. But I guess pinning the GEM bo's is the single biggest potential consumer of pages in the submit path, so maybe it will be better than nothing. BR, -R > > Regards, > > Tvrtko > > > > > BR, > > -R > > > > [1] https://gitlab.freedesktop.org/drm/msm/-/issues/14 > >
On Wed, May 25, 2022 at 9:22 AM Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote: > > > On 25/05/2022 14:41, Rob Clark wrote: > > On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin > > <tvrtko.ursulin@linux.intel.com> wrote: > >> > >> > >> On 24/05/2022 15:50, Rob Clark wrote: > >>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin > >>> <tvrtko.ursulin@linux.intel.com> wrote: > >>>> > >>>> > >>>> On 23/05/2022 23:53, Rob Clark wrote: > >>>>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin > >>>>> <tvrtko.ursulin@linux.intel.com> wrote: > >>>>>> > >>>>>> > >>>>>> Hi Rob, > >>>>>> > >>>>>> On 28/07/2021 02:06, Rob Clark wrote: > >>>>>>> From: Rob Clark <robdclark@chromium.org> > >>>>>>> > >>>>>>> The drm/scheduler provides additional prioritization on top of that > >>>>>>> provided by however many number of ringbuffers (each with their own > >>>>>>> priority level) is supported on a given generation. Expose the > >>>>>>> additional levels of priority to userspace and map the userspace > >>>>>>> priority back to ring (first level of priority) and schedular priority > >>>>>>> (additional priority levels within the ring). 
> >>>>>>> > >>>>>>> Signed-off-by: Rob Clark <robdclark@chromium.org> > >>>>>>> Acked-by: Christian König <christian.koenig@amd.com> > >>>>>>> --- > >>>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +- > >>>>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +- > >>>>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++- > >>>>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++-------- > >>>>>>> include/uapi/drm/msm_drm.h | 14 +++++- > >>>>>>> 5 files changed, 88 insertions(+), 27 deletions(-) > >>>>>>> > >>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > >>>>>>> index bad4809b68ef..748665232d29 100644 > >>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > >>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > >>>>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value) > >>>>>>> return ret; > >>>>>>> } > >>>>>>> return -EINVAL; > >>>>>>> - case MSM_PARAM_NR_RINGS: > >>>>>>> - *value = gpu->nr_rings; > >>>>>>> + case MSM_PARAM_PRIORITIES: > >>>>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES; > >>>>>>> return 0; > >>>>>>> case MSM_PARAM_PP_PGTABLE: > >>>>>>> *value = 0; > >>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c > >>>>>>> index 450efe59abb5..c2ecec5b11c4 100644 > >>>>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c > >>>>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c > >>>>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev, > >>>>>>> submit->gpu = gpu; > >>>>>>> submit->cmd = (void *)&submit->bos[nr_bos]; > >>>>>>> submit->queue = queue; > >>>>>>> - submit->ring = gpu->rb[queue->prio]; > >>>>>>> + submit->ring = gpu->rb[queue->ring_nr]; > >>>>>>> submit->fault_dumped = false; > >>>>>>> > >>>>>>> INIT_LIST_HEAD(&submit->node); > >>>>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data, > >>>>>>> /* Get a unique identifier for the 
submission for logging purposes */ > >>>>>>> submitid = atomic_inc_return(&ident) - 1; > >>>>>>> > >>>>>>> - ring = gpu->rb[queue->prio]; > >>>>>>> + ring = gpu->rb[queue->ring_nr]; > >>>>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid, > >>>>>>> args->nr_bos, args->nr_cmds); > >>>>>>> > >>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h > >>>>>>> index b912cacaecc0..0e4b45bff2e6 100644 > >>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h > >>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h > >>>>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr { > >>>>>>> const char *name; > >>>>>>> }; > >>>>>>> > >>>>>>> +/* > >>>>>>> + * The number of priority levels provided by drm gpu scheduler. The > >>>>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some > >>>>>>> + * cases, so we don't use it (no need for kernel generated jobs). > >>>>>>> + */ > >>>>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN) > >>>>>>> + > >>>>>>> +/** > >>>>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority > >>>>>>> + * > >>>>>>> + * @gpu: the gpu instance > >>>>>>> + * @prio: the userspace priority level > >>>>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to > >>>>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace > >>>>>>> + * priority maps to > >>>>>>> + * > >>>>>>> + * With drm/scheduler providing it's own level of prioritization, our total > >>>>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES). > >>>>>>> + * Each ring is associated with it's own scheduler instance. However, our > >>>>>>> + * UABI is that lower numerical values are higher priority. So mapping the > >>>>>>> + * single userspace priority level into ring_nr and sched_prio takes some > >>>>>>> + * care. 
The userspace provided priority (when a submitqueue is created) > >>>>>>> + * is mapped to ring nr and scheduler priority as such: > >>>>>>> + * > >>>>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES > >>>>>>> + * sched_prio = NR_SCHED_PRIORITIES - > >>>>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1 > >>>>>>> + * > >>>>>>> + * This allows generations without preemption (nr_rings==1) to have some > >>>>>>> + * amount of prioritization, and provides more priority levels for gens > >>>>>>> + * that do have preemption. > >>>>>> > >>>>>> I am exploring how different drivers handle priority levels and this > >>>>>> caught my eye. > >>>>>> > >>>>>> Is the implication of the last paragraphs that on hw with nr_rings > 1, > >>>>>> ring + 1 preempts ring? > >>>>> > >>>>> Other way around, at least from the uabi standpoint. Ie. ring[0] > >>>>> preempts ring[1] > >>>> > >>>> Ah yes, I figure it out from the comments but then confused myself when > >>>> writing the email. > >>>> > >>>>>> If so I am wondering does the "spreading" of > >>>>>> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable > >>>>>> levels within every "bucket" or how does that work? > >>>>> > >>>>> So, preemption is possible between any priority level before run_job() > >>>>> gets called, which writes the job into the ringbuffer. After that > >>>> > >>>> Hmm how? Before run_job() the jobs are not runnable, sitting in the > >>>> scheduler queues, right? > >>> > >>> I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on > >>> prio[1] could be executed after submit B on prio[2] provided that > >>> run_job(submitA) hasn't happened yet. So I guess it isn't "really" > >>> preemption because the submit hasn't started running on the GPU yet. > >>> But rather just scheduling according to priority. > >>> > >>>>> point, you only have "bucket" level preemption, because > >>>>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO > >>>>> ringbuffer. 
> >>>> > >>>> Right, and you have one GPU with four rings, which means you expose 12 > >>>> priority levels to userspace, did I get that right? > >>> > >>> Correct > >>> > >>>> If so how do you convey in the ABI that not all there priority levels > >>>> are equal? Like userspace can submit at prio 4 and expect prio 3 to > >>>> preempt, as would prio 2 preempt prio 3. While actual behaviour will not > >>>> match - 3 will not preempt 4. > >>> > >>> It isn't really exposed to userspace, but perhaps it should be.. > >>> Userspace just knows that, to the extent possible, the kernel will try > >>> to execute prio 3 before prio 4. > >>> > >>>> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a > >>>> quick peek in Mesa but did not spot it - although I am not really at > >>>> home there yet so maybe I missed it. > >>> > >>> Yes, there is an EGL extension: > >>> > >>> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt > >>> > >>> It is pretty limited, it only exposes three priority levels. > >> > >> Right, is that wired up on msm? And if it is, or could be, how do/would > >> you map the three priority levels for GPUs which expose 3 priority > >> levels versus the one which exposes 12? > > > > We don't yet, but probably should, expose a cap to indicate to > > userspace the # of hw rings vs # of levels of sched priority > > What bothers me is the question of whether this setup provides a > consistent benefit. Why would userspace use other than "real" (hardware) > priority levels on chips where they are available? yeah, perhaps we could decide that userspace doesn't really need more than 3 prio levels, and that on generations which have better preemption than what drm/sched provides, *only* expose those priority levels. I've avoided that so far because it seems wrong for the kernel to assume that a single EGL extension is all there is when it comes to userspace context priority.. 
the other option is to expose more information to userspace and let it decide. Honestly, the combination of the fact that a6xx is the first gen shipping in consumer products with upstream driver (using drm/sched), and not having had time yet to implement hw preemption for a6xx, means not a whole lot of thought has gone into the current arrangement ;-) > For instance if you exposed 4 instead of 12 on a respective platform, > would that be better or worse? Yes you could only map three directly > drm/sched and one would have to be "fake". Like: > > hw prio 0 -> drm/sched 2 > hw prio 1 -> drm/sched 1 > hw prio 2 -> drm/sched 0 > hw prio 3 -> drm/sched 0 > > Not saying that's nice either. Perhaps the answer is that drm/sched > needs more flexibility for instance if it wants to be widely used. I'm not sure what I'd add to drm/sched.. once it calls run_job() things are out of its hands, so really all it can do is re-order things prior to calling run_job() according to its internal priority levels. And that is still better than no re-ordering so it adds some value, even if not complete. > For instance in i915 uapi we have priority as int -1023 - +1023. And > matching implementation on some platforms, until the new ones which are > GuC firmware based, where we need to squash that to low/normal/high. hmm, that is a more awkward problem, since it sounds like you are mapping many more priority levels into a much smaller set of hw priority levels. Do you have separate drm_sched instances per hw priority level? If so you can do the same thing of using drm_sched priority levels to multiply # of hw priority levels, but ofc that is not perfect (and won't get you to 2k). But is there anything that actually *uses* that many levels of priority? BR, -R > So thinking was drm/sched happens to align with GuC. But then we have > your hw where it doesn't seem to.
> > Regards, > > Tvrtko > > >> Is it doable properly without leaking the fact drm/sched internal > >> implementation detail of three priority levels? Or if you went the other > >> way and only exposed up to max 3 levels, then you lose one priority > >> level your hardware suppose which is also not good. > >> > >> It is all quite interesting because your hardware is completely > >> different from ours in this respect. In our case i915 decides when to > >> preempt, hardware has no concept of priority (*). > > > > It is really pretty much all in firmware.. a6xx is the first gen that > > could do actual (non-cooperative) preemption (but that isn't > > implemented yet in upstream driver) > > > > BR, > > -R > > > >> Regards, > >> > >> Tvrtko > >> > >> (*) Almost no concept of priority in hardware - we do have it on new > >> GPUs and only on a subset of engine classes where render and compute > >> share the EUs. But I think it's way different from Ardenos.
On 26/05/2022 04:37, Rob Clark wrote: > On Wed, May 25, 2022 at 9:22 AM Tvrtko Ursulin > <tvrtko.ursulin@linux.intel.com> wrote: >> >> >> On 25/05/2022 14:41, Rob Clark wrote: >>> On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin >>> <tvrtko.ursulin@linux.intel.com> wrote: >>>> >>>> >>>> On 24/05/2022 15:50, Rob Clark wrote: >>>>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin >>>>> <tvrtko.ursulin@linux.intel.com> wrote: >>>>>> >>>>>> >>>>>> On 23/05/2022 23:53, Rob Clark wrote: >>>>>>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin >>>>>>> <tvrtko.ursulin@linux.intel.com> wrote: >>>>>>>> >>>>>>>> >>>>>>>> Hi Rob, >>>>>>>> >>>>>>>> On 28/07/2021 02:06, Rob Clark wrote: >>>>>>>>> From: Rob Clark <robdclark@chromium.org> >>>>>>>>> >>>>>>>>> The drm/scheduler provides additional prioritization on top of that >>>>>>>>> provided by however many number of ringbuffers (each with their own >>>>>>>>> priority level) is supported on a given generation. Expose the >>>>>>>>> additional levels of priority to userspace and map the userspace >>>>>>>>> priority back to ring (first level of priority) and schedular priority >>>>>>>>> (additional priority levels within the ring). 
>>>>>>>>> >>>>>>>>> Signed-off-by: Rob Clark <robdclark@chromium.org> >>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> >>>>>>>>> --- >>>>>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +- >>>>>>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +- >>>>>>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++- >>>>>>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++-------- >>>>>>>>> include/uapi/drm/msm_drm.h | 14 +++++- >>>>>>>>> 5 files changed, 88 insertions(+), 27 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c >>>>>>>>> index bad4809b68ef..748665232d29 100644 >>>>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c >>>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c >>>>>>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value) >>>>>>>>> return ret; >>>>>>>>> } >>>>>>>>> return -EINVAL; >>>>>>>>> - case MSM_PARAM_NR_RINGS: >>>>>>>>> - *value = gpu->nr_rings; >>>>>>>>> + case MSM_PARAM_PRIORITIES: >>>>>>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES; >>>>>>>>> return 0; >>>>>>>>> case MSM_PARAM_PP_PGTABLE: >>>>>>>>> *value = 0; >>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c >>>>>>>>> index 450efe59abb5..c2ecec5b11c4 100644 >>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c >>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c >>>>>>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev, >>>>>>>>> submit->gpu = gpu; >>>>>>>>> submit->cmd = (void *)&submit->bos[nr_bos]; >>>>>>>>> submit->queue = queue; >>>>>>>>> - submit->ring = gpu->rb[queue->prio]; >>>>>>>>> + submit->ring = gpu->rb[queue->ring_nr]; >>>>>>>>> submit->fault_dumped = false; >>>>>>>>> >>>>>>>>> INIT_LIST_HEAD(&submit->node); >>>>>>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data, >>>>>>>>> /* Get a unique identifier for the 
submission for logging purposes */ >>>>>>>>> submitid = atomic_inc_return(&ident) - 1; >>>>>>>>> >>>>>>>>> - ring = gpu->rb[queue->prio]; >>>>>>>>> + ring = gpu->rb[queue->ring_nr]; >>>>>>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid, >>>>>>>>> args->nr_bos, args->nr_cmds); >>>>>>>>> >>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h >>>>>>>>> index b912cacaecc0..0e4b45bff2e6 100644 >>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h >>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h >>>>>>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr { >>>>>>>>> const char *name; >>>>>>>>> }; >>>>>>>>> >>>>>>>>> +/* >>>>>>>>> + * The number of priority levels provided by drm gpu scheduler. The >>>>>>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some >>>>>>>>> + * cases, so we don't use it (no need for kernel generated jobs). >>>>>>>>> + */ >>>>>>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN) >>>>>>>>> + >>>>>>>>> +/** >>>>>>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority >>>>>>>>> + * >>>>>>>>> + * @gpu: the gpu instance >>>>>>>>> + * @prio: the userspace priority level >>>>>>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to >>>>>>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace >>>>>>>>> + * priority maps to >>>>>>>>> + * >>>>>>>>> + * With drm/scheduler providing it's own level of prioritization, our total >>>>>>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES). >>>>>>>>> + * Each ring is associated with it's own scheduler instance. However, our >>>>>>>>> + * UABI is that lower numerical values are higher priority. So mapping the >>>>>>>>> + * single userspace priority level into ring_nr and sched_prio takes some >>>>>>>>> + * care. 
The userspace provided priority (when a submitqueue is created) >>>>>>>>> + * is mapped to ring nr and scheduler priority as such: >>>>>>>>> + * >>>>>>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES >>>>>>>>> + * sched_prio = NR_SCHED_PRIORITIES - >>>>>>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1 >>>>>>>>> + * >>>>>>>>> + * This allows generations without preemption (nr_rings==1) to have some >>>>>>>>> + * amount of prioritization, and provides more priority levels for gens >>>>>>>>> + * that do have preemption. >>>>>>>> >>>>>>>> I am exploring how different drivers handle priority levels and this >>>>>>>> caught my eye. >>>>>>>> >>>>>>>> Is the implication of the last paragraphs that on hw with nr_rings > 1, >>>>>>>> ring + 1 preempts ring? >>>>>>> >>>>>>> Other way around, at least from the uabi standpoint. Ie. ring[0] >>>>>>> preempts ring[1] >>>>>> >>>>>> Ah yes, I figure it out from the comments but then confused myself when >>>>>> writing the email. >>>>>> >>>>>>>> If so I am wondering does the "spreading" of >>>>>>>> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable >>>>>>>> levels within every "bucket" or how does that work? >>>>>>> >>>>>>> So, preemption is possible between any priority level before run_job() >>>>>>> gets called, which writes the job into the ringbuffer. After that >>>>>> >>>>>> Hmm how? Before run_job() the jobs are not runnable, sitting in the >>>>>> scheduler queues, right? >>>>> >>>>> I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on >>>>> prio[1] could be executed after submit B on prio[2] provided that >>>>> run_job(submitA) hasn't happened yet. So I guess it isn't "really" >>>>> preemption because the submit hasn't started running on the GPU yet. >>>>> But rather just scheduling according to priority. >>>>> >>>>>>> point, you only have "bucket" level preemption, because >>>>>>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO >>>>>>> ringbuffer. 
>>>>>> >>>>>> Right, and you have one GPU with four rings, which means you expose 12 >>>>>> priority levels to userspace, did I get that right? >>>>> >>>>> Correct >>>>> >>>>>> If so how do you convey in the ABI that not all there priority levels >>>>>> are equal? Like userspace can submit at prio 4 and expect prio 3 to >>>>>> preempt, as would prio 2 preempt prio 3. While actual behaviour will not >>>>>> match - 3 will not preempt 4. >>>>> >>>>> It isn't really exposed to userspace, but perhaps it should be.. >>>>> Userspace just knows that, to the extent possible, the kernel will try >>>>> to execute prio 3 before prio 4. >>>>> >>>>>> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a >>>>>> quick peek in Mesa but did not spot it - although I am not really at >>>>>> home there yet so maybe I missed it. >>>>> >>>>> Yes, there is an EGL extension: >>>>> >>>>> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt >>>>> >>>>> It is pretty limited, it only exposes three priority levels. >>>> >>>> Right, is that wired up on msm? And if it is, or could be, how do/would >>>> you map the three priority levels for GPUs which expose 3 priority >>>> levels versus the one which exposes 12? >>> >>> We don't yet, but probably should, expose a cap to indicate to >>> userspace the # of hw rings vs # of levels of sched priority >> >> What bothers me is the question of whether this setup provides a >> consistent benefit. Why would userspace use other than "real" (hardware) >> priority levels on chips where they are available? > > yeah, perhaps we could decide that userspace doesn't really need more > than 3 prio levels, and that on generations which have better > preemption than what drm/sched provides, *only* expose those priority > levels. I've avoided that so far because it seems wrong for the > kernel to assume that a single EGL extension is all there is when it > comes to userspace context priority.. 
the other option is to expose > more information to userspace and let it decide. Maybe in msm you could reserve 0 for kernel submissions (if you have such use cases) and expose levels 1-3 via drm/sched? If you could wire that up, and if four levels is the most your hardware will have. Although with that option it seems drm/sched could starve lower priorities, I mean not give anything to the hw/fw scheduler on higher rings as long as there is work on lower. Which, if those chips have some smarter algorithm, would defeat it. So perhaps there is no way but improving drm/sched. Backend-controlled number of priorities, and backend control for whether the "in flight" jobs limit is global vs per priority level (per run queue). Btw my motivation looking into all this is that we have CPU nice and ionice supporting more levels and I'd like to tie that all together into one consistent user-friendly story (see https://patchwork.freedesktop.org/series/102348/). In a world of heterogeneous compute pipelines I think that is the way forward. I even demonstrated this from within ChromeOS, since the compositor runs at nice -5, which automatically gives it more GPU bandwidth compared to for instance Android VM. I know of other hardware supporting more than three levels, but I need to study more drm drivers to gain a complete picture. I only started with msm since it looked simple. :) > Honestly, the combination of the fact that a6xx is the first gen > shipping in consumer products with upstream driver (using drm/sched), > and not having had time yet to implement hw preemption for a6xx yet, > means not a whole lot of thought has gone into the current arrangement > ;-) :) What kind of scheduling algorithm does your hardware have between those priority levels? >> For instance if you exposed 4 instead of 12 on a respective platform, >> would that be better or worse? Yes you could only map three directly >> drm/sched and one would have to be "fake".
Like: >> >> hw prio 0 -> drm/sched 2 >> hw prio 1 -> drm/sched 1 >> hw prio 2 -> drm/sched 0 >> hw prio 3 -> drm/sched 0 >> >> Not saying that's nice either. Perhaps the answer is that drm/sched >> needs more flexibility for instance if it wants to be widely used. > > I'm not sure what I'd add to drm/sched.. once it calls run_job() > things are out of its hands, so really all it can do is re-order > things prior to calling run_job() according to it's internal priority > levels. And that is still better than no re-ordering so it adds some > value, even if not complete. Not sure about the value there - as mentioned before I see problems on the uapi front with not all priorities being equal. Besides, priority order scheduling is kind of meh to me. Especially if it only applies in the scheduling frontend. If frontend and backend algorithms do not even match then it's even more weird. IMO sooner or later GPU scheduling will have to catch up with the state of the art from the CPU world and use priority as a hint for time-sharing decisions. >> For instance in i915 uapi we have priority as int -1023 - +1023. And >> matching implementation on some platforms, until the new ones which are >> GuC firmware based, where we need to squash that to low/normal/high. > > hmm, that is a more awkward problem, since it sounds like you are > mapping many more priority levels into a much smaller set of hw > priority levels. Do you have separate drm_sched instances per hw > priority level? If so you can do the same thing of using drm_sched > priority levels to multiply # of hw priority levels, but ofc that is > not perfect (and won't get you to 2k). We don't use drm/sched yet, I was just mentioning what we have in uapi. But yes, our current scheduling backend can handle more than three levels. > But is there anything that actually *uses* that many levels of priority?
From userspace, no; there are only a few internal priority levels, for things like the heartbeats the driver sends to check engine health, and page flip priority boosts. Regards, Tvrtko
On 26/05/2022 04:15, Rob Clark wrote: > On Wed, May 25, 2022 at 9:11 AM Tvrtko Ursulin > <tvrtko.ursulin@linux.intel.com> wrote: >> >> >> On 24/05/2022 15:57, Rob Clark wrote: >>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin >>> <tvrtko.ursulin@linux.intel.com> wrote: >>>> >>>> On 23/05/2022 23:53, Rob Clark wrote: >>>>> >>>>> btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm >>>>> trying to add an igt test to stress shrinker/eviction, similar to the >>>>> existing tests/i915/gem_shrink.c. But we hit an unfortunate >>>>> combination of circumstances: >>>>> 1. Pinning memory happens in the synchronous part of the submit ioctl, >>>>> before enqueuing the job for the kthread to handle. >>>>> 2. The first run_job() callback incurs a slight delay (~1.5ms) while >>>>> resuming the GPU >>>>> 3. Because of that delay, userspace has a chance to queue up enough >>>>> more jobs to require locking/pinning more than the available system >>>>> RAM.. >>>> >>>> Is that one or multiple threads submitting jobs? >>> >>> In this case multiple.. but I think it could also happen with a single >>> thread (provided it didn't stall on a fence, directly or indirectly, >>> from an earlier submit), because of how resume and actual job >>> submission happens from scheduler kthread. >>> >>>>> I'm not sure if we want a way to prevent userspace from getting *too* >>>>> far ahead of the kthread. Or maybe at some point the shrinker should >>>>> sleep on non-idle buffers? >>>> >>>> On the direct reclaim path when invoked from the submit ioctl? In i915 >>>> we only shrink idle objects on direct reclaim and leave active ones for >>>> the swapper. It depends on how your locking looks like whether you could >>>> do them, whether there would be coupling of locks and fs-reclaim context. >>> >>> I think the locking is more or less ok, although lockdep is unhappy >>> about one thing[1] which is I think a false warning (ie. 
not >>> recognizing that we'd already successfully acquired the obj lock via >>> trylock). We can already reclaim idle bo's in this path. But the >>> problem with a bunch of submits queued up in the scheduler, is that >>> they are already considered pinned and active. So at some point we >>> need to sleep (hopefully interruptabley) until they are no longer >>> active, ie. to throttle userspace trying to shove in more submits >>> until some of the enqueued ones have a chance to run and complete. >> >> Odd I did not think trylock could trigger that. Looking at your code it >> indeed seems two trylocks. I am pretty sure we use the same trylock >> trick to avoid it. I am confused.. > > The sequence is, > > 1. kref_get_unless_zero() > 2. trylock, which succeeds > 3. attempt to evict or purge (which may or may not have succeeded) > 4. unlock > > ... meanwhile this has raced with submit (aka execbuf) finishing and > retiring and dropping *other* remaining reference to bo... > > 5. drm_gem_object_put() which triggers drm_gem_object_free() > 6. in our free path we acquire the obj lock again and then drop it. > Which arguably is unnecessary and only serves to satisfy some > GEM_WARN_ON(!msm_gem_is_locked(obj)) in code paths that are also used > elsewhere > > lockdep doesn't realize the previously successful trylock+unlock > sequence so it assumes that the code that triggered recursion into > shrinker could be holding the objects lock. Ah yes, missed that lock after trylock in msm_gem_shrinker/scan(). Well i915 has the same sequence in our shrinker, but the difference is we use delayed work to actually free, _and_ use trylock in the delayed worker. It does feel a bit inelegant (objects with no reference count which cannot be trylocked?!), but as this is code that Maarten recently refactored, I think it is best to sync with him for the full story. >> Otherwise if you can afford to sleep you can of course throttle >> organically via direct reclaim.
Unless I am forgetting some key gotcha - >> it's been a while I've been active in this area. > > So, one thing that is awkward about sleeping in this path is that > there is no way to propagate back -EINTR, so we end up doing an > uninterruptible sleep in something that could be called indirectly > from userspace syscall.. i915 seems to deal with this by limiting it > to shrinker being called from kswapd. I think in the shrinker we want > to know whether it is ok to sleep (ie. not syscall trigggered > codepath, and whether we are under enough memory pressure to justify > sleeping). For the syscall path, I'm playing with something that lets > me pass __GFP_RETRY_MAYFAIL | __GFP_NOWARN to > shmem_read_mapping_page_gfp(), and then stall after the shrinker has > failed, somewhere where we can make it interruptable. Ofc, that > doesn't help with all the other random memory allocations which can > fail, so not sure if it will turn out to be a good approach or not. > But I guess pinning the GEM bo's is the single biggest potential > consumer of pages in the submit path, so maybe it will be better than > nothing. We play similar games, although by a quick look I am not sure we quite manage to honour/propagate signals. This has certainly been a historically fiddly area. If you first ask for no-reclaim allocations and invoke the shrinker manually, then fall back to a bigger hammer, you should be able to do it. Regards, Tvrtko
On Thu, May 26, 2022 at 4:38 AM Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote: > > > On 26/05/2022 04:37, Rob Clark wrote: > > On Wed, May 25, 2022 at 9:22 AM Tvrtko Ursulin > > <tvrtko.ursulin@linux.intel.com> wrote: > >> > >> > >> On 25/05/2022 14:41, Rob Clark wrote: > >>> On Wed, May 25, 2022 at 2:46 AM Tvrtko Ursulin > >>> <tvrtko.ursulin@linux.intel.com> wrote: > >>>> > >>>> > >>>> On 24/05/2022 15:50, Rob Clark wrote: > >>>>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin > >>>>> <tvrtko.ursulin@linux.intel.com> wrote: > >>>>>> > >>>>>> > >>>>>> On 23/05/2022 23:53, Rob Clark wrote: > >>>>>>> On Mon, May 23, 2022 at 7:45 AM Tvrtko Ursulin > >>>>>>> <tvrtko.ursulin@linux.intel.com> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> Hi Rob, > >>>>>>>> > >>>>>>>> On 28/07/2021 02:06, Rob Clark wrote: > >>>>>>>>> From: Rob Clark <robdclark@chromium.org> > >>>>>>>>> > >>>>>>>>> The drm/scheduler provides additional prioritization on top of that > >>>>>>>>> provided by however many number of ringbuffers (each with their own > >>>>>>>>> priority level) is supported on a given generation. Expose the > >>>>>>>>> additional levels of priority to userspace and map the userspace > >>>>>>>>> priority back to ring (first level of priority) and schedular priority > >>>>>>>>> (additional priority levels within the ring). 
> >>>>>>>>> > >>>>>>>>> Signed-off-by: Rob Clark <robdclark@chromium.org> > >>>>>>>>> Acked-by: Christian König <christian.koenig@amd.com> > >>>>>>>>> --- > >>>>>>>>> drivers/gpu/drm/msm/adreno/adreno_gpu.c | 4 +- > >>>>>>>>> drivers/gpu/drm/msm/msm_gem_submit.c | 4 +- > >>>>>>>>> drivers/gpu/drm/msm/msm_gpu.h | 58 ++++++++++++++++++++++++- > >>>>>>>>> drivers/gpu/drm/msm/msm_submitqueue.c | 35 +++++++-------- > >>>>>>>>> include/uapi/drm/msm_drm.h | 14 +++++- > >>>>>>>>> 5 files changed, 88 insertions(+), 27 deletions(-) > >>>>>>>>> > >>>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > >>>>>>>>> index bad4809b68ef..748665232d29 100644 > >>>>>>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c > >>>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c > >>>>>>>>> @@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value) > >>>>>>>>> return ret; > >>>>>>>>> } > >>>>>>>>> return -EINVAL; > >>>>>>>>> - case MSM_PARAM_NR_RINGS: > >>>>>>>>> - *value = gpu->nr_rings; > >>>>>>>>> + case MSM_PARAM_PRIORITIES: > >>>>>>>>> + *value = gpu->nr_rings * NR_SCHED_PRIORITIES; > >>>>>>>>> return 0; > >>>>>>>>> case MSM_PARAM_PP_PGTABLE: > >>>>>>>>> *value = 0; > >>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c > >>>>>>>>> index 450efe59abb5..c2ecec5b11c4 100644 > >>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gem_submit.c > >>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c > >>>>>>>>> @@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev, > >>>>>>>>> submit->gpu = gpu; > >>>>>>>>> submit->cmd = (void *)&submit->bos[nr_bos]; > >>>>>>>>> submit->queue = queue; > >>>>>>>>> - submit->ring = gpu->rb[queue->prio]; > >>>>>>>>> + submit->ring = gpu->rb[queue->ring_nr]; > >>>>>>>>> submit->fault_dumped = false; > >>>>>>>>> > >>>>>>>>> INIT_LIST_HEAD(&submit->node); > >>>>>>>>> @@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct 
drm_device *dev, void *data, > >>>>>>>>> /* Get a unique identifier for the submission for logging purposes */ > >>>>>>>>> submitid = atomic_inc_return(&ident) - 1; > >>>>>>>>> > >>>>>>>>> - ring = gpu->rb[queue->prio]; > >>>>>>>>> + ring = gpu->rb[queue->ring_nr]; > >>>>>>>>> trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid, > >>>>>>>>> args->nr_bos, args->nr_cmds); > >>>>>>>>> > >>>>>>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h > >>>>>>>>> index b912cacaecc0..0e4b45bff2e6 100644 > >>>>>>>>> --- a/drivers/gpu/drm/msm/msm_gpu.h > >>>>>>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.h > >>>>>>>>> @@ -250,6 +250,59 @@ struct msm_gpu_perfcntr { > >>>>>>>>> const char *name; > >>>>>>>>> }; > >>>>>>>>> > >>>>>>>>> +/* > >>>>>>>>> + * The number of priority levels provided by drm gpu scheduler. The > >>>>>>>>> + * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some > >>>>>>>>> + * cases, so we don't use it (no need for kernel generated jobs). > >>>>>>>>> + */ > >>>>>>>>> +#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN) > >>>>>>>>> + > >>>>>>>>> +/** > >>>>>>>>> + * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority > >>>>>>>>> + * > >>>>>>>>> + * @gpu: the gpu instance > >>>>>>>>> + * @prio: the userspace priority level > >>>>>>>>> + * @ring_nr: [out] the ringbuffer the userspace priority maps to > >>>>>>>>> + * @sched_prio: [out] the gpu scheduler priority level which the userspace > >>>>>>>>> + * priority maps to > >>>>>>>>> + * > >>>>>>>>> + * With drm/scheduler providing it's own level of prioritization, our total > >>>>>>>>> + * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES). > >>>>>>>>> + * Each ring is associated with it's own scheduler instance. However, our > >>>>>>>>> + * UABI is that lower numerical values are higher priority. 
So mapping the > >>>>>>>>> + * single userspace priority level into ring_nr and sched_prio takes some > >>>>>>>>> + * care. The userspace provided priority (when a submitqueue is created) > >>>>>>>>> + * is mapped to ring nr and scheduler priority as such: > >>>>>>>>> + * > >>>>>>>>> + * ring_nr = userspace_prio / NR_SCHED_PRIORITIES > >>>>>>>>> + * sched_prio = NR_SCHED_PRIORITIES - > >>>>>>>>> + * (userspace_prio % NR_SCHED_PRIORITIES) - 1 > >>>>>>>>> + * > >>>>>>>>> + * This allows generations without preemption (nr_rings==1) to have some > >>>>>>>>> + * amount of prioritization, and provides more priority levels for gens > >>>>>>>>> + * that do have preemption. > >>>>>>>> > >>>>>>>> I am exploring how different drivers handle priority levels and this > >>>>>>>> caught my eye. > >>>>>>>> > >>>>>>>> Is the implication of the last paragraphs that on hw with nr_rings > 1, > >>>>>>>> ring + 1 preempts ring? > >>>>>>> > >>>>>>> Other way around, at least from the uabi standpoint. Ie. ring[0] > >>>>>>> preempts ring[1] > >>>>>> > >>>>>> Ah yes, I figure it out from the comments but then confused myself when > >>>>>> writing the email. > >>>>>> > >>>>>>>> If so I am wondering does the "spreading" of > >>>>>>>> user visible priorities by NR_SCHED_PRIORITIES creates a non-preemptable > >>>>>>>> levels within every "bucket" or how does that work? > >>>>>>> > >>>>>>> So, preemption is possible between any priority level before run_job() > >>>>>>> gets called, which writes the job into the ringbuffer. After that > >>>>>> > >>>>>> Hmm how? Before run_job() the jobs are not runnable, sitting in the > >>>>>> scheduler queues, right? > >>>>> > >>>>> I mean, if prio[0]+prio[1]+prio[2] map to a single ring, submit A on > >>>>> prio[1] could be executed after submit B on prio[2] provided that > >>>>> run_job(submitA) hasn't happened yet. So I guess it isn't "really" > >>>>> preemption because the submit hasn't started running on the GPU yet. 
> >>>>> But rather just scheduling according to priority. > >>>>> > >>>>>>> point, you only have "bucket" level preemption, because > >>>>>>> NR_SCHED_PRIORITIES levels of priority get mapped to a single FIFO > >>>>>>> ringbuffer. > >>>>>> > >>>>>> Right, and you have one GPU with four rings, which means you expose 12 > >>>>>> priority levels to userspace, did I get that right? > >>>>> > >>>>> Correct > >>>>> > >>>>>> If so how do you convey in the ABI that not all there priority levels > >>>>>> are equal? Like userspace can submit at prio 4 and expect prio 3 to > >>>>>> preempt, as would prio 2 preempt prio 3. While actual behaviour will not > >>>>>> match - 3 will not preempt 4. > >>>>> > >>>>> It isn't really exposed to userspace, but perhaps it should be.. > >>>>> Userspace just knows that, to the extent possible, the kernel will try > >>>>> to execute prio 3 before prio 4. > >>>>> > >>>>>> Also, does your userspace stack (EGL/Vulkan) use the priorities? I had a > >>>>>> quick peek in Mesa but did not spot it - although I am not really at > >>>>>> home there yet so maybe I missed it. > >>>>> > >>>>> Yes, there is an EGL extension: > >>>>> > >>>>> https://www.khronos.org/registry/EGL/extensions/IMG/EGL_IMG_context_priority.txt > >>>>> > >>>>> It is pretty limited, it only exposes three priority levels. > >>>> > >>>> Right, is that wired up on msm? And if it is, or could be, how do/would > >>>> you map the three priority levels for GPUs which expose 3 priority > >>>> levels versus the one which exposes 12? > >>> > >>> We don't yet, but probably should, expose a cap to indicate to > >>> userspace the # of hw rings vs # of levels of sched priority > >> > >> What bothers me is the question of whether this setup provides a > >> consistent benefit. Why would userspace use other than "real" (hardware) > >> priority levels on chips where they are available? 
> > yeah, perhaps we could decide that userspace doesn't really need more > than 3 prio levels, and that on generations which have better > preemption than what drm/sched provides, *only* expose those priority > levels. I've avoided that so far because it seems wrong for the > kernel to assume that a single EGL extension is all there is when it > comes to userspace context priority.. the other option is to expose > more information to userspace and let it decide.

> Maybe in msm you could reserve 0 for kernel submissions (if you have > such use cases) and expose levels 1-3 via drm/sched? If you could wire > that up, and if four levels is the most your hardware will have.

we fortunately don't need kernel submission for anything... that said, the limited # of priorities for drm/sched seems a bit arbitrary (although perhaps catering to the existing egl extension)

> Although with that option it seems drm/sched could starve lower > priorities, I mean not give anything to the hw/fw scheduler on higher > rings as long as there is work on lower. Which if those chips have some > smarter algorithm would defeat it.

So the thing is the (existing) gpu scheduling is strictly priority based, and not "nice" based like CPU scheduling. Those two schemes are completely different paradigms, the latter giving some boost to processes that have been blocked on I/O (which, I'm not sure there is an equiv thing for GPU) or otherwise haven't had a chance to run for a while.

> So perhaps there is no way but improving drm/sched. Backend controlled > number of priorities and backend control for whether the "in flight" jobs > limit is global vs per priority level (per run queue).

> Btw my motivation looking into all this is that we have CPU nice and > ionice supporting more levels and I'd like to tie that all together into > one consistent user friendly story (see > https://patchwork.freedesktop.org/series/102348/).
In a world of heterogeneous compute pipelines I think that is the way forward. I even demonstrated this from within ChromeOS: since the compositor uses nice -5, it automatically gets more GPU bandwidth compared to, for instance, the Android VM.

But this can be achieved with a simple priority based scheme, ie. compositor is higher priority than app. The situation changes a bit, and becomes more cpu like perhaps, when you add long running compute and cpu-offload stuff

> I know of other hardware supporting more than three levels, but I need > to study more drm drivers to gain a complete picture. I only started > with msm since it looked simple. :)

even in msm the # of priority levels is somewhat arbitrary.. but roughly it is that we tell the hw there is something higher priority to run, it waits a bit for a cooperative yield point (since forced preemption is rather expensive for 3d, ie. there is a lot of state to save/restore, not just a few cpu registers), and then eventually if a cooperative yield point isn't hit it triggers a forced preemption. (Only on newer things, older gens only had cooperative yield points to work with.)

> > Honestly, the combination of the fact that a6xx is the first gen > shipping in consumer products with an upstream driver (using drm/sched), > and not having had time yet to implement hw preemption for a6xx, > means not a whole lot of thought has gone into the current arrangement > ;-)

> :)

> What kind of scheduling algorithm does your hardware have between those > priority levels?

Like I said, it is strictly "thing A is higher priority than thing B".. there is no CSF or io-nice type thing. I guess since it is still the kernel that initiates the preemption, we could in theory implement something more clever. But I'm not entirely sure something more clever makes sense given the relatively high cost of forced preemption compared to CPU.
Ofc I could be wrong, I've not given a lot of thought to it other than more limited scenarios (ie. compositor should be higher priority than app)

BR, -R

> >> For instance if you exposed 4 instead of 12 on a respective platform, > >> would that be better or worse? Yes you could only map three directly to > >> drm/sched and one would have to be "fake". Like:
> >>
> >>   hw prio 0 -> drm/sched 2
> >>   hw prio 1 -> drm/sched 1
> >>   hw prio 2 -> drm/sched 0
> >>   hw prio 3 -> drm/sched 0
> >>
> >> Not saying that's nice either. Perhaps the answer is that drm/sched > >> needs more flexibility for instance if it wants to be widely used.

> > I'm not sure what I'd add to drm/sched.. once it calls run_job() > > things are out of its hands, so really all it can do is re-order > > things prior to calling run_job() according to its internal priority > > levels. And that is still better than no re-ordering so it adds some > > value, even if not complete.

> Not sure about the value there - as mentioned before I see problems on > the uapi front with not all priorities being equal.

> Besides, priority order scheduling is kind of meh to me. Especially if > it only applies in the scheduling frontend. If frontend and backend > algorithms do not even match then it's even more weird.

> IMO sooner or later GPU scheduling will have to catch up with state of > the art from the CPU world and use priority as a hint for time sharing > decisions.

Maybe.. that is a lot more sophisticated than the current situation of "queue A should have higher priority than queue B"

OTOH actual preemption of GPU work is a lot more expensive than preempting a CPU thread, so not even sure if we should try and look at GPU and CPU scheduling the same way. (But so far I've only looked at it as "compositor should have higher priority than app")

BR, -R

> >> For instance in i915 uapi we have priority as int -1023 - +1023.
And > >> matching implementation on some platforms, until the new ones which are > >> GuC firmware based, where we need to squash that to low/normal/high. > > > > hmm, that is a more awkward problem, since it sounds like you are > > mapping many more priority levels into a much smaller set of hw > > priority levels. Do you have separate drm_sched instances per hw > > priority level? If so you can do the same thing of using drm_sched > > priority levels to multiply # of hw priority levels, but ofc that is > > not perfect (and won't get you to 2k). > > We don't use drm/sched yet, I was just mentioning what we have in uapi. > But yes, our current scheduling backend can handle more than three levels. > > > But is there anything that actually *uses* that many levels of priority? > > From userspace no, there are only a few internal priority levels for > things like heartbeats the driver is sending to check engine health and > page flip priority boosts. > > Regards, > > Tvrtko
On Thu, May 26, 2022 at 6:29 AM Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote: > > > On 26/05/2022 04:15, Rob Clark wrote: > > On Wed, May 25, 2022 at 9:11 AM Tvrtko Ursulin > > <tvrtko.ursulin@linux.intel.com> wrote: > >> > >> > >> On 24/05/2022 15:57, Rob Clark wrote: > >>> On Tue, May 24, 2022 at 6:45 AM Tvrtko Ursulin > >>> <tvrtko.ursulin@linux.intel.com> wrote: > >>>> > >>>> On 23/05/2022 23:53, Rob Clark wrote: > >>>>> > >>>>> btw, one fun (but unrelated) issue I'm hitting with scheduler... I'm > >>>>> trying to add an igt test to stress shrinker/eviction, similar to the > >>>>> existing tests/i915/gem_shrink.c. But we hit an unfortunate > >>>>> combination of circumstances: > >>>>> 1. Pinning memory happens in the synchronous part of the submit ioctl, > >>>>> before enqueuing the job for the kthread to handle. > >>>>> 2. The first run_job() callback incurs a slight delay (~1.5ms) while > >>>>> resuming the GPU > >>>>> 3. Because of that delay, userspace has a chance to queue up enough > >>>>> more jobs to require locking/pinning more than the available system > >>>>> RAM.. > >>>> > >>>> Is that one or multiple threads submitting jobs? > >>> > >>> In this case multiple.. but I think it could also happen with a single > >>> thread (provided it didn't stall on a fence, directly or indirectly, > >>> from an earlier submit), because of how resume and actual job > >>> submission happens from scheduler kthread. > >>> > >>>>> I'm not sure if we want a way to prevent userspace from getting *too* > >>>>> far ahead of the kthread. Or maybe at some point the shrinker should > >>>>> sleep on non-idle buffers? > >>>> > >>>> On the direct reclaim path when invoked from the submit ioctl? In i915 > >>>> we only shrink idle objects on direct reclaim and leave active ones for > >>>> the swapper. It depends on how your locking looks like whether you could > >>>> do them, whether there would be coupling of locks and fs-reclaim context. 
> >>> I think the locking is more or less ok, although lockdep is unhappy > >>> about one thing[1] which I think is a false warning (ie. not > >>> recognizing that we'd already successfully acquired the obj lock via > >>> trylock). We can already reclaim idle bo's in this path. But the > >>> problem with a bunch of submits queued up in the scheduler is that > >>> they are already considered pinned and active. So at some point we > >>> need to sleep (hopefully interruptibly) until they are no longer > >>> active, ie. to throttle userspace trying to shove in more submits > >>> until some of the enqueued ones have a chance to run and complete.

> >> Odd I did not think trylock could trigger that. Looking at your code it > >> indeed seems two trylocks. I am pretty sure we use the same trylock > >> trick to avoid it. I am confused..

> > The sequence is,
> >
> > 1. kref_get_unless_zero()
> > 2. trylock, which succeeds
> > 3. attempt to evict or purge (which may or may not have succeeded)
> > 4. unlock
> >
> > ... meanwhile this has raced with submit (aka execbuf) finishing and > > retiring and dropping the *other* remaining reference to the bo...
> >
> > 5. drm_gem_object_put() which triggers drm_gem_object_free()
> > 6. in our free path we acquire the obj lock again and then drop it. > > Which arguably is unnecessary and only serves to satisfy some > > GEM_WARN_ON(!msm_gem_is_locked(obj)) in code paths that are also used > > elsewhere
> >
> > lockdep doesn't realize the previously successful trylock+unlock > > sequence so it assumes that the code that triggered recursion into > > the shrinker could be holding the object's lock.

> Ah yes, missed that lock after trylock in msm_gem_shrinker/scan(). Well > i915 has the same sequence in our shrinker, but the difference is we use > delayed work to actually free, _and_ use trylock in the delayed worker.
> It does feel a bit inelegant (objects with no reference count which > cannot be trylocked?!), but as this is the code recently refactored by > Maarten so I think best try and sync with him for the full story. ahh, we used to use delayed work for free, but realized that was causing janks where we'd get a bunch of bo's queued up to free and at some point that would cause us to miss deadlines I suppose instead we could have used an unbound wq for free instead of the same one we used (at the time, since transitioned to kthread worker to avoid being preempted by RT SF threads) for retiring submits > >> Otherwise if you can afford to sleep you can of course throttle > >> organically via direct reclaim. Unless I am forgetting some key gotcha - > >> it's been a while I've been active in this area. > > > > So, one thing that is awkward about sleeping in this path is that > > there is no way to propagate back -EINTR, so we end up doing an > > uninterruptible sleep in something that could be called indirectly > > from userspace syscall.. i915 seems to deal with this by limiting it > > to shrinker being called from kswapd. I think in the shrinker we want > > to know whether it is ok to sleep (ie. not syscall trigggered > > codepath, and whether we are under enough memory pressure to justify > > sleeping). For the syscall path, I'm playing with something that lets > > me pass __GFP_RETRY_MAYFAIL | __GFP_NOWARN to > > shmem_read_mapping_page_gfp(), and then stall after the shrinker has > > failed, somewhere where we can make it interruptable. Ofc, that > > doesn't help with all the other random memory allocations which can > > fail, so not sure if it will turn out to be a good approach or not. > > But I guess pinning the GEM bo's is the single biggest potential > > consumer of pages in the submit path, so maybe it will be better than > > nothing. > > We play similar games, although by a quick look I am not sure we quite > manage to honour/propagate signals. 
This has certainly been a > historically fiddly area. If you first ask for no reclaim allocations > and invoke the shrinker manually first, then fall back to a bigger > hammer, you should be able to do it.

yeah, I think it should.. but I've been fighting a bit today with the fact that the state of a bo wrt. shrinkability has grown a bit complicated (ie. is it purgeable, evictable, evictable if we are willing to wait a short amount of time, vs things that are pinned for scanout that we shouldn't bother waiting on, etc.. plus I managed to make it a bit worse recently with fenced un-pin of the vma, for dealing with the case that userspace notices that, for userspace-allocated iova, it can release the virtual address before the kernel has a chance to retire the submit) ;-)

BR, -R

> Regards,
>
> Tvrtko
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index bad4809b68ef..748665232d29 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -261,8 +261,8 @@ int adreno_get_param(struct msm_gpu *gpu, uint32_t param, uint64_t *value)
 			return ret;
 		}
 		return -EINVAL;
-	case MSM_PARAM_NR_RINGS:
-		*value = gpu->nr_rings;
+	case MSM_PARAM_PRIORITIES:
+		*value = gpu->nr_rings * NR_SCHED_PRIORITIES;
 		return 0;
 	case MSM_PARAM_PP_PGTABLE:
 		*value = 0;
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 450efe59abb5..c2ecec5b11c4 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -59,7 +59,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
 	submit->gpu = gpu;
 	submit->cmd = (void *)&submit->bos[nr_bos];
 	submit->queue = queue;
-	submit->ring = gpu->rb[queue->prio];
+	submit->ring = gpu->rb[queue->ring_nr];
 	submit->fault_dumped = false;

 	INIT_LIST_HEAD(&submit->node);
@@ -749,7 +749,7 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
 	/* Get a unique identifier for the submission for logging purposes */
 	submitid = atomic_inc_return(&ident) - 1;

-	ring = gpu->rb[queue->prio];
+	ring = gpu->rb[queue->ring_nr];
 	trace_msm_gpu_submit(pid_nr(pid), ring->id, submitid,
 		args->nr_bos, args->nr_cmds);

diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index b912cacaecc0..0e4b45bff2e6 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -250,6 +250,59 @@ struct msm_gpu_perfcntr {
 	const char *name;
 };

+/*
+ * The number of priority levels provided by drm gpu scheduler.  The
+ * DRM_SCHED_PRIORITY_KERNEL priority level is treated specially in some
+ * cases, so we don't use it (no need for kernel generated jobs).
+ */
+#define NR_SCHED_PRIORITIES (1 + DRM_SCHED_PRIORITY_HIGH - DRM_SCHED_PRIORITY_MIN)
+
+/**
+ * msm_gpu_convert_priority - Map userspace priority to ring # and sched priority
+ *
+ * @gpu:        the gpu instance
+ * @prio:       the userspace priority level
+ * @ring_nr:    [out] the ringbuffer the userspace priority maps to
+ * @sched_prio: [out] the gpu scheduler priority level which the userspace
+ *              priority maps to
+ *
+ * With drm/scheduler providing it's own level of prioritization, our total
+ * number of available priority levels is (nr_rings * NR_SCHED_PRIORITIES).
+ * Each ring is associated with it's own scheduler instance.  However, our
+ * UABI is that lower numerical values are higher priority.  So mapping the
+ * single userspace priority level into ring_nr and sched_prio takes some
+ * care.  The userspace provided priority (when a submitqueue is created)
+ * is mapped to ring nr and scheduler priority as such:
+ *
+ *   ring_nr    = userspace_prio / NR_SCHED_PRIORITIES
+ *   sched_prio = NR_SCHED_PRIORITIES -
+ *                (userspace_prio % NR_SCHED_PRIORITIES) - 1
+ *
+ * This allows generations without preemption (nr_rings==1) to have some
+ * amount of prioritization, and provides more priority levels for gens
+ * that do have preemption.
+ */
+static inline int msm_gpu_convert_priority(struct msm_gpu *gpu, int prio,
+		unsigned *ring_nr, enum drm_sched_priority *sched_prio)
+{
+	unsigned rn, sp;
+
+	rn = div_u64_rem(prio, NR_SCHED_PRIORITIES, &sp);
+
+	/* invert sched priority to map to higher-numeric-is-higher-
+	 * priority convention
+	 */
+	sp = NR_SCHED_PRIORITIES - sp - 1;
+
+	if (rn >= gpu->nr_rings)
+		return -EINVAL;
+
+	*ring_nr = rn;
+	*sched_prio = sp;
+
+	return 0;
+}
+
 /**
  * A submitqueue is associated with a gl context or vk queue (or equiv)
  * in userspace.
@@ -257,7 +310,8 @@ struct msm_gpu_perfcntr {
  * @id:        userspace id for the submitqueue, unique within the drm_file
  * @flags:     userspace flags for the submitqueue, specified at creation
  *             (currently unusued)
- * @prio:      the submitqueue priority
+ * @ring_nr:   the ringbuffer used by this submitqueue, which is determined
+ *             by the submitqueue's priority
  * @faults:    the number of GPU hangs associated with this submitqueue
  * @ctx:       the per-drm_file context associated with the submitqueue (ie.
  *             which set of pgtables do submits jobs associated with the
@@ -272,7 +326,7 @@ struct msm_gpu_perfcntr {
 struct msm_gpu_submitqueue {
 	int id;
 	u32 flags;
-	u32 prio;
+	u32 ring_nr;
 	int faults;
 	struct msm_file_private *ctx;
 	struct list_head node;
diff --git a/drivers/gpu/drm/msm/msm_submitqueue.c b/drivers/gpu/drm/msm/msm_submitqueue.c
index 682ba2a7c0ec..32a55d81b58b 100644
--- a/drivers/gpu/drm/msm/msm_submitqueue.c
+++ b/drivers/gpu/drm/msm/msm_submitqueue.c
@@ -68,6 +68,8 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
 	struct msm_gpu_submitqueue *queue;
 	struct msm_ringbuffer *ring;
 	struct drm_gpu_scheduler *sched;
+	enum drm_sched_priority sched_prio;
+	unsigned ring_nr;
 	int ret;

 	if (!ctx)
@@ -76,8 +78,9 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
 	if (!priv->gpu)
 		return -ENODEV;

-	if (prio >= priv->gpu->nr_rings)
-		return -EINVAL;
+	ret = msm_gpu_convert_priority(priv->gpu, prio, &ring_nr, &sched_prio);
+	if (ret)
+		return ret;

 	queue = kzalloc(sizeof(*queue), GFP_KERNEL);

@@ -86,24 +89,13 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
 	kref_init(&queue->ref);
 	queue->flags = flags;
-	queue->prio = prio;
+	queue->ring_nr = ring_nr;

-	ring = priv->gpu->rb[prio];
+	ring = priv->gpu->rb[ring_nr];
 	sched = &ring->sched;

-	/*
-	 * TODO we can allow more priorities than we have ringbuffers by
-	 * mapping:
-	 *
-	 *    ring = prio / 3;
-	 *    ent_prio = DRM_SCHED_PRIORITY_MIN + (prio % 3);
-	 *
-	 * Probably avoid using DRM_SCHED_PRIORITY_KERNEL as that is
-	 * treated specially in places.
-	 */
 	ret = drm_sched_entity_init(&queue->entity,
-			DRM_SCHED_PRIORITY_NORMAL,
-			&sched, 1, NULL);
+			sched_prio, &sched, 1, NULL);
 	if (ret) {
 		kfree(queue);
 		return ret;
@@ -134,16 +126,19 @@ int msm_submitqueue_create(struct drm_device *drm, struct msm_file_private *ctx,
 int msm_submitqueue_init(struct drm_device *drm, struct msm_file_private *ctx)
 {
 	struct msm_drm_private *priv = drm->dev_private;
-	int default_prio;
+	int default_prio, max_priority;

 	if (!priv->gpu)
 		return -ENODEV;

+	max_priority = (priv->gpu->nr_rings * NR_SCHED_PRIORITIES) - 1;
+
 	/*
-	 * Select priority 2 as the "default priority" unless nr_rings is less
-	 * than 2 and then pick the lowest priority
+	 * Pick a medium priority level as default.  Lower numeric value is
+	 * higher priority, so round-up to pick a priority that is not higher
+	 * than the middle priority level.
 	 */
-	default_prio = clamp_t(uint32_t, 2, 0, priv->gpu->nr_rings - 1);
+	default_prio = DIV_ROUND_UP(max_priority, 2);

 	INIT_LIST_HEAD(&ctx->submitqueues);

diff --git a/include/uapi/drm/msm_drm.h b/include/uapi/drm/msm_drm.h
index f075851021c3..6b8fffc28a50 100644
--- a/include/uapi/drm/msm_drm.h
+++ b/include/uapi/drm/msm_drm.h
@@ -73,11 +73,19 @@ struct drm_msm_timespec {
 #define MSM_PARAM_MAX_FREQ   0x04
 #define MSM_PARAM_TIMESTAMP  0x05
 #define MSM_PARAM_GMEM_BASE  0x06
-#define MSM_PARAM_NR_RINGS   0x07
+#define MSM_PARAM_PRIORITIES 0x07  /* The # of priority levels */
 #define MSM_PARAM_PP_PGTABLE 0x08  /* => 1 for per-process pagetables, else 0 */
 #define MSM_PARAM_FAULTS     0x09
 #define MSM_PARAM_SUSPENDS   0x0a

+/* For backwards compat.  The original support for preemption was based on
+ * a single ring per priority level so # of priority levels equals the #
+ * of rings.  With drm/scheduler providing additional levels of priority,
+ * the number of priorities is greater than the # of rings.  The param is
+ * renamed to better reflect this.
+ */
+#define MSM_PARAM_NR_RINGS MSM_PARAM_PRIORITIES
+
 struct drm_msm_param {
 	__u32 pipe;     /* in, MSM_PIPE_x */
 	__u32 param;    /* in, MSM_PARAM_x */
@@ -304,6 +312,10 @@ struct drm_msm_gem_madvise {

 #define MSM_SUBMITQUEUE_FLAGS (0)

+/*
+ * The submitqueue priority should be between 0 and MSM_PARAM_PRIORITIES-1,
+ * a lower numeric value is higher priority.
+ */
 struct drm_msm_submitqueue {
 	__u32 flags;   /* in, MSM_SUBMITQUEUE_x */
 	__u32 prio;    /* in, Priority level */