diff mbox series

[2/4] drm/i915: Cancel outstanding work after disabling heartbeats on an engine

Message ID 20200916094219.3878-2-chris@chris-wilson.co.uk
State Accepted
Commit 7a991cd3e3da9a56d5616b62d425db000a3242f2
Headers show
Series None | expand

Commit Message

Chris Wilson Sept. 16, 2020, 9:42 a.m. UTC
We only allow persistent requests to remain on the GPU past the closure
of their containing context (and process) so long as they are continuously
checked for hangs or allow other requests to preempt them, as we need to
ensure forward progress of the system. If we allow persistent contexts
to remain on the system after the the hangcheck mechanism is disabled,
the system may grind to a halt. On disabling the mechanism, we sent a
pulse along the engine to remove all executing contexts from the engine
which would check for hung contexts -- but we did not prevent those
contexts from being resubmitted if they survived the final hangcheck.

Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
Testcase: igt/gem_ctx_persistence/heartbeat-stop
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: <stable@vger.kernel.org> # v5.7+
---
 drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++
 drivers/gpu/drm/i915/i915_request.c    | 5 +++++
 2 files changed, 14 insertions(+)

Comments

Tvrtko Ursulin Sept. 24, 2020, 1:35 p.m. UTC | #1
On 16/09/2020 10:42, Chris Wilson wrote:
> We only allow persistent requests to remain on the GPU past the closure

> of their containing context (and process) so long as they are continuously

> checked for hangs or allow other requests to preempt them, as we need to

> ensure forward progress of the system. If we allow persistent contexts

> to remain on the system after the the hangcheck mechanism is disabled,

> the system may grind to a halt. On disabling the mechanism, we sent a

> pulse along the engine to remove all executing contexts from the engine

> which would check for hung contexts -- but we did not prevent those

> contexts from being resubmitted if they survived the final hangcheck.

> 

> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")

> Testcase: igt/gem_ctx_persistence/heartbeat-stop

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

> Cc: <stable@vger.kernel.org> # v5.7+

> ---

>   drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++

>   drivers/gpu/drm/i915/i915_request.c    | 5 +++++

>   2 files changed, 14 insertions(+)

> 

> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h

> index 08e2c000dcc3..7c3a1012e702 100644

> --- a/drivers/gpu/drm/i915/gt/intel_engine.h

> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h

> @@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)

>   	return intel_engine_has_preemption(engine);

>   }

>   

> +static inline bool

> +intel_engine_has_heartbeat(const struct intel_engine_cs *engine)

> +{

> +	if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))

> +		return false;

> +

> +	return READ_ONCE(engine->props.heartbeat_interval_ms);

> +}

> +

>   #endif /* _INTEL_RINGBUFFER_H_ */

> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c

> index 436ce368ddaa..0e813819b041 100644

> --- a/drivers/gpu/drm/i915/i915_request.c

> +++ b/drivers/gpu/drm/i915/i915_request.c

> @@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)

>   	if (i915_request_completed(request))

>   		goto xfer;

>   

> +	if (unlikely(intel_context_is_closed(request->context) &&

> +		     !intel_engine_has_heartbeat(engine)))

> +		intel_context_set_banned(request->context);

> +

>   	if (unlikely(intel_context_is_banned(request->context)))

>   		i915_request_set_error_once(request, -EIO);

> +

>   	if (unlikely(fatal_error(request->fence.error)))

>   		__i915_request_skip(request);

>   

> 


Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>


Regards,

Tvrtko
Joonas Lahtinen Sept. 25, 2020, 11:04 a.m. UTC | #2
Quoting Chris Wilson (2020-09-16 12:42:17)
> We only allow persistent requests to remain on the GPU past the closure

> of their containing context (and process) so long as they are continuously

> checked for hangs or allow other requests to preempt them, as we need to

> ensure forward progress of the system. If we allow persistent contexts

> to remain on the system after the the hangcheck mechanism is disabled,

> the system may grind to a halt. On disabling the mechanism, we sent a

> pulse along the engine to remove all executing contexts from the engine

> which would check for hung contexts -- but we did not prevent those

> contexts from being resubmitted if they survived the final hangcheck.

> 

> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")

> Testcase: igt/gem_ctx_persistence/heartbeat-stop

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

> Cc: <stable@vger.kernel.org> # v5.7+


Definitely makes sense to ensure.

Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>


Regards, Joonas

> ---

>  drivers/gpu/drm/i915/gt/intel_engine.h | 9 +++++++++

>  drivers/gpu/drm/i915/i915_request.c    | 5 +++++

>  2 files changed, 14 insertions(+)

> 

> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h

> index 08e2c000dcc3..7c3a1012e702 100644

> --- a/drivers/gpu/drm/i915/gt/intel_engine.h

> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h

> @@ -337,4 +337,13 @@ intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)

>         return intel_engine_has_preemption(engine);

>  }

>  

> +static inline bool

> +intel_engine_has_heartbeat(const struct intel_engine_cs *engine)

> +{

> +       if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))

> +               return false;

> +

> +       return READ_ONCE(engine->props.heartbeat_interval_ms);

> +}

> +

>  #endif /* _INTEL_RINGBUFFER_H_ */

> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c

> index 436ce368ddaa..0e813819b041 100644

> --- a/drivers/gpu/drm/i915/i915_request.c

> +++ b/drivers/gpu/drm/i915/i915_request.c

> @@ -542,8 +542,13 @@ bool __i915_request_submit(struct i915_request *request)

>         if (i915_request_completed(request))

>                 goto xfer;

>  

> +       if (unlikely(intel_context_is_closed(request->context) &&

> +                    !intel_engine_has_heartbeat(engine)))

> +               intel_context_set_banned(request->context);

> +

>         if (unlikely(intel_context_is_banned(request->context)))

>                 i915_request_set_error_once(request, -EIO);

> +

>         if (unlikely(fatal_error(request->fence.error)))

>                 __i915_request_skip(request);

>  

> -- 

> 2.20.1

>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 08e2c000dcc3..7c3a1012e702 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -337,4 +337,13 @@  intel_engine_has_preempt_reset(const struct intel_engine_cs *engine)
 	return intel_engine_has_preemption(engine);
 }
 
+static inline bool
+intel_engine_has_heartbeat(const struct intel_engine_cs *engine)
+{
+	if (!IS_ACTIVE(CONFIG_DRM_I915_HEARTBEAT_INTERVAL))
+		return false;
+
+	return READ_ONCE(engine->props.heartbeat_interval_ms);
+}
+
 #endif /* _INTEL_RINGBUFFER_H_ */
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 436ce368ddaa..0e813819b041 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -542,8 +542,13 @@  bool __i915_request_submit(struct i915_request *request)
 	if (i915_request_completed(request))
 		goto xfer;
 
+	if (unlikely(intel_context_is_closed(request->context) &&
+		     !intel_engine_has_heartbeat(engine)))
+		intel_context_set_banned(request->context);
+
 	if (unlikely(intel_context_is_banned(request->context)))
 		i915_request_set_error_once(request, -EIO);
+
 	if (unlikely(fatal_error(request->fence.error)))
 		__i915_request_skip(request);