diff mbox series

drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error

Message ID 20230329140445.2180662-1-konrad.dybcio@linaro.org
State Superseded
Headers show
Series drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error | expand

Commit Message

Konrad Dybcio March 29, 2023, 2:04 p.m. UTC
If we fail to initialize the GPU for whatever reason (say we don't
embed the GPU firmware files in the initrd), the error path involves
pm_runtime_put_sync() which then calls idle() instead of suspend().

This is suboptimal, as it means that we're not going through the
clean shutdown sequence. With at least A619_holi, this makes the GPU
not wake up until it goes through at least one more start-fail-stop
cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
shutdown.

Test cases:
1. firmware baked into kernel
2. error loading fw in initrd -> load from rootfs at DE start

Both succeed on A619_holi (SM6375) and A630 (SDM845).

Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load")
Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
---
 drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Johan Hovold March 29, 2023, 2:37 p.m. UTC | #1
On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote:
> If we fail to initialize the GPU for whatever reason (say we don't
> embed the GPU firmware files in the initrd), the error path involves
> pm_runtime_put_sync() which then calls idle() instead of suspend().
> 
> This is suboptimal, as it means that we're not going through the
> clean shutdown sequence. With at least A619_holi, this makes the GPU
> not wake up until it goes through at least one more start-fail-stop
> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
> shutdown.

This does not sound right. If pm_runtime_put_sync() fails to suspend the
device when the usage count drops to zero, then you have a bug somewhere
else.

Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware
before bringing up the hardware") the firmware is loaded before even
hitting these paths so the above description does not sound right in
that respect either (or is missing some details).

> Test cases:
> 1. firmware baked into kernel
> 2. error loading fw in initrd -> load from rootfs at DE start
> 
> Both succeed on A619_holi (SM6375) and A630 (SDM845).
> 
> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load")
> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> ---
>  drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> index f61896629be6..59f3302e8167 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
>  	return gpu;
>  
>  err_put_rpm:
> -	pm_runtime_put_sync(&pdev->dev);
> +	pm_runtime_put_sync_suspend(&pdev->dev);
>  err_disable_rpm:
>  	pm_runtime_disable(&pdev->dev);

Johan
Konrad Dybcio March 29, 2023, 3:48 p.m. UTC | #2
On 29.03.2023 16:37, Johan Hovold wrote:
> On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote:
>> If we fail to initialize the GPU for whatever reason (say we don't
>> embed the GPU firmware files in the initrd), the error path involves
>> pm_runtime_put_sync() which then calls idle() instead of suspend().
>>
>> This is suboptimal, as it means that we're not going through the
>> clean shutdown sequence. With at least A619_holi, this makes the GPU
>> not wake up until it goes through at least one more start-fail-stop
>> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
>> shutdown.
> 
> This does not sound right. If pm_runtime_put_sync() fails to suspend the
> device when the usage count drops to zero, then you have a bug somewhere
> else.
I was surprised to see that it was not called as well, but I wasn't able
to track it down before..

> 
> Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware
> before bringing up the hardware") the firmware is loaded before even
> hitting these paths so the above description does not sound right in
> that respect either (or is missing some details).
..but I did some more digging and I found that the precise "firmware"
that fails is the ZAP blob, which is not checked like SQE in the
commit you mentioned!

Now I don't think that we can easily check for it as-is since
zap_shader_load_mdt() does the entire find-load-authenticate
dance which is required with secure assets, but it's obviously
possible to rip out the find-load part of that and go on from
there.

Do you think that would be a better solution?

Konrad

> 
>> Test cases:
>> 1. firmware baked into kernel
>> 2. error loading fw in initrd -> load from rootfs at DE start
>>
>> Both succeed on A619_holi (SM6375) and A630 (SDM845).
>>
>> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load")
>> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
>> ---
>>  drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
>> index f61896629be6..59f3302e8167 100644
>> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
>> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
>> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
>>  	return gpu;
>>  
>>  err_put_rpm:
>> -	pm_runtime_put_sync(&pdev->dev);
>> +	pm_runtime_put_sync_suspend(&pdev->dev);
>>  err_disable_rpm:
>>  	pm_runtime_disable(&pdev->dev);
> 
> Johan
Konrad Dybcio March 29, 2023, 5:31 p.m. UTC | #3
On 29.03.2023 19:30, Rob Clark wrote:
> On Wed, Mar 29, 2023 at 8:48 AM Konrad Dybcio <konrad.dybcio@linaro.org> wrote:
>>
>>
>>
>> On 29.03.2023 16:37, Johan Hovold wrote:
>>> On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote:
>>>> If we fail to initialize the GPU for whatever reason (say we don't
>>>> embed the GPU firmware files in the initrd), the error path involves
>>>> pm_runtime_put_sync() which then calls idle() instead of suspend().
>>>>
>>>> This is suboptimal, as it means that we're not going through the
>>>> clean shutdown sequence. With at least A619_holi, this makes the GPU
>>>> not wake up until it goes through at least one more start-fail-stop
>>>> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
>>>> shutdown.
>>>
>>> This does not sound right. If pm_runtime_put_sync() fails to suspend the
>>> device when the usage count drops to zero, then you have a bug somewhere
>>> else.
>> I was surprised to see that it was not called as well, but I wasn't able
>> to track it down before..
>>
>>>
>>> Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware
>>> before bringing up the hardware") the firmware is loaded before even
>>> hitting these paths so the above description does not sound right in
>>> that respect either (or is missing some details).
>> ..but I did some more digging and I found that the precise "firmware"
>> that fails is the ZAP blob, which is not checked like SQE in the
>> commit you mentioned!
>>
>> Now I don't think that we can easily check for it as-is since
>> zap_shader_load_mdt() does the entire find-load-authenticate
>> dance which is required with secure assets, but it's obviously
>> possible to rip out the find-load part of that and go on from
>> there.
>>
>> Do you think that would be a better solution?
> 
> Hmm, to hit this it sounds like you'd need all the fw _except_ the zap
> in the initrd?
Correct.

Konrad
> 
> BR,
> -R
> 
>> Konrad
>>
>>>
>>>> Test cases:
>>>> 1. firmware baked into kernel
>>>> 2. error loading fw in initrd -> load from rootfs at DE start
>>>>
>>>> Both succeed on A619_holi (SM6375) and A630 (SDM845).
>>>>
>>>> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load")
>>>> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
>>>> ---
>>>>  drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
>>>> index f61896629be6..59f3302e8167 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
>>>> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
>>>>      return gpu;
>>>>
>>>>  err_put_rpm:
>>>> -    pm_runtime_put_sync(&pdev->dev);
>>>> +    pm_runtime_put_sync_suspend(&pdev->dev);
>>>>  err_disable_rpm:
>>>>      pm_runtime_disable(&pdev->dev);
>>>
>>> Johan
Dmitry Baryshkov March 29, 2023, 7:45 p.m. UTC | #4
On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio <konrad.dybcio@linaro.org> wrote:
>
>
>
> On 29.03.2023 16:37, Johan Hovold wrote:
> > On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote:
> >> If we fail to initialize the GPU for whatever reason (say we don't
> >> embed the GPU firmware files in the initrd), the error path involves
> >> pm_runtime_put_sync() which then calls idle() instead of suspend().
> >>
> >> This is suboptimal, as it means that we're not going through the
> >> clean shutdown sequence. With at least A619_holi, this makes the GPU
> >> not wake up until it goes through at least one more start-fail-stop
> >> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
> >> shutdown.
> >
> > This does not sound right. If pm_runtime_put_sync() fails to suspend the
> > device when the usage count drops to zero, then you have a bug somewhere
> > else.
> I was surprised to see that it was not called as well, but I wasn't able
> to track it down before..

Could you please check that it's autosuspend who kicks in? In other
words, if we disable autosuspend, the pm_runtime_put_sync is enough()?

That would probably mean that we lack some kind of reset in the hw_init path.

On the other hand, I do not know how the device will react to the
error-in-the-middle state. Modems for example, can enter the state
where you can not properly turn it off once it starts the boot
process.

And if we remember the efforts that Akhil has put into making sure
that the GPU is properly reset in case of an _error_, it might be
nearly impossible to shut it down in a proper way.

Thus said, I think that unless there is an obvious way to restart the
init process, Korad's pm_runtime_put_sync_suspend() looks like a
correct fix to me.

> > Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware
> > before bringing up the hardware") the firmware is loaded before even
> > hitting these paths so the above description does not sound right in
> > that respect either (or is missing some details).
> ..but I did some more digging and I found that the precise "firmware"
> that fails is the ZAP blob, which is not checked like SQE in the
> commit you mentioned!
>
> Now I don't think that we can easily check for it as-is since
> zap_shader_load_mdt() does the entire find-load-authenticate
> dance which is required with secure assets, but it's obviously
> possible to rip out the find-load part of that and go on from
> there.

Yes, I think we should load all firmware early. ZAP shader is a bit
unique since the DT can override the name, but it might be nice to
check for its presence earlier.

At the same time it probably should not stop us from fixing the idle()
vs suspend() bug.

>
> Do you think that would be a better solution?
>
> Konrad
>
> >
> >> Test cases:
> >> 1. firmware baked into kernel
> >> 2. error loading fw in initrd -> load from rootfs at DE start
> >>
> >> Both succeed on A619_holi (SM6375) and A630 (SDM845).
> >>
> >> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load")
> >> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> >> ---
> >>  drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> index f61896629be6..59f3302e8167 100644
> >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
> >> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
> >>      return gpu;
> >>
> >>  err_put_rpm:
> >> -    pm_runtime_put_sync(&pdev->dev);
> >> +    pm_runtime_put_sync_suspend(&pdev->dev);
> >>  err_disable_rpm:
> >>      pm_runtime_disable(&pdev->dev);
> >
> > Johan
Konrad Dybcio March 30, 2023, 2:34 p.m. UTC | #5
On 29.03.2023 21:45, Dmitry Baryshkov wrote:
> On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio <konrad.dybcio@linaro.org> wrote:
>>
>>
>>
>> On 29.03.2023 16:37, Johan Hovold wrote:
>>> On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote:
>>>> If we fail to initialize the GPU for whatever reason (say we don't
>>>> embed the GPU firmware files in the initrd), the error path involves
>>>> pm_runtime_put_sync() which then calls idle() instead of suspend().
>>>>
>>>> This is suboptimal, as it means that we're not going through the
>>>> clean shutdown sequence. With at least A619_holi, this makes the GPU
>>>> not wake up until it goes through at least one more start-fail-stop
>>>> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
>>>> shutdown.
>>>
>>> This does not sound right. If pm_runtime_put_sync() fails to suspend the
>>> device when the usage count drops to zero, then you have a bug somewhere
>>> else.
>> I was surprised to see that it was not called as well, but I wasn't able
>> to track it down before..
> 
> Could you please check that it's autosuspend who kicks in? In other
> words, if we disable autosuspend, the pm_runtime_put_sync is enough()?
> 
> That would probably mean that we lack some kind of reset in the hw_init path.
> 
> On the other hand, I do not know how the device will react to the
> error-in-the-middle state. Modems for example, can enter the state
> where you can not properly turn it off once it starts the boot
> process.
> 
> And if we remember the efforts that Akhil has put into making sure
> that the GPU is properly reset in case of an _error_, it might be
> nearly impossible to shut it down in a proper way.
> 
> Thus said, I think that unless there is an obvious way to restart the
> init process, Korad's pm_runtime_put_sync_suspend() looks like a
> correct fix to me.
On the GPU side, when you cut GX and CX GDSCs, the hardware is off.
Some clock / gdsc logic may be retained, but the GPU itself gets
cut off. Parking the clocks and shuttting down VDD_GX (if exists)
only makes that stronger.

> 
>>> Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware
>>> before bringing up the hardware") the firmware is loaded before even
>>> hitting these paths so the above description does not sound right in
>>> that respect either (or is missing some details).
>> ..but I did some more digging and I found that the precise "firmware"
>> that fails is the ZAP blob, which is not checked like SQE in the
>> commit you mentioned!
>>
>> Now I don't think that we can easily check for it as-is since
>> zap_shader_load_mdt() does the entire find-load-authenticate
>> dance which is required with secure assets, but it's obviously
>> possible to rip out the find-load part of that and go on from
>> there.
> 
> Yes, I think we should load all firmware early. ZAP shader is a bit
> unique since the DT can override the name, but it might be nice to
> check for its presence earlier.
> 
> At the same time it probably should not stop us from fixing the idle()
> vs suspend() bug.
I'm open to both solutions, as long as it can unblock me from
resubmitting the (hopefully) final version of GMU wrapper!

Konrad
> 
>>
>> Do you think that would be a better solution?
>>
>> Konrad
>>
>>>
>>>> Test cases:
>>>> 1. firmware baked into kernel
>>>> 2. error loading fw in initrd -> load from rootfs at DE start
>>>>
>>>> Both succeed on A619_holi (SM6375) and A630 (SDM845).
>>>>
>>>> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load")
>>>> Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
>>>> ---
>>>>  drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
>>>> index f61896629be6..59f3302e8167 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
>>>> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
>>>>      return gpu;
>>>>
>>>>  err_put_rpm:
>>>> -    pm_runtime_put_sync(&pdev->dev);
>>>> +    pm_runtime_put_sync_suspend(&pdev->dev);
>>>>  err_disable_rpm:
>>>>      pm_runtime_disable(&pdev->dev);
>>>
>>> Johan
> 
> 
>
Johan Hovold March 30, 2023, 2:49 p.m. UTC | #6
On Wed, Mar 29, 2023 at 10:45:52PM +0300, Dmitry Baryshkov wrote:
> On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio <konrad.dybcio@linaro.org> wrote:
> > On 29.03.2023 16:37, Johan Hovold wrote:
> > > On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote:
> > >> If we fail to initialize the GPU for whatever reason (say we don't
> > >> embed the GPU firmware files in the initrd), the error path involves
> > >> pm_runtime_put_sync() which then calls idle() instead of suspend().
> > >>
> > >> This is suboptimal, as it means that we're not going through the
> > >> clean shutdown sequence. With at least A619_holi, this makes the GPU
> > >> not wake up until it goes through at least one more start-fail-stop
> > >> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
> > >> shutdown.
> > >
> > > This does not sound right. If pm_runtime_put_sync() fails to suspend the
> > > device when the usage count drops to zero, then you have a bug somewhere
> > > else.
> > I was surprised to see that it was not called as well, but I wasn't able
> > to track it down before..
> 
> Could you please check that it's autosuspend who kicks in? In other
> words, if we disable autosuspend, the pm_runtime_put_sync is enough()?

Yes, that's it. The runtime PM implementation changed at one point and
since you need to disable autosuspend first to actually get synchronous
behaviour. My bad.

> That would probably mean that we lack some kind of reset in the hw_init path.
> 
> On the other hand, I do not know how the device will react to the
> error-in-the-middle state. Modems for example, can enter the state
> where you can not properly turn it off once it starts the boot
> process.
> 
> And if we remember the efforts that Akhil has put into making sure
> that the GPU is properly reset in case of an _error_, it might be
> nearly impossible to shut it down in a proper way.
> 
> Thus said, I think that unless there is an obvious way to restart the
> init process, Korad's pm_runtime_put_sync_suspend() looks like a
> correct fix to me.

I'd prefer to fix this by disabling autosuspend, but as that would
involve also moving the call to enable autosuspend to this function (and
add the missing disable on unbind), Konrad's patch using
pm_runtime_put_sync_suspend() is probably the best option for now. I can
send a patch to move the autosuspend handling on top.

Perhaps you can just amend the commit message to clarify that not all fw
is apparently preloaded and also mention that you need to use
pm_runtime_put_sync_suspend() due to autosuspend being enabled.

Johan
Dmitry Baryshkov March 30, 2023, 2:52 p.m. UTC | #7
On 30/03/2023 17:34, Konrad Dybcio wrote:
> 
> 
> On 29.03.2023 21:45, Dmitry Baryshkov wrote:
>> On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio <konrad.dybcio@linaro.org> wrote:
>>>
>>>
>>>
>>> On 29.03.2023 16:37, Johan Hovold wrote:
>>>> On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote:
>>>>> If we fail to initialize the GPU for whatever reason (say we don't
>>>>> embed the GPU firmware files in the initrd), the error path involves
>>>>> pm_runtime_put_sync() which then calls idle() instead of suspend().
>>>>>
>>>>> This is suboptimal, as it means that we're not going through the
>>>>> clean shutdown sequence. With at least A619_holi, this makes the GPU
>>>>> not wake up until it goes through at least one more start-fail-stop
>>>>> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean
>>>>> shutdown.
>>>>
>>>> This does not sound right. If pm_runtime_put_sync() fails to suspend the
>>>> device when the usage count drops to zero, then you have a bug somewhere
>>>> else.
>>> I was surprised to see that it was not called as well, but I wasn't able
>>> to track it down before..
>>
>> Could you please check that it's autosuspend who kicks in? In other
>> words, if we disable autosuspend, the pm_runtime_put_sync is enough()?
>>
>> That would probably mean that we lack some kind of reset in the hw_init path.
>>
>> On the other hand, I do not know how the device will react to the
>> error-in-the-middle state. Modems for example, can enter the state
>> where you can not properly turn it off once it starts the boot
>> process.
>>
>> And if we remember the efforts that Akhil has put into making sure
>> that the GPU is properly reset in case of an _error_, it might be
>> nearly impossible to shut it down in a proper way.
>>
>> Thus said, I think that unless there is an obvious way to restart the
>> init process, Korad's pm_runtime_put_sync_suspend() looks like a
>> correct fix to me.
> On the GPU side, when you cut GX and CX GDSCs, the hardware is off.
> Some clock / gdsc logic may be retained, but the GPU itself gets
> cut off. Parking the clocks and shuttting down VDD_GX (if exists)
> only makes that stronger.

If I remember correctly, GX and CX GPU GDSCs might be voted by other 
users. Again, I'd direct you here to the series at [1]

[1]: https://patchwork.freedesktop.org/series/111966/
diff mbox series

Patch

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index f61896629be6..59f3302e8167 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -477,7 +477,7 @@  struct msm_gpu *adreno_load_gpu(struct drm_device *dev)
 	return gpu;
 
 err_put_rpm:
-	pm_runtime_put_sync(&pdev->dev);
+	pm_runtime_put_sync_suspend(&pdev->dev);
 err_disable_rpm:
 	pm_runtime_disable(&pdev->dev);