[0/2] Fix a black screen on sc7180 Trogdor devices

Message ID	20240327202740.3075378-1-swboyd@chromium.org
Headers	show Received: from mail-pf1-f172.google.com (mail-pf1-f172.google.com [209.85.210.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5A89814F9F8 for <linux-arm-msm@vger.kernel.org>; Wed, 27 Mar 2024 20:27:44 +0000 (UTC) From: Stephen Boyd <swboyd@chromium.org> To: Michael Turquette <mturquette@baylibre.com>, Stephen Boyd <sboyd@kernel.org>, Bjorn Andersson <andersson@kernel.org> Cc: linux-kernel@vger.kernel.org, linux-clk@vger.kernel.org, patches@lists.linux.dev, linux-arm-msm@vger.kernel.org, Dmitry Baryshkov <dmitry.baryshkov@linaro.org>, Douglas Anderson <dianders@chromium.org>, Taniya Das <quic_tdas@quicinc.com>, Laura Nao <laura.nao@collabora.com> Subject: [PATCH 0/2] Fix a black screen on sc7180 Trogdor devices Date: Wed, 27 Mar 2024 13:27:36 -0700 Message-ID: <20240327202740.3075378-1-swboyd@chromium.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Fix a black screen on sc7180 Trogdor devices \| expand [0/2] Fix a black screen on sc7180 Trogdor devices [2/2] clk: qcom: Properly initialize shared RCGs upon registration

Stephen Boyd March 27, 2024, 8:27 p.m. UTC

This patch series fixes a black screen seen at boot on Trogdor devices.
The details of that problem are in the second patch, but the TL;DR is
that shared RCGs report the wrong parent to the clk framework and shared
RCGs never get turned off if they're left force enabled out of boot,
wedging the display GDSC causing the display bridge to fail to probe
because it can't turn on DSI.

The first patch is basically a hack. It avoids a problem where the mdss
driver probes, turns on and off the mdp clk, and hangs the rotator clk
because the rotator clk is using the same parent. We don't care about
this case on sc7180, so we simply disable the clk at driver probe so it
can't get stuck on.

The second patch fixes the shared RCG implementation so that the parent
is properly reported and so that the force enable bit is cleared. Fixing
the parent causes the problem that the first patch is avoiding, which is
why that patch is first. Just applying this second patch will make it so
that disabling the mdp clk disables the display pll that the rotator clk
is also using, causing the rotator clk to be stuck on.

This problem comes about because of a combination of issues. The clk
framework doesn't handle the case where two clks share the same parent
and are enabled at boot. The first clk to enable and disable will turn
off the parent. The mdss driver doesn't stay out of runtime suspend
while populating the child devices. In fact, of_platform_populate()
triggers runtime resume and suspend of the mdss device multiple times
while devices are being added. These patches side-step the larger issues
here with the goal of fixing Trogdor in the short-term. Long-term we
need to fix the clk framework and display driver so that shared parents
aren't disabled during boot and so that mdss can't runtime suspend until
the display pipeline/card is fully formed.

Stephen Boyd (2):
  clk: qcom: dispcc-sc7180: Force off rotator clk at probe
  clk: qcom: Properly initialize shared RCGs upon registration

 drivers/clk/qcom/clk-rcg2.c      | 19 +++++++++++++++++++
 drivers/clk/qcom/dispcc-sc7180.c | 14 ++++++++++++++
 2 files changed, 33 insertions(+)

Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Taniya Das <quic_tdas@quicinc.com>
Cc: Laura Nao <laura.nao@collabora.com>

base-commit: e8f897f4afef0031fe618a8e94127a0934896aba

Doug Anderson March 28, 2024, 4:39 p.m. UTC | #1

Hi,

On Wed, Mar 27, 2024 at 1:27 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> This patch series fixes a black screen seen at boot on Trogdor devices.
> The details of that problem are in the second patch, but the TL;DR is
> that shared RCGs report the wrong parent to the clk framework and shared
> RCGs never get turned off if they're left force enabled out of boot,
> wedging the display GDSC causing the display bridge to fail to probe
> because it can't turn on DSI.
>
> The first patch is basically a hack. It avoids a problem where the mdss
> driver probes, turns on and off the mdp clk, and hangs the rotator clk
> because the rotator clk is using the same parent. We don't care about
> this case on sc7180, so we simply disable the clk at driver probe so it
> can't get stuck on.
>
> The second patch fixes the shared RCG implementation so that the parent
> is properly reported and so that the force enable bit is cleared. Fixing
> the parent causes the problem that the first patch is avoiding, which is
> why that patch is first. Just applying this second patch will make it so
> that disabling the mdp clk disables the display pll that the rotator clk
> is also using, causing the rotator clk to be stuck on.
>
> This problem comes about because of a combination of issues. The clk
> framework doesn't handle the case where two clks share the same parent
> and are enabled at boot. The first clk to enable and disable will turn
> off the parent. The mdss driver doesn't stay out of runtime suspend
> while populating the child devices. In fact, of_platform_populate()
> triggers runtime resume and suspend of the mdss device multiple times
> while devices are being added. These patches side-step the larger issues
> here with the goal of fixing Trogdor in the short-term. Long-term we
> need to fix the clk framework and display driver so that shared parents
> aren't disabled during boot and so that mdss can't runtime suspend until
> the display pipeline/card is fully formed.
>
> Stephen Boyd (2):
>   clk: qcom: dispcc-sc7180: Force off rotator clk at probe
>   clk: qcom: Properly initialize shared RCGs upon registration
>
>  drivers/clk/qcom/clk-rcg2.c      | 19 +++++++++++++++++++
>  drivers/clk/qcom/dispcc-sc7180.c | 14 ++++++++++++++
>  2 files changed, 33 insertions(+)

I spent a bunch of time discussing this offline with Stephen and I'll
try to summarize. Hopefully this isn't too much nonsense...

1. We'll likely land the patches downstream in ChromeOS for now while
we're figuring things out since we're seeing actual breakages. Whether
to land upstream is a question. The first patch is a bit of a hack but
unlikely to cause any real problems. The second patch seems correct
but it also feels like it's going to cause stuck clocks for a pile of
other SoCs because we're not adding hacks similar to the sc7180 hack
for all the other SoCs. I guess we could hope we get lucky or play
whack-a-mole? ...or we try to find a more generic solution... Dunno
what others think.

2. One thing we talked about would be adding something like
`CLK_OPS_PARENT_ENABLE_FOR_DISABLE` for all the RCGs. That should get
rid of the actual error we're seeing. Essentially what we're seeing
today is:

* RCG1 is parented on a PLL
* RCG2 is parented on the same PLL
* RCG1, RCG2, and the PLL are left enabled by the BIOS

When we enable/disable RCG1 then the PLL turns off (since we propagate
our disable to our parent and nothing else is keeping the PLL on). At
this time RCG2 is implicitly off (it's not producing a clock) since
its parent (the PLL) is off. ...but the hardware "force enable" bit
for RCG2 is still set to on and the kernel still thinks the child of
this clock is on.

If, after this, we "disabled" a branch clock child of RCG2 (because
it's unused and we get to the part of the kernel that disables unused
clocks) then we were getting the error since you can't change the
hardware "enable" bit for a branch clock if its parent is not
clocking.

...so `CLK_OPS_PARENT_ENABLE_FOR_DISABLE` would fix this specific case
because we'd turn the PLL on for the duration of the "disable" call.

...but there is still the concern here that there will be a period of
time that the RCG2 child is "stuck" even if we're not paying attention
to it. This would be the period of time between when we turned the PLL
off and we got around to disabling the RCG2 child because it was
unused. Stephen thought that perhaps something else in hardware
(perhaps a GDSC) might notice that the RCG2 child's hardware "enable"
bit was set and would assume that it was clocking. Then that other bit
of hardware would be unhappy because no clock came out. This concern
made us think that perhaps `CLK_OPS_PARENT_ENABLE_FOR_DISABLE` isn't
the right way to go.

3. Another idea was essentially to implement the long talked about
idea of "handoff". Essentially check the state of the clocks at boot
and if they're enabled then propagate the enable to our parents.

If we implement this then it would solve the problem because RCG1 and
RCG2 would add an implicit request for the PLL to be on. If we
enable/disable RCG1 at boot then the PLL will still stay on because
there's a request from RCG2. If/when we disable children of RCG2 then
the PLL will lose its request but that's fine. In no cases will we
have a hardware "enable" bit set without the parent.

This seems solid and probably the right solution.

Stephen was a bit concerned, though, because you don't know when all
your children have been registered. If a child shows up after "disable
unused" runs then you can get back into the situation again. That
probably isn't concern for the immediate problem at hand because all
the clocks involved show up in early boot, but it means that the
problem is half solved and that's not super satisfying.

To solve this, we need to overall solve the "disable_unused" problem
to be at the right time, after everything has had a chance to probe.
To me this feels a bit like the "sync state" problem with all the
baggage involved there. Specifically (like sync state) I think it
would have to be heavily based on analysis of the device tree and
links and that has the standard problem where you never enter sync
state when DT has a node "enabled" but you didn't happen to enable the
relevant driver in your kernel config. ...though I guess we've "sorta"
solved that with the timeout. NOTE: if I understand correctly this
would only be possible if drivers are using "struct clk_parent_data"
and not if they're just using global names to match clocks.

I'll also note that in general I believe Stephen isn't a fan of
"disable_unused", but (my opinion) is that it's important to have
something in the kernel that disables unused clocks when everything is
done loading. Without this we add a lot of complexity to the firmware
to turn off all the clocks that the SoC might have had on by default
and that's non-ideal.

-Doug

Stephen Boyd May 1, 2024, 12:17 a.m. UTC | #2

Quoting Doug Anderson (2024-03-28 09:39:54)
>
> I spent a bunch of time discussing this offline with Stephen and I'll
> try to summarize. Hopefully this isn't too much nonsense...
>
> 1. We'll likely land the patches downstream in ChromeOS for now while
> we're figuring things out since we're seeing actual breakages. Whether
> to land upstream is a question. The first patch is a bit of a hack but
> unlikely to cause any real problems. The second patch seems correct
> but it also feels like it's going to cause stuck clocks for a pile of
> other SoCs because we're not adding hacks similar to the sc7180 hack
> for all the other SoCs. I guess we could hope we get lucky or play
> whack-a-mole? ...or we try to find a more generic solution... Dunno
> what others think.

I think we should hope to get lucky or play whack-a-mole and merge
something like this series. If we have to we can similarly turn off RCGs
or branches during driver probe that are using shared parents like we
have on sc7180.

Put simply, the shared RCG implementation is broken because it reports
the wrong parent for clk_ops::get_parent() and doesn't clear the force
enable bit. With the current code we're switching the parent to XO when
the clk is enabled the first time. That's an obvious bug that we should
fix regardless of implementing proper clk handoff. We haven't
implemented handoff in over a decade. Blocking this bug fix on
implementing handoff isn't practical. Furthermore, we're relying on clk
consumers to clear that force enable bit by enabling the clk once. That
doesn't make any sense, although we could use that force enable bit to
consider the RCG as enabled for clk_disable_unused.

An alternative approach to this series would be to force all shared RCGs
to be parented to XO at clk registration time, and continue to clear
that RCG force enable bit. That's sort of what Dmitry was going for
earlier. Doing this would break anything that's relying on the clks
staying enabled at some frequency through boot, but that isn't supported
anyway because clk handoff isn't implemented. It avoids the problem that
the first patch is for too because XO doesn't turn off causing a clk to
get stuck on. I can certainly craft this patch up if folks think that's
better.

To ease the transition we can make a new clk_ops for the RCG as well so
that each SoC has to opt-in to use this behavior. Then we can be certain
that other platforms aren't affected without being tested first. I'd
prefer to not do that though because I fear we'll be leaving drivers in
the broken state for some time.

Doug Anderson May 1, 2024, 4:18 p.m. UTC | #3

Hi,

On Tue, Apr 30, 2024 at 5:17 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Doug Anderson (2024-03-28 09:39:54)
> >
> > I spent a bunch of time discussing this offline with Stephen and I'll
> > try to summarize. Hopefully this isn't too much nonsense...
> >
> > 1. We'll likely land the patches downstream in ChromeOS for now while
> > we're figuring things out since we're seeing actual breakages. Whether
> > to land upstream is a question. The first patch is a bit of a hack but
> > unlikely to cause any real problems. The second patch seems correct
> > but it also feels like it's going to cause stuck clocks for a pile of
> > other SoCs because we're not adding hacks similar to the sc7180 hack
> > for all the other SoCs. I guess we could hope we get lucky or play
> > whack-a-mole? ...or we try to find a more generic solution... Dunno
> > what others think.
>
> I think we should hope to get lucky or play whack-a-mole and merge
> something like this series. If we have to we can similarly turn off RCGs
> or branches during driver probe that are using shared parents like we
> have on sc7180.

This is OK w/ me, but of course I'm super biased since the only
Qualcomm platform I'm involved in is sc7180 Chromebooks and that's
handled by your series. If it helps, I suppose you could add:

Reviewed-by: Douglas Anderson <dianders@chromium.org>

IMO it would be good to get Bjorn or Dmitry to buy in and maybe post a
PSA and/or request for testing to a few IRC channels where folks hang
out (#linux-msm, #freedreno and #aarch64-laptops, maybe?)


-Doug

Dmitry Baryshkov May 1, 2024, 6:11 p.m. UTC | #4

On Wed, 1 May 2024 at 03:17, Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Doug Anderson (2024-03-28 09:39:54)
> >
> > I spent a bunch of time discussing this offline with Stephen and I'll
> > try to summarize. Hopefully this isn't too much nonsense...
> >
> > 1. We'll likely land the patches downstream in ChromeOS for now while
> > we're figuring things out since we're seeing actual breakages. Whether
> > to land upstream is a question. The first patch is a bit of a hack but
> > unlikely to cause any real problems. The second patch seems correct
> > but it also feels like it's going to cause stuck clocks for a pile of
> > other SoCs because we're not adding hacks similar to the sc7180 hack
> > for all the other SoCs. I guess we could hope we get lucky or play
> > whack-a-mole? ...or we try to find a more generic solution... Dunno
> > what others think.
>
> I think we should hope to get lucky or play whack-a-mole and merge
> something like this series. If we have to we can similarly turn off RCGs
> or branches during driver probe that are using shared parents like we
> have on sc7180.
>
> Put simply, the shared RCG implementation is broken because it reports
> the wrong parent for clk_ops::get_parent() and doesn't clear the force
> enable bit. With the current code we're switching the parent to XO when
> the clk is enabled the first time. That's an obvious bug that we should
> fix regardless of implementing proper clk handoff. We haven't
> implemented handoff in over a decade. Blocking this bug fix on
> implementing handoff isn't practical.

Yes, that needs to be fixed. My approach was to drop the XO parent and
consider the clock to be off if it is fed by the XO.

> Furthermore, we're relying on clk
> consumers to clear that force enable bit by enabling the clk once. That
> doesn't make any sense, although we could use that force enable bit to
> consider the RCG as enabled for clk_disable_unused.

That patch seems fine to me. Will it work if we take the force-enable
bit into account when considering the clock to be on or off?

>
> An alternative approach to this series would be to force all shared RCGs
> to be parented to XO at clk registration time, and continue to clear
> that RCG force enable bit. That's sort of what Dmitry was going for
> earlier. Doing this would break anything that's relying on the clks
> staying enabled at some frequency through boot, but that isn't supported
> anyway because clk handoff isn't implemented. It avoids the problem that
> the first patch is for too because XO doesn't turn off causing a clk to
> get stuck on. I can certainly craft this patch up if folks think that's
> better.

I think this approach makes sense too (and might be preferable to
boot-time hacks).
On most of the platforms we are already resetting the MDSS as soon as
the mdss (root device) is being probed. Then the display is going to
be broken until DPU collects all the coonectors and outpus and finally
creates the DRM device.

But I think we should fix the get_parent() too irrespectively of this.

> To ease the transition we can make a new clk_ops for the RCG as well so
> that each SoC has to opt-in to use this behavior. Then we can be certain
> that other platforms aren't affected without being tested first. I'd
> prefer to not do that though because I fear we'll be leaving drivers in
> the broken state for some time.

SGTM

--
With best wishes
Dmitry

Stephen Boyd May 2, 2024, 12:20 a.m. UTC | #5

Quoting Dmitry Baryshkov (2024-05-01 11:11:14)
> On Wed, 1 May 2024 at 03:17, Stephen Boyd <swboyd@chromium.org> wrote:
> >
> > Quoting Doug Anderson (2024-03-28 09:39:54)
> > >
> > > I spent a bunch of time discussing this offline with Stephen and I'll
> > > try to summarize. Hopefully this isn't too much nonsense...
> > >
> > > 1. We'll likely land the patches downstream in ChromeOS for now while
> > > we're figuring things out since we're seeing actual breakages. Whether
> > > to land upstream is a question. The first patch is a bit of a hack but
> > > unlikely to cause any real problems. The second patch seems correct
> > > but it also feels like it's going to cause stuck clocks for a pile of
> > > other SoCs because we're not adding hacks similar to the sc7180 hack
> > > for all the other SoCs. I guess we could hope we get lucky or play
> > > whack-a-mole? ...or we try to find a more generic solution... Dunno
> > > what others think.
> >
> > I think we should hope to get lucky or play whack-a-mole and merge
> > something like this series. If we have to we can similarly turn off RCGs
> > or branches during driver probe that are using shared parents like we
> > have on sc7180.
> >
> > Put simply, the shared RCG implementation is broken because it reports
> > the wrong parent for clk_ops::get_parent() and doesn't clear the force
> > enable bit. With the current code we're switching the parent to XO when
> > the clk is enabled the first time. That's an obvious bug that we should
> > fix regardless of implementing proper clk handoff. We haven't
> > implemented handoff in over a decade. Blocking this bug fix on
> > implementing handoff isn't practical.
>
> Yes, that needs to be fixed. My approach was to drop the XO parent and
> consider the clock to be off if it is fed by the XO.
>
> > Furthermore, we're relying on clk
> > consumers to clear that force enable bit by enabling the clk once. That
> > doesn't make any sense, although we could use that force enable bit to
> > consider the RCG as enabled for clk_disable_unused.
>
> That patch seems fine to me. Will it work if we take the force-enable
> bit into account when considering the clock to be on or off?

What is "that patch"?

It would work to consider the rcg clk as on or off. During rcg clk
registration if a branch child is found to be enabled we would go and
write the force enable bit in the parent rcg. And similarly we would
modify the rcg clk_ops to set that bit whenever the clk is enabled and
clear it whenever it is disabled. This avoids the feedback mechanism
from confusing us about the enable state of the rcg, at the cost of
writing the register.

It wouldn't fix the problem that we have on Trogdor though. That's
because the display GDSC is turned off when the rotator clk is enabled
and parented to the display PLL, which is also turned off. We have to
either turn off the rotator clk, or switch the parent to something that
is always on, XO, or keep the display GDSC on until the rotator clk can
be turned off. On other SoCs we may need to turn off even more clks
depending on which ones the display GDSC is controlling.

>
> >
> > An alternative approach to this series would be to force all shared RCGs
> > to be parented to XO at clk registration time, and continue to clear
> > that RCG force enable bit. That's sort of what Dmitry was going for
> > earlier. Doing this would break anything that's relying on the clks
> > staying enabled at some frequency through boot, but that isn't supported
> > anyway because clk handoff isn't implemented. It avoids the problem that
> > the first patch is for too because XO doesn't turn off causing a clk to
> > get stuck on. I can certainly craft this patch up if folks think that's
> > better.
>
> I think this approach makes sense too (and might be preferable to
> boot-time hacks).
> On most of the platforms we are already resetting the MDSS as soon as
> the mdss (root device) is being probed. Then the display is going to
> be broken until DPU collects all the coonectors and outpus and finally
> creates the DRM device.

Ok. I'm leaning towards this approach now because it seems like keeping
the clk at whatever it was set at boot isn't useful. If it becomes
useful at some point we can implement a better handoff mechanism. I have
some idea how to do that by using the 'clocks' DT property to find child
clks that haven't probed yet. I'll test out this alternate approach to
park shared clks at probe and send the patch.

>
> But I think we should fix the get_parent() too irrespectively of this.
>

Sure, but it becomes sorta moot if we force the parent to be XO.

[0/2] Fix a black screen on sc7180 Trogdor devices

Message

Comments