diff mbox series

soc: qcom: pmic_glink: Fix boot when QRTR=m

Message ID 20231213210644.8702-1-robdclark@gmail.com
State New
Headers show
Series soc: qcom: pmic_glink: Fix boot when QRTR=m | expand

Commit Message

Rob Clark Dec. 13, 2023, 9:06 p.m. UTC
From: Rob Clark <robdclark@chromium.org>

We need to bail out before adding/removing devices, if we are going
to -EPROBE_DEFER.  Otherwise boot will get stuck forever at
deferred_probe_initcall().

Suggested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/soc/qcom/pmic_glink.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Comments

Johan Hovold Dec. 14, 2023, 7:16 a.m. UTC | #1
On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> We need to bail out before adding/removing devices, if we are going
> to -EPROBE_DEFER.  Otherwise boot will get stuck forever at
> deferred_probe_initcall().

Can please you expand on why this is a problem here in the commit
message?

The aux devices appear to be tore down correctly in the probe error
paths so how exactly does that lead to deferred_probe_initcall() being
stuck? This sounds like we may have a problem elsewhere which this patch
is papering over.

Also where does the probe deferral come from in your case?
pdr_handle_alloc()?

If this is a correct fix, I'd also expect there to be a Fixes and
CC-stable tag.

Johan
Dmitry Baryshkov Dec. 14, 2023, 11:04 a.m. UTC | #2
On Thu, 14 Dec 2023 at 09:16, Johan Hovold <johan@kernel.org> wrote:
>
> On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > We need to bail out before adding/removing devices, if we are going
> > to -EPROBE_DEFER.  Otherwise boot will get stuck forever at
> > deferred_probe_initcall().
>
> Can please you expand on why this is a problem here in the commit
> message?
>
> The aux devices appear to be tore down correctly in the probe error
> paths so how exactly does that lead to deferred_probe_initcall() being
> stuck? This sounds like we may have a problem elsewhere which this patch
> is papering over.

This is a known problem. Successful probes during the probe deferral
loop causes the whole loop to be reiterated. Creating child devices
usually results in  a successful probe. Aso I thought that just
creating new device also causes a reprobe, but I can not find any
evidence now.

>
> Also where does the probe deferral come from in your case?
> pdr_handle_alloc()?
>
> If this is a correct fix, I'd also expect there to be a Fixes and
> CC-stable tag.
>
> Johan
>
Johan Hovold Dec. 14, 2023, 2:01 p.m. UTC | #3
On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote:
> On Thu, 14 Dec 2023 at 09:16, Johan Hovold <johan@kernel.org> wrote:
> >
> > On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote:
> > > From: Rob Clark <robdclark@chromium.org>
> > >
> > > We need to bail out before adding/removing devices, if we are going
> > > to -EPROBE_DEFER.  Otherwise boot will get stuck forever at
> > > deferred_probe_initcall().
> >
> > Can please you expand on why this is a problem here in the commit
> > message?
> >
> > The aux devices appear to be tore down correctly in the probe error
> > paths so how exactly does that lead to deferred_probe_initcall() being
> > stuck? This sounds like we may have a problem elsewhere which this patch
> > is papering over.
> 
> This is a known problem. Successful probes during the probe deferral
> loop causes the whole loop to be reiterated. Creating child devices
> usually results in  a successful probe. Aso I thought that just
> creating new device also causes a reprobe, but I can not find any
> evidence now.

This still needs to be described in the commit message.

Only a successful probe should trigger a reprobe, and when the child
devices are registered the parent is not yet on the deferred probe list.
So something is not right or missing here.

Johan
Johan Hovold Dec. 14, 2023, 2:09 p.m. UTC | #4
On Thu, Dec 14, 2023 at 04:04:49PM +0200, Dmitry Baryshkov wrote:
> On Thu, 14 Dec 2023 at 16:01, Johan Hovold <johan@kernel.org> wrote:
> > On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote:
> > > On Thu, 14 Dec 2023 at 09:16, Johan Hovold <johan@kernel.org> wrote:
> > > > On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote:
> > > > > From: Rob Clark <robdclark@chromium.org>
> > > > >
> > > > > We need to bail out before adding/removing devices, if we are going
> > > > > to -EPROBE_DEFER.  Otherwise boot will get stuck forever at
> > > > > deferred_probe_initcall().
> > > >
> > > > Can please you expand on why this is a problem here in the commit
> > > > message?
> > > >
> > > > The aux devices appear to be tore down correctly in the probe error
> > > > paths so how exactly does that lead to deferred_probe_initcall() being
> > > > stuck? This sounds like we may have a problem elsewhere which this patch
> > > > is papering over.
> > >
> > > This is a known problem. Successful probes during the probe deferral
> > > loop causes the whole loop to be reiterated. Creating child devices
> > > usually results in  a successful probe. Aso I thought that just
> > > creating new device also causes a reprobe, but I can not find any
> > > evidence now.
> >
> > This still needs to be described in the commit message.
> >
> > Only a successful probe should trigger a reprobe, and when the child
> > devices are registered the parent is not yet on the deferred probe list.
> > So something is not right or missing here.
> 
> Child devices can be successfully probed, then the parent gets
> -EPROBE_DEFER, removes children and then it goes on and on.

So what? As I described above, the successful probe of the children
should have nothing to do with whether the parent is reprobed.

If that isn't the case, then explain how.

Johan
Johan Hovold Dec. 14, 2023, 3:38 p.m. UTC | #5
On Thu, Dec 14, 2023 at 03:09:36PM +0100, Johan Hovold wrote:
> On Thu, Dec 14, 2023 at 04:04:49PM +0200, Dmitry Baryshkov wrote:
> > On Thu, 14 Dec 2023 at 16:01, Johan Hovold <johan@kernel.org> wrote:
> > > On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote:

> > > > This is a known problem. Successful probes during the probe deferral
> > > > loop causes the whole loop to be reiterated. Creating child devices
> > > > usually results in  a successful probe. Aso I thought that just
> > > > creating new device also causes a reprobe, but I can not find any
> > > > evidence now.
> > >
> > > This still needs to be described in the commit message.
> > >
> > > Only a successful probe should trigger a reprobe, and when the child
> > > devices are registered the parent is not yet on the deferred probe list.
> > > So something is not right or missing here.
> > 
> > Child devices can be successfully probed, then the parent gets
> > -EPROBE_DEFER, removes children and then it goes on and on.
> 
> So what? As I described above, the successful probe of the children
> should have nothing to do with whether the parent is reprobed.
> 
> If that isn't the case, then explain how.

I took a closer look at this and indeed we do have code that triggers a
reprobe of a device in case there was a successful probe while the
device was probing.

This was introduced by commit 58b116bce136 ("drivercore: deferral race
condition fix") and the workaround for the reprobe-loop bug that hack
led to is to not return -EPROBE_DEFER after registering child devices as
no one managed to come up with a proper fix. This was documented here:

	fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")

But please spell this out in some more detail in the commit message, and
add a Fixes and CC stable tag.

Johan
Rob Clark Dec. 14, 2023, 8:44 p.m. UTC | #6
On Thu, Dec 14, 2023 at 8:08 AM Johan Hovold <johan@kernel.org> wrote:
>
> On Thu, Dec 14, 2023 at 04:38:35PM +0100, Johan Hovold wrote:
>
> > I took a closer look at this and indeed we do have code that triggers a
> > reprobe of a device in case there was a successful probe while the
> > device was probing.
> >
> > This was introduced by commit 58b116bce136 ("drivercore: deferral race
> > condition fix") and the workaround for the reprobe-loop bug that hack
> > led to is to not return -EPROBE_DEFER after registering child devices as
> > no one managed to come up with a proper fix. This was documented here:
> >
> >       fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")
> >
> > But please spell this out in some more detail in the commit message, and
> > add a Fixes and CC stable tag.
>
> And please update the commit summary as I've been booting with QRTR=m
> all along just fine. I guess the issue is if you have pmic_glink
> built-in or in the initramfs but forgot to include qrtr or similar?

I do have both QRTR=m and QCOM_GLINK=m.  I'm honestly not sure what
started triggering this issue for me.. it seemed to have started after
merging msm-next + drm-misc-next on top of your
jhovold/wip/sc8280xp-v6.7-rc5 (the merged branches were based on -rc3
so this shouldn't have really brought in random non-drm things).
Maybe there is a timing element to it?  I felt like the problem was
obvious enough, and the exact details of why I started hitting this
were not important enough to spend time tracking down.

BR,
-R

> Johan
Johan Hovold Dec. 15, 2023, 7:50 a.m. UTC | #7
On Thu, Dec 14, 2023 at 12:44:10PM -0800, Rob Clark wrote:
> On Thu, Dec 14, 2023 at 8:08 AM Johan Hovold <johan@kernel.org> wrote:
> >
> > On Thu, Dec 14, 2023 at 04:38:35PM +0100, Johan Hovold wrote:
> >
> > > I took a closer look at this and indeed we do have code that triggers a
> > > reprobe of a device in case there was a successful probe while the
> > > device was probing.
> > >
> > > This was introduced by commit 58b116bce136 ("drivercore: deferral race
> > > condition fix") and the workaround for the reprobe-loop bug that hack
> > > led to is to not return -EPROBE_DEFER after registering child devices as
> > > no one managed to come up with a proper fix. This was documented here:
> > >
> > >       fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")
> > >
> > > But please spell this out in some more detail in the commit message, and
> > > add a Fixes and CC stable tag.
> >
> > And please update the commit summary as I've been booting with QRTR=m
> > all along just fine. I guess the issue is if you have pmic_glink
> > built-in or in the initramfs but forgot to include qrtr or similar?
> 
> I do have both QRTR=m and QCOM_GLINK=m.  I'm honestly not sure what
> started triggering this issue for me.. it seemed to have started after
> merging msm-next + drm-misc-next on top of your
> jhovold/wip/sc8280xp-v6.7-rc5 (the merged branches were based on -rc3
> so this shouldn't have really brought in random non-drm things).
> Maybe there is a timing element to it?

Yeah, possibly, and device links may also come into play here.

> I felt like the problem was obvious enough, and the exact details of
> why I started hitting this were not important enough to spend time
> tracking down.

The patch itself is of course fine itself as a clean up (or
microoptimisation) but the claim that it solves, and is the correct fix
for, a probe loop issue is not obvious at all and requires some further
justification.

Since he last time someone suggested reverting the commit that
introduced the probe-loop issue, the result was to leave things as they
were and just document the workaround, it should be fine to just refer
to that commit:

	fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")

Perhaps you can keep the Subject if you make it clear in the commit
message that this bug isn't always hit with QRTR=m.

Johan
diff mbox series

Patch

diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c
index 914057331afd..2899746af6a6 100644
--- a/drivers/soc/qcom/pmic_glink.c
+++ b/drivers/soc/qcom/pmic_glink.c
@@ -268,10 +268,16 @@  static int pmic_glink_probe(struct platform_device *pdev)
 	else
 		pg->client_mask = PMIC_GLINK_CLIENT_DEFAULT;
 
+	pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg);
+	if (IS_ERR(pg->pdr)) {
+		ret = dev_err_probe(&pdev->dev, PTR_ERR(pg->pdr), "failed to initialize pdr\n");
+		return ret;
+	}
+
 	if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI)) {
 		ret = pmic_glink_add_aux_device(pg, &pg->ucsi_aux, "ucsi");
 		if (ret)
-			return ret;
+			goto out_release_pdr_handle;
 	}
 	if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_ALTMODE)) {
 		ret = pmic_glink_add_aux_device(pg, &pg->altmode_aux, "altmode");
@@ -284,17 +290,11 @@  static int pmic_glink_probe(struct platform_device *pdev)
 			goto out_release_altmode_aux;
 	}
 
-	pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg);
-	if (IS_ERR(pg->pdr)) {
-		ret = dev_err_probe(&pdev->dev, PTR_ERR(pg->pdr), "failed to initialize pdr\n");
-		goto out_release_aux_devices;
-	}
-
 	service = pdr_add_lookup(pg->pdr, "tms/servreg", "msm/adsp/charger_pd");
 	if (IS_ERR(service)) {
 		ret = dev_err_probe(&pdev->dev, PTR_ERR(service),
 				    "failed adding pdr lookup for charger_pd\n");
-		goto out_release_pdr_handle;
+		goto out_release_aux_devices;
 	}
 
 	mutex_lock(&__pmic_glink_lock);
@@ -303,8 +303,6 @@  static int pmic_glink_probe(struct platform_device *pdev)
 
 	return 0;
 
-out_release_pdr_handle:
-	pdr_handle_release(pg->pdr);
 out_release_aux_devices:
 	if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_BATT))
 		pmic_glink_del_aux_device(pg, &pg->ps_aux);
@@ -314,6 +312,8 @@  static int pmic_glink_probe(struct platform_device *pdev)
 out_release_ucsi_aux:
 	if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI))
 		pmic_glink_del_aux_device(pg, &pg->ucsi_aux);
+out_release_pdr_handle:
+	pdr_handle_release(pg->pdr);
 
 	return ret;
 }