mbox series

[00/23] interconnect: fix racy provider registration

Message ID 20230201101559.15529-1-johan+linaro@kernel.org
Headers show
Series interconnect: fix racy provider registration | expand

Message

Johan Hovold Feb. 1, 2023, 10:15 a.m. UTC
The current interconnect provider interface is inherently racy as
providers are expected to be registered before being fully initialised.

This can specifically cause racing DT lookups to fail as I recently
noticed when the Qualcomm cpufreq driver failed to probe:

	of_icc_xlate_onecell: invalid index 0
        cpu cpu0: error -EINVAL: error finding src node
        cpu cpu0: dev_pm_opp_of_find_icc_paths: Unable to get path0: -22
        qcom-cpufreq-hw: probe of 18591000.cpufreq failed with error -22

This only happens very rarely, but the bug is easily reproduced by
increasing the race window by adding an msleep() after registering
osm-l3 interconnect provider.

Note that the Qualcomm cpufreq driver is especially susceptible to this
race as the interconnect path is looked up from the CPU nodes so that
driver core does not guarantee the probe order even when device links
are enabled (which they not always are).

This series adds a new interconnect provider registration API which is
used to fix up the interconnect drivers before removing the old racy
API.

Included are also a number of fixes for other bugs found while preparing
the series.

Johan


Johan Hovold (23):
  interconnect: fix mem leak when freeing nodes
  interconnect: fix icc_provider_del() error handling
  interconnect: fix provider registration API
  interconnect: imx: fix registration race
  interconnect: qcom: osm-l3: fix registration race
  interconnect: qcom: rpm: fix probe child-node error handling
  interconnect: qcom: rpm: fix probe PM domain error handling
  interconnect: qcom: rpm: fix registration race
  interconnect: qcom: rpmh: fix probe child-node error handling
  interconnect: qcom: rpmh: fix registration race
  interconnect: qcom: msm8974: fix registration race
  interconnect: qcom: sm8450: fix registration race
  interconnect: qcom: sm8550: fix registration race
  interconnect: exynos: fix node leak in probe PM QoS error path
  interconnect: exynos: fix registration race
  interconnect: exynos: drop redundant link destroy
  memory: tegra: fix interconnect registration race
  memory: tegra124-emc: fix interconnect registration race
  memory: tegra20-emc: fix interconnect registration race
  memory: tegra30-emc: fix interconnect registration race
  interconnect: drop racy registration API
  interconnect: drop unused icc_get() interface
  interconnect: drop unused icc_link_destroy() interface

 drivers/interconnect/core.c           | 149 +++++---------------------
 drivers/interconnect/imx/imx.c        |  20 ++--
 drivers/interconnect/qcom/icc-rpm.c   |  33 +++---
 drivers/interconnect/qcom/icc-rpmh.c  |  30 ++++--
 drivers/interconnect/qcom/msm8974.c   |  20 ++--
 drivers/interconnect/qcom/osm-l3.c    |  14 ++-
 drivers/interconnect/qcom/sm8450.c    |  22 ++--
 drivers/interconnect/qcom/sm8550.c    |  22 ++--
 drivers/interconnect/samsung/exynos.c |  30 +++---
 drivers/memory/tegra/mc.c             |  16 ++-
 drivers/memory/tegra/tegra124-emc.c   |  12 +--
 drivers/memory/tegra/tegra20-emc.c    |  12 +--
 drivers/memory/tegra/tegra30-emc.c    |  12 +--
 include/linux/interconnect-provider.h |  19 ++--
 include/linux/interconnect.h          |   8 --
 15 files changed, 154 insertions(+), 265 deletions(-)

Comments

Konrad Dybcio Feb. 1, 2023, 11:16 a.m. UTC | #1
On 1.02.2023 11:15, Johan Hovold wrote:
> The interconnect framework currently expects that providers are only
> removed when there are no users and after all nodes have been removed.
> 
> There is currently nothing that guarantees this to be the case and the
> framework does not do any reference counting, but refusing to remove the
> provider is never correct as that would leave a dangling pointer to a
> resource that is about to be released in the global provider list (e.g.
> accessible through debugfs).
> 
> Replace the current sanity checks with WARN_ON() so that the provider is
> always removed.
I spent a considerable amount of time scratching my head what WARN_ON has
to do with removing list items.. I suppose "don't return early and replace
pr_warn with WARN_ON" would have been clearer, but maybe that's just me
in the morning..

> 
> Fixes: 11f1ceca7031 ("interconnect: Add generic on-chip interconnect API")
> Cc: stable@vger.kernel.org      # 5.1: 680f8666baf6: interconnect: Make icc_provider_del() return void
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
>  drivers/interconnect/core.c | 14 ++------------
>  1 file changed, 2 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/interconnect/core.c b/drivers/interconnect/core.c
> index dc61620a0191..43c5c8503ee8 100644
> --- a/drivers/interconnect/core.c
> +++ b/drivers/interconnect/core.c
> @@ -1062,18 +1062,8 @@ EXPORT_SYMBOL_GPL(icc_provider_add);
>  void icc_provider_del(struct icc_provider *provider)
>  {
>  	mutex_lock(&icc_lock);
> -	if (provider->users) {
> -		pr_warn("interconnect provider still has %d users\n",
> -			provider->users);
> -		mutex_unlock(&icc_lock);
> -		return;
> -	}
> -
> -	if (!list_empty(&provider->nodes)) {
> -		pr_warn("interconnect provider still has nodes\n");
> -		mutex_unlock(&icc_lock);
> -		return;
> -	}
> +	WARN_ON(provider->users);
> +	WARN_ON(!list_empty(&provider->nodes));
>  
>  	list_del(&provider->provider_list);
>  	mutex_unlock(&icc_lock);
Krzysztof Kozlowski Feb. 2, 2023, 11:09 a.m. UTC | #2
On 01/02/2023 11:15, Johan Hovold wrote:
> There is no longer any need to explicitly destroy node links as this is
> now done when the node is destroyed as part of icc_nodes_remove().
> 
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
>  drivers/interconnect/samsung/exynos.c | 6 ------
>  1 file changed, 6 deletions(-)


Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>


Best regards,
Krzysztof
Krzysztof Kozlowski Feb. 2, 2023, 11:13 a.m. UTC | #3
On 01/02/2023 11:15, Johan Hovold wrote:
> The current interconnect provider interface is inherently racy as
> providers are expected to be registered before being fully initialised.
> 
> This can specifically cause racing DT lookups to fail as I recently
> noticed when the Qualcomm cpufreq driver failed to probe:
> 
> 	of_icc_xlate_onecell: invalid index 0
>         cpu cpu0: error -EINVAL: error finding src node
>         cpu cpu0: dev_pm_opp_of_find_icc_paths: Unable to get path0: -22
>         qcom-cpufreq-hw: probe of 18591000.cpufreq failed with error -22
> 
> This only happens very rarely, but the bug is easily reproduced by
> increasing the race window by adding an msleep() after registering
> osm-l3 interconnect provider.
> 
> Note that the Qualcomm cpufreq driver is especially susceptible to this
> race as the interconnect path is looked up from the CPU nodes so that
> driver core does not guarantee the probe order even when device links
> are enabled (which they not always are).
> 
> This series adds a new interconnect provider registration API which is
> used to fix up the interconnect drivers before removing the old racy
> API.
> 

So is there a dependency or not? Can you make it clear that I shouldn't
take memory controller bits?

Best regards,
Krzysztof
Johan Hovold Feb. 2, 2023, 12:20 p.m. UTC | #4
On Thu, Feb 02, 2023 at 12:13:33PM +0100, Krzysztof Kozlowski wrote:
> On 01/02/2023 11:15, Johan Hovold wrote:
> > The current interconnect provider interface is inherently racy as
> > providers are expected to be registered before being fully initialised.
> > 
> > This can specifically cause racing DT lookups to fail as I recently
> > noticed when the Qualcomm cpufreq driver failed to probe:
> > 
> > 	of_icc_xlate_onecell: invalid index 0
> >         cpu cpu0: error -EINVAL: error finding src node
> >         cpu cpu0: dev_pm_opp_of_find_icc_paths: Unable to get path0: -22
> >         qcom-cpufreq-hw: probe of 18591000.cpufreq failed with error -22
> > 
> > This only happens very rarely, but the bug is easily reproduced by
> > increasing the race window by adding an msleep() after registering
> > osm-l3 interconnect provider.
> > 
> > Note that the Qualcomm cpufreq driver is especially susceptible to this
> > race as the interconnect path is looked up from the CPU nodes so that
> > driver core does not guarantee the probe order even when device links
> > are enabled (which they not always are).
> > 
> > This series adds a new interconnect provider registration API which is
> > used to fix up the interconnect drivers before removing the old racy
> > API.
> > 
> 
> So is there a dependency or not? Can you make it clear that I shouldn't
> take memory controller bits?

As the fixes depend on the new API it is best if these could all go
through Georgi's tree.

Johan
Krzysztof Kozlowski Feb. 2, 2023, 12:21 p.m. UTC | #5
On 01/02/2023 11:15, Johan Hovold wrote:
> The current interconnect provider registration interface is inherently
> racy as nodes are not added until the after adding the provider. This
> can specifically cause racing DT lookups to fail.
> 
> Switch to using the new API where the provider is not registered until
> after it has been fully initialised.
> 
> Fixes: d5ef16ba5fbe ("memory: tegra20: Support interconnect framework")
> Cc: stable@vger.kernel.org      # 5.11
> Cc: Dmitry Osipenko <digetx@gmail.com>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---


Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>

Best regards,
Krzysztof
Brian Masney Feb. 2, 2023, 11:13 p.m. UTC | #6
On Wed, Feb 01, 2023 at 11:15:47AM +0100, Johan Hovold wrote:
> The current interconnect provider registration interface is inherently
> racy as nodes are not added until the after adding the provider. This
> can specifically cause racing DT lookups to fail.
> 
> Switch to using the new API where the provider is not registered until
> after it has been fully initialised.
> 
> Fixes: 4e60a9568dc6 ("interconnect: qcom: add msm8974 driver")
> Cc: stable@vger.kernel.org      # 5.5
> Cc: Brian Masney <bmasney@redhat.com>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>

Reviewed-by: Brian Masney <bmasney@redhat.com>
Konrad Dybcio Feb. 3, 2023, 2:53 a.m. UTC | #7
On 1.02.2023 11:15, Johan Hovold wrote:
> Make sure to disable clocks also in case attaching the power domain
> fails.
> 
> Fixes: 7de109c0abe9 ("interconnect: icc-rpm: Add support for bus power domain")
> Cc: stable@vger.kernel.org      # 5.17
> Cc: Yassine Oudjana <y.oudjana@protonmail.com>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
>  drivers/interconnect/qcom/icc-rpm.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/interconnect/qcom/icc-rpm.c b/drivers/interconnect/qcom/icc-rpm.c
> index 91778cfcbc65..da595059cafd 100644
> --- a/drivers/interconnect/qcom/icc-rpm.c
> +++ b/drivers/interconnect/qcom/icc-rpm.c
> @@ -498,8 +498,7 @@ int qnoc_probe(struct platform_device *pdev)
>  
>  	if (desc->has_bus_pd) {
>  		ret = dev_pm_domain_attach(dev, true);
> -		if (ret)
> -			return ret;
> +		goto err_disable_clks;
>  	}
>  
>  	provider = &qp->provider;
> @@ -514,8 +513,7 @@ int qnoc_probe(struct platform_device *pdev)
>  	ret = icc_provider_add(provider);
>  	if (ret) {
>  		dev_err(dev, "error adding interconnect provider: %d\n", ret);
> -		clk_bulk_disable_unprepare(qp->num_clks, qp->bus_clks);
> -		return ret;
> +		goto err_disable_clks;
>  	}
>  
>  	for (i = 0; i < num_nodes; i++) {
> @@ -550,8 +548,9 @@ int qnoc_probe(struct platform_device *pdev)
>  	return 0;
>  err:
>  	icc_nodes_remove(provider);
> -	clk_bulk_disable_unprepare(qp->num_clks, qp->bus_clks);
>  	icc_provider_del(provider);
> +err_disable_clks:
> +	clk_bulk_disable_unprepare(qp->num_clks, qp->bus_clks);
>  
>  	return ret;
>  }
Konrad Dybcio Feb. 3, 2023, 2:54 a.m. UTC | #8
On 1.02.2023 11:15, Johan Hovold wrote:
> Make sure to clean up and release resources properly also in case probe
> fails when populating child devices.
> 
> Fixes: 57eb14779dfd ("interconnect: qcom: icc-rpmh: Support child NoC device probe")
> Cc: stable@vger.kernel.org      # 6.0
> Cc: Luca Weiss <luca.weiss@fairphone.com>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
>  drivers/interconnect/qcom/icc-rpmh.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/interconnect/qcom/icc-rpmh.c b/drivers/interconnect/qcom/icc-rpmh.c
> index fd17291c61eb..5168bbf3d92f 100644
> --- a/drivers/interconnect/qcom/icc-rpmh.c
> +++ b/drivers/interconnect/qcom/icc-rpmh.c
> @@ -235,8 +235,11 @@ int qcom_icc_rpmh_probe(struct platform_device *pdev)
>  	platform_set_drvdata(pdev, qp);
>  
>  	/* Populate child NoC devices if any */
> -	if (of_get_child_count(dev->of_node) > 0)
> -		return of_platform_populate(dev->of_node, NULL, NULL, dev);
> +	if (of_get_child_count(dev->of_node) > 0) {
> +		ret = of_platform_populate(dev->of_node, NULL, NULL, dev);
> +		if (ret)
> +			goto err;
> +	}
>  
>  	return 0;
>  err:
Konrad Dybcio Feb. 3, 2023, 2:56 a.m. UTC | #9
On 1.02.2023 11:15, Johan Hovold wrote:
> The current interconnect provider registration interface is inherently
> racy as nodes are not added until the after adding the provider. This
> can specifically cause racing DT lookups to fail.
> 
> Switch to using the new API where the provider is not registered until
> after it has been fully initialised.
> 
> Fixes: 4e60a9568dc6 ("interconnect: qcom: add msm8974 driver")
> Cc: stable@vger.kernel.org      # 5.5
> Cc: Brian Masney <bmasney@redhat.com>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
>  drivers/interconnect/qcom/msm8974.c | 20 ++++++++------------
>  1 file changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/interconnect/qcom/msm8974.c b/drivers/interconnect/qcom/msm8974.c
> index 5ea192f1141d..1828deaca443 100644
> --- a/drivers/interconnect/qcom/msm8974.c
> +++ b/drivers/interconnect/qcom/msm8974.c
> @@ -692,7 +692,6 @@ static int msm8974_icc_probe(struct platform_device *pdev)
>  		return ret;
>  
>  	provider = &qp->provider;
> -	INIT_LIST_HEAD(&provider->nodes);
>  	provider->dev = dev;
>  	provider->set = msm8974_icc_set;
>  	provider->aggregate = icc_std_aggregate;
> @@ -700,11 +699,7 @@ static int msm8974_icc_probe(struct platform_device *pdev)
>  	provider->data = data;
>  	provider->get_bw = msm8974_get_bw;
>  
> -	ret = icc_provider_add(provider);
> -	if (ret) {
> -		dev_err(dev, "error adding interconnect provider: %d\n", ret);
> -		goto err_disable_clks;
> -	}
> +	icc_provider_init(provider);
>  
>  	for (i = 0; i < num_nodes; i++) {
>  		size_t j;
> @@ -712,7 +707,7 @@ static int msm8974_icc_probe(struct platform_device *pdev)
>  		node = icc_node_create(qnodes[i]->id);
>  		if (IS_ERR(node)) {
>  			ret = PTR_ERR(node);
> -			goto err_del_icc;
> +			goto err_remove_nodes;
>  		}
>  
>  		node->name = qnodes[i]->name;
> @@ -729,15 +724,16 @@ static int msm8974_icc_probe(struct platform_device *pdev)
>  	}
>  	data->num_nodes = num_nodes;
>  
> +	ret = icc_provider_register(provider);
> +	if (ret)
> +		goto err_remove_nodes;
> +
>  	platform_set_drvdata(pdev, qp);
>  
>  	return 0;
>  
> -err_del_icc:
> +err_remove_nodes:
>  	icc_nodes_remove(provider);
> -	icc_provider_del(provider);
> -
> -err_disable_clks:
>  	clk_bulk_disable_unprepare(qp->num_clks, qp->bus_clks);
>  
>  	return ret;
> @@ -747,9 +743,9 @@ static int msm8974_icc_remove(struct platform_device *pdev)
>  {
>  	struct msm8974_icc_provider *qp = platform_get_drvdata(pdev);
>  
> +	icc_provider_deregister(&qp->provider);
>  	icc_nodes_remove(&qp->provider);
>  	clk_bulk_disable_unprepare(qp->num_clks, qp->bus_clks);
> -	icc_provider_del(&qp->provider);
>  
>  	return 0;
>  }