[03/13] opp: Keep track of currently programmed OPP

Message ID 96b57316a2a307a5cc5ff7302b3cd0084123a2ed.1611227342.git.viresh.kumar@linaro.org
State New
Series
  • opp: Implement dev_pm_opp_set_opp()

Commit Message

Viresh Kumar Jan. 21, 2021, 11:17 a.m.
The dev_pm_opp_set_rate() helper needs to know the currently programmed
OPP to make a few decisions, and currently we try to find it on every
invocation of this routine.

Let's start keeping track of the current_opp programmed for the devices
of the opp table; that will be quite useful going forward.

If we fail to find the current OPP, we pick the first one available in
the list, as the list is in ascending order of frequencies, level, or
bandwidth and that's the best guess we can make anyway.

Note that we used to do the frequency comparison a bit early in
dev_pm_opp_set_rate() previously, and now instead we check the target
opp, which shall be more accurate anyway.

We need to make sure that current_opp's memory doesn't get freed while
it is being used and so we keep a reference of it until the time it is
used.

Now that current_opp will always be set, we can drop some unnecessary
checks as well.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
 drivers/opp/core.c | 83 +++++++++++++++++++++++++++++-----------------
 drivers/opp/opp.h  |  2 ++
 2 files changed, 55 insertions(+), 30 deletions(-)

-- 
2.25.0.rc1.19.g042ed3e048af
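
The reference handling described in the commit message can be sketched in plain userspace C. This is only an illustrative model, not the kernel code: `struct opp`, `struct opp_table`, `opp_get()`, `opp_put()`, `table_set_current()` and `table_release()` are hypothetical stand-ins for the kernel's `struct dev_pm_opp`, `dev_pm_opp_get()`, `dev_pm_opp_put()` and the table release path, and the refcount is a plain integer rather than a kref. Unlike the patch, the sketch takes the new reference before dropping the old one, so that it also stays safe when both pointers are the same.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dev_pm_opp: a rate and a refcount. */
struct opp {
	unsigned long rate;
	int refcount;
};

/* Hypothetical stand-in for struct opp_table. */
struct opp_table {
	struct opp *current_opp;
};

static void opp_get(struct opp *opp)
{
	opp->refcount++;
}

static void opp_put(struct opp *opp)
{
	/* Dropping a reference that was never taken is a bug. */
	assert(opp->refcount > 0);
	opp->refcount--;
}

/*
 * Make new_opp the tracked OPP. The table holds exactly one reference
 * on whatever current_opp points to, so the old reference is dropped
 * and a new one is taken.
 */
static void table_set_current(struct opp_table *table, struct opp *new_opp)
{
	/* Take the new reference first so new_opp == current_opp is safe. */
	opp_get(new_opp);
	if (table->current_opp)
		opp_put(table->current_opp);
	table->current_opp = new_opp;
}

/* Models the table release path: drop the table's reference, if any. */
static void table_release(struct opp_table *table)
{
	if (table->current_opp) {
		opp_put(table->current_opp);
		table->current_opp = NULL;
	}
}
```

In this model each OPP starts with one reference owned by the list it lives on; the table adds one more while the OPP is current, which is what keeps the memory alive between two calls.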

Comments

Dmitry Osipenko Jan. 21, 2021, 9:41 p.m. | #1
21.01.2021 14:17, Viresh Kumar wrote:
> @@ -1074,15 +1091,18 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
>  
>  	if (!ret) {
>  		ret = _set_opp_bw(opp_table, opp, dev, false);
> -		if (!ret)
> +		if (!ret) {
>  			opp_table->enabled = true;
> +			dev_pm_opp_put(old_opp);
> +
> +			/* Make sure current_opp doesn't get freed */
> +			dev_pm_opp_get(opp);
> +			opp_table->current_opp = opp;
> +		}
>  	}

I'm a bit surprised that _set_opp_bw() isn't used similarly to
_set_opp_voltage() in _generic_set_opp_regulator().

I'd expect the BW requirement to be raised before the clock rate goes UP.
Viresh Kumar Jan. 22, 2021, 4:45 a.m. | #2
On 22-01-21, 00:41, Dmitry Osipenko wrote:
> 21.01.2021 14:17, Viresh Kumar wrote:
> > @@ -1074,15 +1091,18 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
> >  
> >  	if (!ret) {
> >  		ret = _set_opp_bw(opp_table, opp, dev, false);
> > -		if (!ret)
> > +		if (!ret) {
> >  			opp_table->enabled = true;
> > +			dev_pm_opp_put(old_opp);
> > +
> > +			/* Make sure current_opp doesn't get freed */
> > +			dev_pm_opp_get(opp);
> > +			opp_table->current_opp = opp;
> > +		}
> >  	}
> 
> I'm a bit surprised that _set_opp_bw() isn't used similarly to
> _set_opp_voltage() in _generic_set_opp_regulator().
> 
> I'd expect the BW requirement to be raised before the clock rate goes UP.

I remember discussing that earlier when this stuff came in, and this I
believe is the reason for that.

We need to scale regulators before/after frequency because when we
increase the frequency a regulator may _not_ be providing enough power
to sustain that (even for a short while) and this may have undesired
effects on the hardware and so it is important to prevent that
malfunction.

In case of bandwidth such issues will not happen (AFAIK) and doing it
just once is normally enough. It is just about allowing more data to
be transmitted, and won't make the hardware behave badly.

-- 
viresh
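
The regulator ordering described above (raise the voltage before the frequency when scaling up, lower it after the frequency when scaling down) can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel's _generic_set_opp_regulator(): `struct opp`, `struct hw`, `set_volt()`, `set_freq()`, `switch_opp_ordered()` and `switch_opp_freq_first()` are hypothetical names, and the voltage figures used with it are made up for illustration.

```c
#include <assert.h>

/* Hypothetical OPP: a frequency and the voltage it needs. */
struct opp {
	unsigned long freq_hz;
	unsigned long volt_uv;
};

/* Hypothetical hardware state plus a counter of undervolt moments. */
struct hw {
	unsigned long freq_hz;
	unsigned long volt_uv;
	unsigned long need_uv;	/* voltage required by the running frequency */
	int violations;		/* times the supply was below need_uv */
};

static void check(struct hw *hw)
{
	if (hw->volt_uv < hw->need_uv)
		hw->violations++;
}

static void set_volt(struct hw *hw, unsigned long uv)
{
	hw->volt_uv = uv;
	check(hw);
}

static void set_freq(struct hw *hw, const struct opp *opp)
{
	hw->freq_hz = opp->freq_hz;
	hw->need_uv = opp->volt_uv;
	check(hw);
}

/* Voltage before frequency when going up, after it when going down. */
static void switch_opp_ordered(struct hw *hw, const struct opp *old_opp,
			       const struct opp *new_opp)
{
	if (new_opp->freq_hz >= old_opp->freq_hz) {
		set_volt(hw, new_opp->volt_uv);
		set_freq(hw, new_opp);
	} else {
		set_freq(hw, new_opp);
		set_volt(hw, new_opp->volt_uv);
	}
}

/* Naive variant: always change the frequency first. */
static void switch_opp_freq_first(struct hw *hw, const struct opp *new_opp)
{
	set_freq(hw, new_opp);
	set_volt(hw, new_opp->volt_uv);
}
```

With the ordered variant the supply never drops below what the running frequency needs in either direction, while the frequency-first variant leaves a window of undervolting on every upward switch.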
Dmitry Osipenko Jan. 22, 2021, 2:31 p.m. | #3
22.01.2021 07:45, Viresh Kumar wrote:
> On 22-01-21, 00:41, Dmitry Osipenko wrote:
>> 21.01.2021 14:17, Viresh Kumar wrote:
>>> @@ -1074,15 +1091,18 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
>>>  
>>>  	if (!ret) {
>>>  		ret = _set_opp_bw(opp_table, opp, dev, false);
>>> -		if (!ret)
>>> +		if (!ret) {
>>>  			opp_table->enabled = true;
>>> +			dev_pm_opp_put(old_opp);
>>> +
>>> +			/* Make sure current_opp doesn't get freed */
>>> +			dev_pm_opp_get(opp);
>>> +			opp_table->current_opp = opp;
>>> +		}
>>>  	}
>>
>> I'm a bit surprised that _set_opp_bw() isn't used similarly to
>> _set_opp_voltage() in _generic_set_opp_regulator().
>>
>> I'd expect the BW requirement to be raised before the clock rate goes UP.
> 
> I remember discussing that earlier when this stuff came in, and this I
> believe is the reason for that.
> 
> We need to scale regulators before/after frequency because when we
> increase the frequency a regulator may _not_ be providing enough power
> to sustain that (even for a short while) and this may have undesired
> effects on the hardware and so it is important to prevent that
> malfunction.
> 
> In case of bandwidth such issues will not happen (AFAIK) and doing it
> just once is normally enough. It is just about allowing more data to
> be transmitted, and won't make the hardware behave badly.
> 

This may not be true for all kinds of hardware; a display controller is
one example. If the display's pixclock is raised before the memory bandwidth
of the display's memory client, the display controller may hit a memory
underflow because it can't fetch memory fast enough, and it's not possible
to pause data transmission to the display panel. The panel may then get out
of sync, and a full hardware reset is needed to recover. At least this is
the case for NVIDIA Tegra SoCs.

I guess it's not a real problem for any of OPP API users right now, but
this is something to keep in mind.
Viresh Kumar Jan. 25, 2021, 3:12 a.m. | #4
On 22-01-21, 17:31, Dmitry Osipenko wrote:
> This may not be true for all kinds of hardware, a display controller is
> one example. If display's pixclock is raised before the memory bandwidth
> of the display's memory client, then display controller may get a memory
> underflow since it won't be able to fetch memory fast enough and it's
> not possible to pause data transmission to display panel, hence display
> panel may get out of sync and a full hardware reset will be needed in
> order to recover. At least this is the case for NVIDIA Tegra SoCs.

Hmm, but I expected that the request for more data would only come after the
opp-set-rate has finished and not in between. Maybe I am wrong. There is
nothing wrong in doing it the regulator way if required.

> I guess it's not a real problem for any of OPP API users right now, but
> this is something to keep in mind.

Sure, I am not against it. Just that we thought it isn't worth the code.

-- 
viresh
Akhil P Oommen Jan. 27, 2021, 4:31 p.m. | #5
On 1/22/2021 10:15 AM, Viresh Kumar wrote:
> On 22-01-21, 00:41, Dmitry Osipenko wrote:
>> 21.01.2021 14:17, Viresh Kumar wrote:
>>> @@ -1074,15 +1091,18 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
>>>   
>>>   	if (!ret) {
>>>   		ret = _set_opp_bw(opp_table, opp, dev, false);
>>> -		if (!ret)
>>> +		if (!ret) {
>>>   			opp_table->enabled = true;
>>> +			dev_pm_opp_put(old_opp);
>>> +
>>> +			/* Make sure current_opp doesn't get freed */
>>> +			dev_pm_opp_get(opp);
>>> +			opp_table->current_opp = opp;
>>> +		}
>>>   	}
>>
>> I'm a bit surprised that _set_opp_bw() isn't used similarly to
>> _set_opp_voltage() in _generic_set_opp_regulator().
>>
>> I'd expect the BW requirement to be raised before the clock rate goes UP.
> 
> I remember discussing that earlier when this stuff came in, and this I
> believe is the reason for that.
> 
> We need to scale regulators before/after frequency because when we
> increase the frequency a regulator may _not_ be providing enough power
> to sustain that (even for a short while) and this may have undesired
> effects on the hardware and so it is important to prevent that
> malfunction.
> 
> In case of bandwidth such issues will not happen (AFAIK) and doing it
> just once is normally enough. It is just about allowing more data to
> be transmitted, and won't make the hardware behave badly.
> 

I agree with Dmitry. BW is a shared resource in a lot of architectures.
Raising clk before increasing the bw can lead to a scenario where this
client saturates the entire BW, for however small a duration it may be.
This will impact the latency requirements of other clients.

-Akhil.
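
The shared-bandwidth concern above can be made concrete with a toy model of an interconnect that is provisioned for the sum of its clients' votes. This is only a sketch with assumed names and units (`struct bus`, `bus_set_vote()`, `client_set_clock()`, `scale_bw_aware()`, `scale_clock_first()`, bandwidth in MB/s); it does not model any real interconnect driver, and it treats a client's demand as simply proportional to its clock.

```c
#include <assert.h>

#define NUM_CLIENTS 2

/* Hypothetical shared bus, provisioned for the sum of all votes (MB/s). */
struct bus {
	unsigned long vote[NUM_CLIENTS];	/* bandwidth each client requested */
	unsigned long demand[NUM_CLIENTS];	/* bandwidth each client consumes */
	int oversubscribed;			/* times demand exceeded the votes */
};

static void bus_check(struct bus *bus)
{
	unsigned long votes = 0, demand = 0;
	int i;

	for (i = 0; i < NUM_CLIENTS; i++) {
		votes += bus->vote[i];
		demand += bus->demand[i];
	}
	if (demand > votes)
		bus->oversubscribed++;
}

static void bus_set_vote(struct bus *bus, int client, unsigned long mbps)
{
	assert(client >= 0 && client < NUM_CLIENTS);
	bus->vote[client] = mbps;
	bus_check(bus);
}

/* A faster clock lets the client pull this much data from the bus. */
static void client_set_clock(struct bus *bus, int client, unsigned long mbps)
{
	assert(client >= 0 && client < NUM_CLIENTS);
	bus->demand[client] = mbps;
	bus_check(bus);
}

/* "The regulator way": vote first when going up, last when going down. */
static void scale_bw_aware(struct bus *bus, int client, unsigned long mbps)
{
	if (mbps >= bus->vote[client]) {
		bus_set_vote(bus, client, mbps);
		client_set_clock(bus, client, mbps);
	} else {
		client_set_clock(bus, client, mbps);
		bus_set_vote(bus, client, mbps);
	}
}

/* Clock first, bandwidth vote afterwards, regardless of direction. */
static void scale_clock_first(struct bus *bus, int client, unsigned long mbps)
{
	client_set_clock(bus, client, mbps);
	bus_set_vote(bus, client, mbps);
}
```

In this model the clock-first ordering briefly lets one client consume more than the bus was provisioned for, which is the window in which the other client's share is squeezed.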
Viresh Kumar Jan. 28, 2021, 4:14 a.m. | #6
On 27-01-21, 22:01, Akhil P Oommen wrote:
> On 1/22/2021 10:15 AM, Viresh Kumar wrote:
> > On 22-01-21, 00:41, Dmitry Osipenko wrote:
> > > 21.01.2021 14:17, Viresh Kumar wrote:
> > > > @@ -1074,15 +1091,18 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
> > > >   	if (!ret) {
> > > >   		ret = _set_opp_bw(opp_table, opp, dev, false);
> > > > -		if (!ret)
> > > > +		if (!ret) {
> > > >   			opp_table->enabled = true;
> > > > +			dev_pm_opp_put(old_opp);
> > > > +
> > > > +			/* Make sure current_opp doesn't get freed */
> > > > +			dev_pm_opp_get(opp);
> > > > +			opp_table->current_opp = opp;
> > > > +		}
> > > >   	}
> > > 
> > > I'm a bit surprised that _set_opp_bw() isn't used similarly to
> > > _set_opp_voltage() in _generic_set_opp_regulator().
> > > 
> > > I'd expect the BW requirement to be raised before the clock rate goes UP.
> > 
> > I remember discussing that earlier when this stuff came in, and this I
> > believe is the reason for that.
> > 
> > We need to scale regulators before/after frequency because when we
> > increase the frequency a regulator may _not_ be providing enough power
> > to sustain that (even for a short while) and this may have undesired
> > effects on the hardware and so it is important to prevent that
> > malfunction.
> > 
> > In case of bandwidth such issues will not happen (AFAIK) and doing it
> > just once is normally enough. It is just about allowing more data to
> > be transmitted, and won't make the hardware behave badly.
> > 
> I agree with Dmitry. BW is a shared resource in a lot of architectures.
> Raising clk before increasing the bw can lead to a scenario where this
> client saturate the entire BW for whatever small duration it may be. This
> will impact the latency requirements of other clients.

I see. I will make the necessary changes then to fix it. Thanks guys.

-- 
viresh
Ionela Voinescu July 7, 2021, 10:24 a.m. | #7
Hi,

On Thursday 21 Jan 2021 at 16:47:43 (+0530), Viresh Kumar wrote:
> The dev_pm_opp_set_rate() helper needs to know the currently programmed
> OPP to make few decisions and currently we try to find it on every
> invocation of this routine.
> 
> Lets start keeping track of the current_opp programmed for the devices
> of the opp table, that will be quite useful going forward.
> 
> If we fail to find the current OPP, we pick the first one available in
> the list, as the list is in ascending order of frequencies, level, or
> bandwidth and that's the best guess we can make anyway.
> 
> Note that we used to do the frequency comparison a bit early in
> dev_pm_opp_set_rate() previously, and now instead we check the target
> opp, which shall be more accurate anyway.
> 
> We need to make sure that current_opp's memory doesn't get freed while
> it is being used and so we keep a reference of it until the time it is
> used.
> 
> Now that current_opp will always be set, we can drop some unnecessary
> checks as well.
> 

I'm seeing some intermittent issues on Hikey960 after this patch,
which reproduces as follows. I've used v5.13 for my testing.


root@buildroot:~# while true; do \
>     cd /sys/devices/system/cpu/cpufreq/; \
>     for policy in policy*; do \
>            cd /sys/devices/system/cpu/cpufreq/$policy; \
>            for freq in $(cat scaling_available_frequencies); do \
>                 echo "userspace" > scaling_governor; \
>                 sleep 1; \
>                 echo $freq > scaling_setspeed; \
>                 sleep 1; \
>                 cpu="${policy: -1}"; \
>                 mask="0x$(printf '%x\n' $((1 << $cpu)))"; \
>                 sysev=$(~/taskset $mask ~/sysbench run --test=cpu --max-time=1 | grep "total number of events"); \
>                 delivered=$(cat cpuinfo_cur_freq); \
>                 if [ "$freq" != "$delivered" ]; then \
>                         echo "CPU$cpu - $freq setting failed: delivered $delivered, sysevents: $sysev"; \
>                 else \
>                         echo "CPU$cpu - $freq setting succeeded: delivered $delivered, sysevents: $sysev"; \
>                 fi; \
>                 echo "schedutil" > scaling_governor; \
>                 sleep 1; \
>                 done; done; done;


CPU0 - 533000 setting succeeded: delivered 533000, sysevents:     total number of events:              112
CPU0 - 999000 setting succeeded: delivered 999000, sysevents:     total number of events:              209
CPU0 - 1402000 setting succeeded: delivered 1402000, sysevents:     total number of events:              293
CPU0 - 1709000 setting succeeded: delivered 1709000, sysevents:     total number of events:              357
CPU0 - 1844000 setting succeeded: delivered 1844000, sysevents:     total number of events:              385
CPU4 - 903000 setting succeeded: delivered 903000, sysevents:     total number of events:              249
CPU4 - 1421000 setting succeeded: delivered 1421000, sysevents:     total number of events:              395
CPU4 - 1805000 setting succeeded: delivered 1805000, sysevents:     total number of events:              502
CPU4 - 2112000 setting succeeded: delivered 2112000, sysevents:     total number of events:              588
CPU4 - 2362000 setting succeeded: delivered 2362000, sysevents:     total number of events:              657

This is an example of good behavior of changing frequencies. I'm putting
this here first to show the sysbench results for each frequency, which is
helping me make sure that the performance matches the new set frequency.

Notes: the change to the schedutil governor after each userspace driven
frequency change was added because if the change is always to higher
frequencies, the issue does not reproduce as easily; the sleep commands
are added just to make sure the change gets the time to take effect.

From time to time (7/400 fail rate), I get the following failures:

CPU0 - 533000 setting failed: delivered 1402000, sysevents:     total number of events:              293
CPU0 - 1402000 setting failed: delivered 533000, sysevents:     total number of events:              112
CPU0 - 1402000 setting failed: delivered 533000, sysevents:     total number of events:              112
CPU4 - 903000 setting failed: delivered 1421000, sysevents:     total number of events:              394
CPU4 - 1805000 setting failed: delivered 903000, sysevents:     total number of events:              249
CPU0 - 533000 setting failed: delivered 1402000, sysevents:     total number of events:              293
CPU4 - 1805000 setting failed: delivered 903000, sysevents:     total number of events:              251

Now comes the interesting part: what seems to fix it is a call to
clk_get_rate(opp_table->clk) in _set_opp(), which is what basically
happened before this patch, as _find_current_opp() was always called.
I do not need to do anything with the returned frequency.

Therefore, by adding:
diff --git a/drivers/opp/core.c b/drivers/opp/core.c
index e366218d6736..2fdaf97f7ded 100644
--- a/drivers/opp/core.c
+++ b/drivers/opp/core.c
@@ -987,6 +987,7 @@ static int _set_opp(struct device *dev, struct opp_table *opp_table,
 {
        struct dev_pm_opp *old_opp;
        int scaling_down, ret;
+       unsigned long cur_freq;

        if (unlikely(!opp))
                return _disable_opp_table(dev, opp_table);
@@ -994,6 +995,13 @@ static int _set_opp(struct device *dev, struct opp_table *opp_table,
        /* Find the currently set OPP if we don't know already */
        if (unlikely(!opp_table->current_opp))
                _find_current_opp(dev, opp_table);
+       else if (!IS_ERR(opp_table->clk)) {
+                       cur_freq = clk_get_rate(opp_table->clk);
+                       if (opp_table->current_rate != cur_freq)
+                               pr_err("OPP mismatch: %lu vs %lu!",
+                                      opp_table->current_rate,
+                                      cur_freq);
+               }

        old_opp = opp_table->current_opp;

... it does seem to solve the problem (no failures in 1000 frequency changes),
although I do get a few OPP mismatch logs:

[  667.495112] core: OPP mismatch: 1709000000 vs 1402000000!
[ 7260.656154] core: OPP mismatch: 1421000000 vs 903000000!
[ 7260.727717] core: OPP mismatch: 903000000 vs 1421000000!
[ 8847.304323] core: OPP mismatch: 1709000000 vs 1402000000!


I'm not sure what is happening here so I'm hoping you guys have more
knowledge to steer debugging in the right direction.

Note that I'm running an equivalent variant of this test on
multiple boards, and none of them have issues, except for Hikey960.

Thanks,
Ionela.


> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  drivers/opp/core.c | 83 +++++++++++++++++++++++++++++-----------------
>  drivers/opp/opp.h  |  2 ++
>  2 files changed, 55 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/opp/core.c b/drivers/opp/core.c
> index cb5b67ccf5cf..4ee598344e6a 100644
> --- a/drivers/opp/core.c
> +++ b/drivers/opp/core.c
> @@ -788,8 +788,7 @@ static int _generic_set_opp_regulator(struct opp_table *opp_table,
>  			__func__, old_freq);
>  restore_voltage:
>  	/* This shouldn't harm even if the voltages weren't updated earlier */
> -	if (old_supply)
> -		_set_opp_voltage(dev, reg, old_supply);
> +	_set_opp_voltage(dev, reg, old_supply);
>  
>  	return ret;
>  }
> @@ -839,10 +838,7 @@ static int _set_opp_custom(const struct opp_table *opp_table,
>  
>  	data->old_opp.rate = old_freq;
>  	size = sizeof(*old_supply) * opp_table->regulator_count;
> -	if (!old_supply)
> -		memset(data->old_opp.supplies, 0, size);
> -	else
> -		memcpy(data->old_opp.supplies, old_supply, size);
> +	memcpy(data->old_opp.supplies, old_supply, size);
>  
>  	data->new_opp.rate = freq;
>  	memcpy(data->new_opp.supplies, new_supply, size);
> @@ -943,6 +939,31 @@ int dev_pm_opp_set_bw(struct device *dev, struct dev_pm_opp *opp)
>  }
>  EXPORT_SYMBOL_GPL(dev_pm_opp_set_bw);
>  
> +static void _find_current_opp(struct device *dev, struct opp_table *opp_table)
> +{
> +	struct dev_pm_opp *opp = ERR_PTR(-ENODEV);
> +	unsigned long freq;
> +
> +	if (!IS_ERR(opp_table->clk)) {
> +		freq = clk_get_rate(opp_table->clk);
> +		opp = _find_freq_ceil(opp_table, &freq);
> +	}
> +
> +	/*
> +	 * Unable to find the current OPP ? Pick the first from the list since
> +	 * it is in ascending order, otherwise rest of the code will need to
> +	 * make special checks to validate current_opp.
> +	 */
> +	if (IS_ERR(opp)) {
> +		mutex_lock(&opp_table->lock);
> +		opp = list_first_entry(&opp_table->opp_list, struct dev_pm_opp, node);
> +		dev_pm_opp_get(opp);
> +		mutex_unlock(&opp_table->lock);
> +	}
> +
> +	opp_table->current_opp = opp;
> +}
> +
>  static int _disable_opp_table(struct device *dev, struct opp_table *opp_table)
>  {
>  	int ret;
> @@ -1004,16 +1025,6 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
>  	if ((long)freq <= 0)
>  		freq = target_freq;
>  
> -	old_freq = clk_get_rate(opp_table->clk);
> -
> -	/* Return early if nothing to do */
> -	if (opp_table->enabled && old_freq == freq) {
> -		dev_dbg(dev, "%s: old/new frequencies (%lu Hz) are same, nothing to do\n",
> -			__func__, freq);
> -		ret = 0;
> -		goto put_opp_table;
> -	}
> -
>  	/*
>  	 * For IO devices which require an OPP on some platforms/SoCs
>  	 * while just needing to scale the clock on some others
> @@ -1026,12 +1037,9 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
>  		goto put_opp_table;
>  	}
>  
> -	temp_freq = old_freq;
> -	old_opp = _find_freq_ceil(opp_table, &temp_freq);
> -	if (IS_ERR(old_opp)) {
> -		dev_err(dev, "%s: failed to find current OPP for freq %lu (%ld)\n",
> -			__func__, old_freq, PTR_ERR(old_opp));
> -	}
> +	/* Find the currently set OPP if we don't know already */
> +	if (unlikely(!opp_table->current_opp))
> +		_find_current_opp(dev, opp_table);
>  
>  	temp_freq = freq;
>  	opp = _find_freq_ceil(opp_table, &temp_freq);
> @@ -1039,7 +1047,17 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
>  		ret = PTR_ERR(opp);
>  		dev_err(dev, "%s: failed to find OPP for freq %lu (%d)\n",
>  			__func__, freq, ret);
> -		goto put_old_opp;
> +		goto put_opp_table;
> +	}
> +
> +	old_opp = opp_table->current_opp;
> +	old_freq = old_opp->rate;
> +
> +	/* Return early if nothing to do */
> +	if (opp_table->enabled && old_opp == opp) {
> +		dev_dbg(dev, "%s: OPPs are same, nothing to do\n", __func__);
> +		ret = 0;
> +		goto put_opp;
>  	}
>  
>  	dev_dbg(dev, "%s: switching OPP: %lu Hz --> %lu Hz\n", __func__,
> @@ -1054,11 +1072,10 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
>  
>  	if (opp_table->set_opp) {
>  		ret = _set_opp_custom(opp_table, dev, old_freq, freq,
> -				      IS_ERR(old_opp) ? NULL : old_opp->supplies,
> -				      opp->supplies);
> +				      old_opp->supplies, opp->supplies);
>  	} else if (opp_table->regulators) {
>  		ret = _generic_set_opp_regulator(opp_table, dev, old_freq, freq,
> -						 IS_ERR(old_opp) ? NULL : old_opp->supplies,
> +						 old_opp->supplies,
>  						 opp->supplies);
>  	} else {
>  		/* Only frequency scaling */
> @@ -1074,15 +1091,18 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
>  
>  	if (!ret) {
>  		ret = _set_opp_bw(opp_table, opp, dev, false);
> -		if (!ret)
> +		if (!ret) {
>  			opp_table->enabled = true;
> +			dev_pm_opp_put(old_opp);
> +
> +			/* Make sure current_opp doesn't get freed */
> +			dev_pm_opp_get(opp);
> +			opp_table->current_opp = opp;
> +		}
>  	}
>  
>  put_opp:
>  	dev_pm_opp_put(opp);
> -put_old_opp:
> -	if (!IS_ERR(old_opp))
> -		dev_pm_opp_put(old_opp);
>  put_opp_table:
>  	dev_pm_opp_put_opp_table(opp_table);
>  	return ret;
> @@ -1276,6 +1296,9 @@ static void _opp_table_kref_release(struct kref *kref)
>  	list_del(&opp_table->node);
>  	mutex_unlock(&opp_table_lock);
>  
> +	if (opp_table->current_opp)
> +		dev_pm_opp_put(opp_table->current_opp);
> +
>  	_of_clear_opp_table(opp_table);
>  
>  	/* Release clk */
> diff --git a/drivers/opp/opp.h b/drivers/opp/opp.h
> index 4408cfcb0f31..359fd89d5770 100644
> --- a/drivers/opp/opp.h
> +++ b/drivers/opp/opp.h
> @@ -135,6 +135,7 @@ enum opp_table_access {
>   * @clock_latency_ns_max: Max clock latency in nanoseconds.
>   * @parsed_static_opps: Count of devices for which OPPs are initialized from DT.
>   * @shared_opp: OPP is shared between multiple devices.
> + * @current_opp: Currently configured OPP for the table.
>   * @suspend_opp: Pointer to OPP to be used during device suspend.
>   * @genpd_virt_dev_lock: Mutex protecting the genpd virtual device pointers.
>   * @genpd_virt_devs: List of virtual devices for multiple genpd support.
> @@ -183,6 +184,7 @@ struct opp_table {
>  
>  	unsigned int parsed_static_opps;
>  	enum opp_table_access shared_opp;
> +	struct dev_pm_opp *current_opp;
>  	struct dev_pm_opp *suspend_opp;
>  
>  	struct mutex genpd_virt_dev_lock;
> -- 
> 2.25.0.rc1.19.g042ed3e048af
>
Viresh Kumar July 8, 2021, 7:53 a.m. | #8
On 07-07-21, 11:24, Ionela Voinescu wrote:
> Now comes the interesting part: what seems to fix it is a call to
> clk_get_rate(opp_table->clk) in _set_opp(), which is what basically
> happened before this patch, as _find_current_opp() was always called.
> I do not need to do anything with the returned frequency.

Wow, thanks for narrowing it down this far :)

I had a quick look and this is what I think is the problem here.

This platform uses the mailbox API to send its frequency change requests to another
processor.  And the way it is written currently, I don't see any guarantee
whatsoever which says

  "once clk_set_rate() returns, the frequency would have already changed".

And this may exactly be the thing you are able to hit, luckily because of this
patchset :)

As a quick way of checking if that is right or not, this may make it work:

diff --git a/drivers/mailbox/hi3660-mailbox.c b/drivers/mailbox/hi3660-mailbox.c
index 395ddc250828..9856c1c84dcf 100644
--- a/drivers/mailbox/hi3660-mailbox.c
+++ b/drivers/mailbox/hi3660-mailbox.c
@@ -201,6 +201,9 @@ static int hi3660_mbox_send_data(struct mbox_chan *chan, void *msg)

        /* Trigger data transferring */
        writel(BIT(mchan->ack_irq), base + MBOX_SEND_REG);
+
+       hi3660_mbox_check_state(chan);
+
        return 0;
 }

-------------------------8<-------------------------

As a proper fix, something like this (not even compile tested) is required I
believe, as I don't see how the clients would know if the transfer is over. Cc'ing
mailbox guys to see what can be done.

diff --git a/drivers/clk/hisilicon/clk-hi3660-stub.c b/drivers/clk/hisilicon/clk-hi3660-stub.c
index 3a653d54bee0..c1e62ea4cf01 100644
--- a/drivers/clk/hisilicon/clk-hi3660-stub.c
+++ b/drivers/clk/hisilicon/clk-hi3660-stub.c
@@ -89,7 +89,6 @@ static int hi3660_stub_clk_set_rate(struct clk_hw *hw, unsigned long rate,
                stub_clk->msg[0], stub_clk->msg[1]);
 
        mbox_send_message(stub_clk_chan.mbox, stub_clk->msg);
-       mbox_client_txdone(stub_clk_chan.mbox, 0);
 
        stub_clk->rate = rate;
        return 0;
@@ -131,7 +130,7 @@ static int hi3660_stub_clk_probe(struct platform_device *pdev)
        /* Use mailbox client without blocking */
        stub_clk_chan.cl.dev = dev;
        stub_clk_chan.cl.tx_done = NULL;
-       stub_clk_chan.cl.tx_block = false;
+       stub_clk_chan.cl.tx_block = true;
        stub_clk_chan.cl.knows_txdone = false;
 
        /* Allocate mailbox channel */
diff --git a/drivers/mailbox/hi3660-mailbox.c b/drivers/mailbox/hi3660-mailbox.c
index 395ddc250828..8f6b787c0aba 100644
--- a/drivers/mailbox/hi3660-mailbox.c
+++ b/drivers/mailbox/hi3660-mailbox.c
@@ -1,5 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
-// Copyright (c) 2017-2018 HiSilicon Limited.
+// Copyright (c) 2017-2018 Hisilicon Limited.
 // Copyright (c) 2017-2018 Linaro Limited.
 
 #include <linux/bitops.h>
@@ -83,7 +83,7 @@ static struct hi3660_mbox *to_hi3660_mbox(struct mbox_controller *mbox)
        return container_of(mbox, struct hi3660_mbox, controller);
 }
 
-static int hi3660_mbox_check_state(struct mbox_chan *chan)
+static bool hi3660_mbox_last_tx_done(struct mbox_chan *chan)
 {
        unsigned long ch = (unsigned long)chan->con_priv;
        struct hi3660_mbox *mbox = to_hi3660_mbox(chan->mbox);
@@ -94,20 +94,20 @@ static int hi3660_mbox_check_state(struct mbox_chan *chan)
 
        /* Mailbox is ready to use */
        if (readl(base + MBOX_MODE_REG) & MBOX_STATE_READY)
-               return 0;
+               return true;
 
        /* Wait for acknowledge from remote */
        ret = readx_poll_timeout_atomic(readl, base + MBOX_MODE_REG,
                        val, (val & MBOX_STATE_ACK), 1000, 300000);
        if (ret) {
                dev_err(mbox->dev, "%s: timeout for receiving ack\n", __func__);
-               return ret;
+               return false;
        }
 
        /* clear ack state, mailbox will get back to ready state */
        writel(BIT(mchan->ack_irq), base + MBOX_ICLR_REG);
 
-       return 0;
+       return true;
 }
 
 static int hi3660_mbox_unlock(struct mbox_chan *chan)
@@ -182,10 +182,6 @@ static int hi3660_mbox_send_data(struct mbox_chan *chan, void *msg)
        unsigned int i;
        int ret;
 
-       ret = hi3660_mbox_check_state(chan);
-       if (ret)
-               return ret;
-
        /* Clear mask for destination interrupt */
        writel_relaxed(~BIT(mchan->dst_irq), base + MBOX_IMASK_REG);
 
@@ -207,6 +203,7 @@ static int hi3660_mbox_send_data(struct mbox_chan *chan, void *msg)
 static const struct mbox_chan_ops hi3660_mbox_ops = {
        .startup        = hi3660_mbox_startup,
        .send_data      = hi3660_mbox_send_data,
+       .last_tx_done   = hi3660_mbox_last_tx_done,
 };
 
 static struct mbox_chan *hi3660_mbox_xlate(struct mbox_controller *controller,
@@ -259,6 +256,7 @@ static int hi3660_mbox_probe(struct platform_device *pdev)
        mbox->controller.num_chans = MBOX_CHAN_MAX;
        mbox->controller.ops = &hi3660_mbox_ops;
        mbox->controller.of_xlate = hi3660_mbox_xlate;
+       mbox->controller.txdone_poll = true;
 
        /* Initialize mailbox channel data */
        chan = mbox->chan;

-- 
viresh
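
The gap described above, where clk_set_rate() returns before the remote processor has acted on the request, can be modelled in a few lines of userspace C. This is a deliberately simplified sketch and not the hi3660 driver: `struct remote`, `struct stub_clk`, `remote_process()`, `set_rate_nowait()` and `set_rate_blocking()` are hypothetical names, and the polling loop stands in for waiting on the mailbox acknowledgement.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical remote processor that applies requests asynchronously. */
struct remote {
	unsigned long applied_rate;	/* rate the hardware really runs at */
	unsigned long pending_rate;	/* request sitting in the mailbox */
	bool busy;			/* true until the request is acked */
};

/* Hypothetical stub clock that only caches what it asked for. */
struct stub_clk {
	struct remote *remote;
	unsigned long cached_rate;	/* what the clock driver reports */
};

/* The remote side picks up the pending request and acknowledges it. */
static void remote_process(struct remote *r)
{
	if (r->busy) {
		r->applied_rate = r->pending_rate;
		r->busy = false;
	}
}

/* Fire and forget: returns as soon as the message is queued. */
static void set_rate_nowait(struct stub_clk *clk, unsigned long rate)
{
	clk->remote->pending_rate = rate;
	clk->remote->busy = true;
	clk->cached_rate = rate;
}

/* Blocking variant: do not return until the remote has acknowledged. */
static void set_rate_blocking(struct stub_clk *clk, unsigned long rate)
{
	set_rate_nowait(clk, rate);
	/* Stand-in for polling the ack bit; each poll lets the remote run. */
	while (clk->remote->busy)
		remote_process(clk->remote);
}
```

With the non-blocking call the cached rate and the applied rate disagree until the remote side runs, which matches the kind of mismatch seen in the logs; the blocking call closes that window at the cost of waiting inside the set-rate path.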
Ionela Voinescu July 9, 2021, 8:57 a.m. | #9
Hi Viresh,

On Thursday 08 Jul 2021 at 13:23:53 (+0530), Viresh Kumar wrote:
> On 07-07-21, 11:24, Ionela Voinescu wrote:
> > Now comes the interesting part: what seems to fix it is a call to
> > clk_get_rate(opp_table->clk) in _set_opp(), which is what basically
> > happened before this patch, as _find_current_opp() was always called.
> > I do not need to do anything with the returned frequency.
> 
> Wow, thanks for narrowing it down this far :)
> 
> I had a quick look and this is what I think is the problem here.
> 
> This platform uses mailbox API to send its frequency change requests to another
> processor.  And the way it is written currently, I don't see any guarantee
> whatsoever which say
> 
>   "once clk_set_rate() returns, the frequency would have already changed".
> 

I think what was strange to me was that the frequency never seems to
change, there isn't just a delay in the new frequency taking effect, as
I would expect in these cases. Or if there is a delay, that's quite large
- at least a second.

> And this may exactly be the thing you are able to hit, luckily because of this
> patchset :)
> 
> As a quick way of checking if that is right or not, this may make it work:
> 
> diff --git a/drivers/mailbox/hi3660-mailbox.c b/drivers/mailbox/hi3660-mailbox.c
> index 395ddc250828..9856c1c84dcf 100644
> --- a/drivers/mailbox/hi3660-mailbox.c
> +++ b/drivers/mailbox/hi3660-mailbox.c
> @@ -201,6 +201,9 @@ static int hi3660_mbox_send_data(struct mbox_chan *chan, void *msg)
> 
>         /* Trigger data transferring */
>         writel(BIT(mchan->ack_irq), base + MBOX_SEND_REG);
> +
> +       hi3660_mbox_check_state(chan);
> +

I gave this a try and it does work for me.

>         return 0;
>  }
> 
> -------------------------8<-------------------------
> 
> As a proper fix, something like this (not even compile tested) is required I
> believe as I don't see the clients would know if the transfer is over. Cc'ing
> mailbox guys to see what can be done.
> 

I'll give this a try as well when there is consensus. I might even try to
review it, if time allows.

Many thanks,
Ionela.

> diff --git a/drivers/clk/hisilicon/clk-hi3660-stub.c b/drivers/clk/hisilicon/clk-hi3660-stub.c
> index 3a653d54bee0..c1e62ea4cf01 100644
> --- a/drivers/clk/hisilicon/clk-hi3660-stub.c
> +++ b/drivers/clk/hisilicon/clk-hi3660-stub.c
> @@ -89,7 +89,6 @@ static int hi3660_stub_clk_set_rate(struct clk_hw *hw, unsigned long rate,
>                 stub_clk->msg[0], stub_clk->msg[1]);
>  
>         mbox_send_message(stub_clk_chan.mbox, stub_clk->msg);
> -       mbox_client_txdone(stub_clk_chan.mbox, 0);
>  
>         stub_clk->rate = rate;
>         return 0;
> @@ -131,7 +130,7 @@ static int hi3660_stub_clk_probe(struct platform_device *pdev)
>         /* Use mailbox client without blocking */
>         stub_clk_chan.cl.dev = dev;
>         stub_clk_chan.cl.tx_done = NULL;
> -       stub_clk_chan.cl.tx_block = false;
> +       stub_clk_chan.cl.tx_block = true;
>         stub_clk_chan.cl.knows_txdone = false;
>  
>         /* Allocate mailbox channel */
> diff --git a/drivers/mailbox/hi3660-mailbox.c b/drivers/mailbox/hi3660-mailbox.c
> index 395ddc250828..8f6b787c0aba 100644
> --- a/drivers/mailbox/hi3660-mailbox.c
> +++ b/drivers/mailbox/hi3660-mailbox.c
> @@ -1,5 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0
> -// Copyright (c) 2017-2018 HiSilicon Limited.
> +// Copyright (c) 2017-2018 Hisilicon Limited.
>  // Copyright (c) 2017-2018 Linaro Limited.
>  
>  #include <linux/bitops.h>
> @@ -83,7 +83,7 @@ static struct hi3660_mbox *to_hi3660_mbox(struct mbox_controller *mbox)
>         return container_of(mbox, struct hi3660_mbox, controller);
>  }
>  
> -static int hi3660_mbox_check_state(struct mbox_chan *chan)
> +static bool hi3660_mbox_last_tx_done(struct mbox_chan *chan)
>  {
>         unsigned long ch = (unsigned long)chan->con_priv;
>         struct hi3660_mbox *mbox = to_hi3660_mbox(chan->mbox);
> @@ -94,20 +94,20 @@ static int hi3660_mbox_check_state(struct mbox_chan *chan)
>  
>         /* Mailbox is ready to use */
>         if (readl(base + MBOX_MODE_REG) & MBOX_STATE_READY)
> -               return 0;
> +               return true;
>  
>         /* Wait for acknowledge from remote */
>         ret = readx_poll_timeout_atomic(readl, base + MBOX_MODE_REG,
>                         val, (val & MBOX_STATE_ACK), 1000, 300000);
>         if (ret) {
>                 dev_err(mbox->dev, "%s: timeout for receiving ack\n", __func__);
> -               return ret;
> +               return false;
>         }
>  
>         /* clear ack state, mailbox will get back to ready state */
>         writel(BIT(mchan->ack_irq), base + MBOX_ICLR_REG);
>  
> -       return 0;
> +       return true;
>  }
>  
>  static int hi3660_mbox_unlock(struct mbox_chan *chan)
> @@ -182,10 +182,6 @@ static int hi3660_mbox_send_data(struct mbox_chan *chan, void *msg)
>         unsigned int i;
>         int ret;
>  
> -       ret = hi3660_mbox_check_state(chan);

> -       if (ret)

> -               return ret;

> -

>         /* Clear mask for destination interrupt */

>         writel_relaxed(~BIT(mchan->dst_irq), base + MBOX_IMASK_REG);

>  

> @@ -207,6 +203,7 @@ static int hi3660_mbox_send_data(struct mbox_chan *chan, void *msg)

>  static const struct mbox_chan_ops hi3660_mbox_ops = {

>         .startup        = hi3660_mbox_startup,

>         .send_data      = hi3660_mbox_send_data,

> +       .last_tx_done   = hi3660_mbox_last_tx_done,

>  };

>  

>  static struct mbox_chan *hi3660_mbox_xlate(struct mbox_controller *controller,

> @@ -259,6 +256,7 @@ static int hi3660_mbox_probe(struct platform_device *pdev)

>         mbox->controller.num_chans = MBOX_CHAN_MAX;

>         mbox->controller.ops = &hi3660_mbox_ops;

>         mbox->controller.of_xlate = hi3660_mbox_xlate;

> +       mbox->controller.txdone_poll = true;

>  

>         /* Initialize mailbox channel data */

>         chan = mbox->chan;

> 

> -- 

> viresh
Viresh Kumar July 12, 2021, 4:14 a.m. | #10
On 09-07-21, 09:57, Ionela Voinescu wrote:
> On Thursday 08 Jul 2021 at 13:23:53 (+0530), Viresh Kumar wrote:

> > On 07-07-21, 11:24, Ionela Voinescu wrote:

> > > Now comes the interesting part: what seems to fix it is a call to

> > > clk_get_rate(opp_table->clk) in _set_opp(), which is what basically

> > > happened before this patch, as _find_current_opp() was always called.

> > > I do not need to do anything with the returned frequency.

> > 

> > Wow, thanks for narrowing it down this far :)

> > 

> > I had a quick look and this is what I think is the problem here.

> > 

> > This platform uses mailbox API to send its frequency change requests to another

> > processor.  And the way it is written currently, I don't see any guarantee

> > whatsoever which says

> > 

> >   "once clk_set_rate() returns, the frequency would have already changed".

> > 

> 

> I think what was strange to me was that the frequency never seems to

> change, there isn't just a delay in the new frequency taking effect, as

> I would expect in these cases. Or if there is a delay, that's quite large

> - at least a second.


No idea what the firmware is doing behind the scenes :)

> > And this may exactly be the thing you are able to hit, luckily because of this

> > patchset :)

> > 

> > As a quick way of checking if that is right or not, this may make it work:

> > 

> > diff --git a/drivers/mailbox/hi3660-mailbox.c b/drivers/mailbox/hi3660-mailbox.c

> > index 395ddc250828..9856c1c84dcf 100644

> > --- a/drivers/mailbox/hi3660-mailbox.c

> > +++ b/drivers/mailbox/hi3660-mailbox.c

> > @@ -201,6 +201,9 @@ static int hi3660_mbox_send_data(struct mbox_chan *chan, void *msg)

> > 

> >         /* Trigger data transferring */

> >         writel(BIT(mchan->ack_irq), base + MBOX_SEND_REG);

> > +

> > +       hi3660_mbox_check_state(chan);

> > +

> 

> I gave this a try and it does work for me.


Good, so that kind of proves what I was suspecting. The mailbox driver looks
buggy here.
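
To illustrate the suspicion, here is a minimal userspace sketch (not the hi3660
driver, and not kernel code) of why a sender that returns right after writing
the message can observe the old rate, while a sender that polls for the
acknowledgement cannot. All names below (model_chan, model_send,
model_send_blocking, model_last_tx_done) are hypothetical, and the remote is
simulated by a simple poll counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a mailbox channel talking to a remote processor. */
struct model_chan {
	bool busy;                  /* last message not yet acknowledged */
	int polls_until_ack;        /* polls the simulated remote still needs */
	unsigned long pending_rate; /* rate requested by the last message */
	unsigned long remote_rate;  /* rate the remote has actually applied */
};

/* Simulated completion check: the remote applies the rate when it acks. */
static bool model_last_tx_done(struct model_chan *c)
{
	if (!c->busy)
		return true;

	if (c->polls_until_ack > 0) {
		c->polls_until_ack--;
		return false;
	}

	c->remote_rate = c->pending_rate;
	c->busy = false;
	return true;
}

/* Fire and forget: returns before the remote has done anything. */
static void model_send(struct model_chan *c, unsigned long rate, int latency)
{
	c->pending_rate = rate;
	c->polls_until_ack = latency;
	c->busy = true;
}

/* Blocking variant: poll for completion, give up after max_polls attempts. */
static int model_send_blocking(struct model_chan *c, unsigned long rate,
			       int latency, int max_polls)
{
	model_send(c, rate, latency);

	while (max_polls-- > 0) {
		if (model_last_tx_done(c))
			return 0;
	}

	return -ETIMEDOUT;
}
```

In this model, a caller of model_send() that reads remote_rate straight away
still sees the previous value, which is the behaviour being suspected here. The
blocking variant only returns 0 once the simulated remote has applied the rate.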

> > -------------------------8<-------------------------

> > 

> > As a proper fix, something like this (not even compile tested) is required I

> > believe as I don't see the clients would know if the transfer is over. Cc'ing

> > mailbox guys to see what can be done.

> > 

> 

> I'll give this a try as well when there is consensus. I might even try to

> review it, if time allows.


Sure, let's see what the platform guys think about this first.

Kevin, Kaihua ?

-- 
viresh

Patch

diff --git a/drivers/opp/core.c b/drivers/opp/core.c
index cb5b67ccf5cf..4ee598344e6a 100644
--- a/drivers/opp/core.c
+++ b/drivers/opp/core.c
@@ -788,8 +788,7 @@  static int _generic_set_opp_regulator(struct opp_table *opp_table,
 			__func__, old_freq);
 restore_voltage:
 	/* This shouldn't harm even if the voltages weren't updated earlier */
-	if (old_supply)
-		_set_opp_voltage(dev, reg, old_supply);
+	_set_opp_voltage(dev, reg, old_supply);
 
 	return ret;
 }
@@ -839,10 +838,7 @@  static int _set_opp_custom(const struct opp_table *opp_table,
 
 	data->old_opp.rate = old_freq;
 	size = sizeof(*old_supply) * opp_table->regulator_count;
-	if (!old_supply)
-		memset(data->old_opp.supplies, 0, size);
-	else
-		memcpy(data->old_opp.supplies, old_supply, size);
+	memcpy(data->old_opp.supplies, old_supply, size);
 
 	data->new_opp.rate = freq;
 	memcpy(data->new_opp.supplies, new_supply, size);
@@ -943,6 +939,31 @@  int dev_pm_opp_set_bw(struct device *dev, struct dev_pm_opp *opp)
 }
 EXPORT_SYMBOL_GPL(dev_pm_opp_set_bw);
 
+static void _find_current_opp(struct device *dev, struct opp_table *opp_table)
+{
+	struct dev_pm_opp *opp = ERR_PTR(-ENODEV);
+	unsigned long freq;
+
+	if (!IS_ERR(opp_table->clk)) {
+		freq = clk_get_rate(opp_table->clk);
+		opp = _find_freq_ceil(opp_table, &freq);
+	}
+
+	/*
+	 * Unable to find the current OPP ? Pick the first from the list since
+	 * it is in ascending order, otherwise rest of the code will need to
+	 * make special checks to validate current_opp.
+	 */
+	if (IS_ERR(opp)) {
+		mutex_lock(&opp_table->lock);
+		opp = list_first_entry(&opp_table->opp_list, struct dev_pm_opp, node);
+		dev_pm_opp_get(opp);
+		mutex_unlock(&opp_table->lock);
+	}
+
+	opp_table->current_opp = opp;
+}
+
 static int _disable_opp_table(struct device *dev, struct opp_table *opp_table)
 {
 	int ret;
@@ -1004,16 +1025,6 @@  int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
 	if ((long)freq <= 0)
 		freq = target_freq;
 
-	old_freq = clk_get_rate(opp_table->clk);
-
-	/* Return early if nothing to do */
-	if (opp_table->enabled && old_freq == freq) {
-		dev_dbg(dev, "%s: old/new frequencies (%lu Hz) are same, nothing to do\n",
-			__func__, freq);
-		ret = 0;
-		goto put_opp_table;
-	}
-
 	/*
 	 * For IO devices which require an OPP on some platforms/SoCs
 	 * while just needing to scale the clock on some others
@@ -1026,12 +1037,9 @@  int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
 		goto put_opp_table;
 	}
 
-	temp_freq = old_freq;
-	old_opp = _find_freq_ceil(opp_table, &temp_freq);
-	if (IS_ERR(old_opp)) {
-		dev_err(dev, "%s: failed to find current OPP for freq %lu (%ld)\n",
-			__func__, old_freq, PTR_ERR(old_opp));
-	}
+	/* Find the currently set OPP if we don't know already */
+	if (unlikely(!opp_table->current_opp))
+		_find_current_opp(dev, opp_table);
 
 	temp_freq = freq;
 	opp = _find_freq_ceil(opp_table, &temp_freq);
@@ -1039,7 +1047,17 @@  int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
 		ret = PTR_ERR(opp);
 		dev_err(dev, "%s: failed to find OPP for freq %lu (%d)\n",
 			__func__, freq, ret);
-		goto put_old_opp;
+		goto put_opp_table;
+	}
+
+	old_opp = opp_table->current_opp;
+	old_freq = old_opp->rate;
+
+	/* Return early if nothing to do */
+	if (opp_table->enabled && old_opp == opp) {
+		dev_dbg(dev, "%s: OPPs are same, nothing to do\n", __func__);
+		ret = 0;
+		goto put_opp;
 	}
 
 	dev_dbg(dev, "%s: switching OPP: %lu Hz --> %lu Hz\n", __func__,
@@ -1054,11 +1072,10 @@  int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
 
 	if (opp_table->set_opp) {
 		ret = _set_opp_custom(opp_table, dev, old_freq, freq,
-				      IS_ERR(old_opp) ? NULL : old_opp->supplies,
-				      opp->supplies);
+				      old_opp->supplies, opp->supplies);
 	} else if (opp_table->regulators) {
 		ret = _generic_set_opp_regulator(opp_table, dev, old_freq, freq,
-						 IS_ERR(old_opp) ? NULL : old_opp->supplies,
+						 old_opp->supplies,
 						 opp->supplies);
 	} else {
 		/* Only frequency scaling */
@@ -1074,15 +1091,18 @@  int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq)
 
 	if (!ret) {
 		ret = _set_opp_bw(opp_table, opp, dev, false);
-		if (!ret)
+		if (!ret) {
 			opp_table->enabled = true;
+			dev_pm_opp_put(old_opp);
+
+			/* Make sure current_opp doesn't get freed */
+			dev_pm_opp_get(opp);
+			opp_table->current_opp = opp;
+		}
 	}
 
 put_opp:
 	dev_pm_opp_put(opp);
-put_old_opp:
-	if (!IS_ERR(old_opp))
-		dev_pm_opp_put(old_opp);
 put_opp_table:
 	dev_pm_opp_put_opp_table(opp_table);
 	return ret;
@@ -1276,6 +1296,9 @@  static void _opp_table_kref_release(struct kref *kref)
 	list_del(&opp_table->node);
 	mutex_unlock(&opp_table_lock);
 
+	if (opp_table->current_opp)
+		dev_pm_opp_put(opp_table->current_opp);
+
 	_of_clear_opp_table(opp_table);
 
 	/* Release clk */
diff --git a/drivers/opp/opp.h b/drivers/opp/opp.h
index 4408cfcb0f31..359fd89d5770 100644
--- a/drivers/opp/opp.h
+++ b/drivers/opp/opp.h
@@ -135,6 +135,7 @@  enum opp_table_access {
  * @clock_latency_ns_max: Max clock latency in nanoseconds.
  * @parsed_static_opps: Count of devices for which OPPs are initialized from DT.
  * @shared_opp: OPP is shared between multiple devices.
+ * @current_opp: Currently configured OPP for the table.
  * @suspend_opp: Pointer to OPP to be used during device suspend.
  * @genpd_virt_dev_lock: Mutex protecting the genpd virtual device pointers.
  * @genpd_virt_devs: List of virtual devices for multiple genpd support.
@@ -183,6 +184,7 @@  struct opp_table {
 
 	unsigned int parsed_static_opps;
 	enum opp_table_access shared_opp;
+	struct dev_pm_opp *current_opp;
 	struct dev_pm_opp *suspend_opp;
 
 	struct mutex genpd_virt_dev_lock;
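
The reference handling that the commit message describes (the table keeps one
reference on current_opp and drops it when switching or when the table is
released) can be sketched in plain userspace C. This is only a model under
stated assumptions: the model_* names are hypothetical, a plain int stands in
for the kernel's kref, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct dev_pm_opp and struct opp_table. */
struct model_opp {
	unsigned long rate;
	int refcount;
};

struct model_table {
	struct model_opp *current_opp;
};

static void model_opp_get(struct model_opp *opp)
{
	opp->refcount++;
}

static void model_opp_put(struct model_opp *opp)
{
	assert(opp->refcount > 0);
	opp->refcount--;
}

/* Hand the table's reference over from the old OPP to the new one. */
static void model_set_current(struct model_table *table, struct model_opp *opp)
{
	/*
	 * Taking the new reference before dropping the old one is a choice
	 * made for this sketch: it keeps the count above zero even if both
	 * pointers are the same OPP.
	 */
	model_opp_get(opp);
	if (table->current_opp)
		model_opp_put(table->current_opp);
	table->current_opp = opp;
}

/* Mirrors the release path: drop the table's reference, if it holds one. */
static void model_table_release(struct model_table *table)
{
	if (table->current_opp) {
		model_opp_put(table->current_opp);
		table->current_opp = NULL;
	}
}
```

The patch itself puts the old OPP before getting the new one, which is safe
there because the "OPPs are same" case returns early. The sketch orders the
calls the other way round only so that it does not depend on that early return.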