[V3,0/4] cpufreq: cppc: Add support for frequency invariance

Message ID	cover.1624266901.git.viresh.kumar@linaro.org
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: domain of linux-pm-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; From: Viresh Kumar <viresh.kumar@linaro.org> To: Rafael Wysocki <rjw@rjwysocki.net>, Ionela Voinescu <ionela.voinescu@arm.com>, Ben Segall <bsegall@google.com>, Daniel Bristot de Oliveira <bristot@redhat.com>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Ingo Molnar <mingo@redhat.com>, Juri Lelli <juri.lelli@redhat.com>, Mel Gorman <mgorman@suse.de>, Peter Zijlstra <peterz@infradead.org>, "Rafael J. Wysocki" <rafael@kernel.org>, Steven Rostedt <rostedt@goodmis.org>, Sudeep Holla <sudeep.holla@arm.com>, Vincent Guittot <vincent.guittot@linaro.org>, Viresh Kumar <viresh.kumar@linaro.org>, Will Deacon <will@kernel.org> Cc: linux-pm@vger.kernel.org, Qian Cai <quic_qiancai@quicinc.com>, linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org, "Paul E. McKenney" <paulmck@kernel.org>, "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Subject: [PATCH V3 0/4] cpufreq: cppc: Add support for frequency invariance Date: Mon, 21 Jun 2021 14:49:33 +0530 Message-Id: <cover.1624266901.git.viresh.kumar@linaro.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	cpufreq: cppc: Add support for frequency invariance \| expand [V3,0/4] cpufreq: cppc: Add support for frequency invariance [V3,1/4] cpufreq: cppc: Fix potential memleak in cppc_cpufreq_cpu_init [V3,2/4] cpufreq: cppc: Pass structure instance by reference [V3,3/4] arch_topology: Avoid use-after-free for scale_freq_data [V3,4/4] cpufreq: CPPC: Add support for frequency invariance

Viresh Kumar June 21, 2021, 9:19 a.m. UTC

Hello,

Changes since V2:

- We don't need start_cpu() and stop_cpu() callbacks anymore, we can make it
  work using policy ->init() and exit() alone.

- Two new cleanup patches 1/4 and 2/4.

- Improved commit log of 3/4.

- Dropped WARN_ON(local_freq_scale > 1024), since this can occur on counter's
  overlap (seen with Vincent's setup).

- Handle stuff from init/exit() callbacks only.

Changes since V1:

- Few of the patches migrating users to ->exit() callback are posted separately.

- The CPPC patch was completely reverted and so the support for FIE is again
  added here from scratch.

- The start_cpu() and stop_cpu() interface is reworked a little so stop_cpu() is
  only ever called for a CPU if start_cpu() was called for it earlier.

- A new patch to implement RCU locking in arch_topology core to avoid some
  races.

- Some cleanup and very clear/separate paths for FIE in cppc driver now.


-------------------------8<-------------------------

CPPC cpufreq driver is used for ARM servers and this patch series tries to
provide counter-based frequency invariance support for them in the absence for
architecture specific counters (like AMUs).

This was reverted earlier for the 5.13 kernel after Qian Cai reported kernel
oops during suspend/resume.

This is based of v5.13-rc7 + a cleanup patchset:

https://lore.kernel.org/linux-pm/cover.1623825358.git.viresh.kumar@linaro.org/

All the patches are pushed here together for people to run.

https://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git/log/?h=cpufreq/cppc

This is tested on my Hikey platform (without the actual read/write to
performance counters), with this script for over an hour:

while true; do
    for i in `seq 1 7`;
    do
        echo 0 > /sys/devices/system/cpu/cpu$i/online;
    done;

    for i in `seq 1 7`;
    do
        echo 1 > /sys/devices/system/cpu/cpu$i/online;
    done;
done


The same is done by Vincent on ThunderX2 and no issues were seen.

I would like to get this merged for 5.14, since it was recently reverted from
5.13. And that it is still an independent change to a single driver and topology
APIs that no one is using apart from arm64 topology stuff.

Thanks.

--
Viresh

Viresh Kumar (4):
  cpufreq: cppc: Fix potential memleak in cppc_cpufreq_cpu_init
  cpufreq: cppc: Pass structure instance by reference
  arch_topology: Avoid use-after-free for scale_freq_data
  cpufreq: CPPC: Add support for frequency invariance

 drivers/base/arch_topology.c   |  27 +++-
 drivers/cpufreq/Kconfig.arm    |  10 ++
 drivers/cpufreq/cppc_cpufreq.c | 287 +++++++++++++++++++++++++++++----
 include/linux/arch_topology.h  |   1 +
 kernel/sched/core.c            |   1 +
 5 files changed, 292 insertions(+), 34 deletions(-)

-- 
2.31.1.272.g89b43f80a514

Qian Cai June 21, 2021, 8:48 p.m. UTC | #1

On 6/21/2021 5:19 AM, Viresh Kumar wrote:
> CPPC cpufreq driver is used for ARM servers and this patch series tries to

> provide counter-based frequency invariance support for them in the absence for

> architecture specific counters (like AMUs).


Viresh, this series works fine on my quick tests so far. BTW, I noticed some strange things even with the series applied mentioned below when reading acpi_cppc vs cpufreq sysfs. Do you happen to know are those just hardware/firmware issues because Linux just faithfully exported the register values?

== Arm64 server Foo ==
CPU max MHz:                     3000.0000
CPU min MHz:                     1000.0000

/sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf
300
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq
1000
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_nonlinear_perf
200
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf
100
/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq <--- should be 3000?
2800
/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf <--- should be 300?
280
/sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf
100

== Arm64 server Bar ==
CPU max MHz:                     3000.0000
CPU min MHz:                     375.0000

/sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf <--- should be 3000? There is no cpufreq boost.
3300
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq <--- don't understand why 0.
0
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_nonlinear_perf
375
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf
375
/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq <--- ditto
0
/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf
3000
/sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf
100

Viresh Kumar June 22, 2021, 6:52 a.m. UTC | #2

On 21-06-21, 16:48, Qian Cai wrote:
> Viresh, this series works fine on my quick tests so far.


Thanks for testing.

> BTW, I

> noticed some strange things even with the series applied mentioned

> below when reading acpi_cppc vs cpufreq sysfs. Do you happen to know

> are those just hardware/firmware issues because Linux just

> faithfully exported the register values?


The values are exported by drivers/acpi/cppc_acpi.c I believe and they
look to be based on simple register reads.

> == Arm64 server Foo ==

> CPU max MHz:                     3000.0000

> CPU min MHz:                     1000.0000

> 

> /sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf

> 300

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq

> 1000

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_nonlinear_perf

> 200

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf

> 100

> /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq <--- should be 3000?

> 2800

> /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf <--- should be 300?

> 280


nominal-perf is max perf, and highest-perf is boost-perf. Same goes
with nominal-freq (i.e. policy->max).

So 280 and 2800 look to be the correct values, 300 and 3000 come with
boost enabled. Look at the first entry, highest_perf.

> /sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf

> 100

> 

> == Arm64 server Bar ==

> CPU max MHz:                     3000.0000

> CPU min MHz:                     375.0000

> 

> /sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf <--- should be 3000? There is no cpufreq boost.

> 3300


This isn't exported by cpufreq driver but acpi, and it just exports
hardware values of highest_perf (with boost i.e.). cpufreq may or
may not use this to support boost.

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq <--- don't understand why 0.

> 0


Because corresponding hardware registers aren't implemented for your
platform, this is the function that reads these registers:

int cppc_get_perf_caps(int cpunum, struct cppc_perf_caps *perf_caps)
{
        ...

	/* Read optional lowest and nominal frequencies if present */
	if (CPC_SUPPORTED(low_freq_reg))
		cpc_read(cpunum, low_freq_reg, &low_f);

	if (CPC_SUPPORTED(nom_freq_reg))
		cpc_read(cpunum, nom_freq_reg, &nom_f);

	perf_caps->lowest_freq = low_f;
	perf_caps->nominal_freq = nom_f;
}

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_nonlinear_perf

> 375

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf

> 375

> /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq <--- ditto

> 0

> /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf

> 3000

> /sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf

> 100


-- 
viresh

Viresh Kumar June 23, 2021, 4:16 a.m. UTC | #3

On 21-06-21, 16:48, Qian Cai wrote:
> 
> 
> On 6/21/2021 5:19 AM, Viresh Kumar wrote:
> > CPPC cpufreq driver is used for ARM servers and this patch series tries to
> > provide counter-based frequency invariance support for them in the absence for
> > architecture specific counters (like AMUs).
> 
> Viresh, this series works fine on my quick tests so far.

Do you want me to add your Tested-by for the series ?

Qian Cai June 23, 2021, 12:57 p.m. UTC | #4

On 6/23/2021 12:16 AM, Viresh Kumar wrote:
> On 21-06-21, 16:48, Qian Cai wrote:

>>

>>

>> On 6/21/2021 5:19 AM, Viresh Kumar wrote:

>>> CPPC cpufreq driver is used for ARM servers and this patch series tries to

>>> provide counter-based frequency invariance support for them in the absence for

>>> architecture specific counters (like AMUs).

>>

>> Viresh, this series works fine on my quick tests so far.

> 

> Do you want me to add your Tested-by for the series ?

Viresh, I am afraid I don't feel comfortable yet. I have a few new tests in development, and will provide an update once ready. Also, I noticed the delivered perf is even smaller than lowest_perf (100).

# cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs
 ref:103377547901 del:54540736873
# cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs
 ref:103379170101 del:54541599117

100 * (54541599117 - 54540736873) / (103379170101 - 103377547901) = 53

My understanding is that the delivered perf should fail into the range between lowest_perf and highest_perf. Is that assumption correct? This happens on 5.4-based kernel, so I am in process running your series on that system to see if there is any differences. In any case, if it is a bug it is pre-existing, but I'd like to understand a bit better in that front first.

Viresh Kumar June 24, 2021, 2:54 a.m. UTC | #5

On 23-06-21, 08:57, Qian Cai wrote:
> Viresh, I am afraid I don't feel comfortable yet. I have a few new tests in

> development, and will provide an update once ready.


Oh sure, np.

> Also, I noticed the delivered perf is even smaller than lowest_perf (100).


> # cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs

>  ref:103377547901 del:54540736873

> # cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs

>  ref:103379170101 del:54541599117

> 

> 100 * (54541599117 - 54540736873) / (103379170101 - 103377547901) = 53

> 

> My understanding is that the delivered perf should fail into the range between

> lowest_perf and highest_perf. Is that assumption correct? This happens on

> 5.4-based kernel, so I am in process running your series on that system to see

> if there is any differences. In any case, if it is a bug it is pre-existing,

> but I'd like to understand a bit better in that front first.


Vincent:

Can that happen because of CPU idle ?

-- 
viresh

Vincent Guittot June 24, 2021, 9:49 a.m. UTC | #6

On Thu, 24 Jun 2021 at 04:54, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>

> On 23-06-21, 08:57, Qian Cai wrote:

> > Viresh, I am afraid I don't feel comfortable yet. I have a few new tests in

> > development, and will provide an update once ready.

>

> Oh sure, np.

>

> > Also, I noticed the delivered perf is even smaller than lowest_perf (100).

>

> > # cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs

> >  ref:103377547901 del:54540736873

> > # cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs

> >  ref:103379170101 del:54541599117

> >

> > 100 * (54541599117 - 54540736873) / (103379170101 - 103377547901) = 53


I'm not sure that I understand your point. The formula above says that
cpu8 run @ 53% of nominal performance

> >

> > My understanding is that the delivered perf should fail into the range between

> > lowest_perf and highest_perf. Is that assumption correct? This happens on

> > 5.4-based kernel, so I am in process running your series on that system to see

> > if there is any differences. In any case, if it is a bug it is pre-existing,

> > but I'd like to understand a bit better in that front first.

>

> Vincent:

>

> Can that happen because of CPU idle ?

>

> --

> viresh

Vincent Guittot June 24, 2021, 11:15 a.m. UTC | #7

On Thu, 24 Jun 2021 at 12:48, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>

> Hi guys,

>

> On Thursday 24 Jun 2021 at 11:49:53 (+0200), Vincent Guittot wrote:

> > On Thu, 24 Jun 2021 at 04:54, Viresh Kumar <viresh.kumar@linaro.org> wrote:

> > >

> > > On 23-06-21, 08:57, Qian Cai wrote:

> > > > Viresh, I am afraid I don't feel comfortable yet. I have a few new tests in

> > > > development, and will provide an update once ready.

> > >

> > > Oh sure, np.

> > >

> > > > Also, I noticed the delivered perf is even smaller than lowest_perf (100).

> > >

> > > > # cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs

> > > >  ref:103377547901 del:54540736873

> > > > # cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs

> > > >  ref:103379170101 del:54541599117

> > > >

> > > > 100 * (54541599117 - 54540736873) / (103379170101 - 103377547901) = 53

> >

> > I'm not sure that I understand your point. The formula above says that

> > cpu8 run @ 53% of nominal performance

> >

>

> I think this is based on a previous example Qian had where:

>

> /sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf

> 300

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq

> 1000

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf

> 100

> /sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf

> 100

>

> ..so the 100 is not from obtaining percentage, is the reference

> performance.

>

> The logic of the formula is to obtain the delivered performance when

> knowing the number of ticks for each counter, so:

>

> So if one gets (103379170101 - 103377547901) ticks for the counter at

> running at 1GHz(perf 100), what is the frequency of the core, if its

> counter ticked (54541599117 - 54540736873) times in the same interval

> of time?

>

> The answer is 530MHz(perf 53), which is lower than the lowest frequency

> at 1GHz(perf 100).


But the nominal_perf is 280 and not 100 if i'm not wrong so the perf
value is 148 > lowest_perf in this case


>

>

> > > >

> > > > My understanding is that the delivered perf should fail into the range between

> > > > lowest_perf and highest_perf. Is that assumption correct? This happens on

> > > > 5.4-based kernel, so I am in process running your series on that system to see

> > > > if there is any differences. In any case, if it is a bug it is pre-existing,

> > > > but I'd like to understand a bit better in that front first.

> > >

> > > Vincent:

> > >

> > > Can that happen because of CPU idle ?

> > >

>

> Not if the counters are implemented properly. The kernel considers that

> both reference and delivered performance counters should stop or reset

> during idle. The kernel would not account for idle itself.

>

> If the reference performance counter does not stop during idle, while

> the core performance counter (delivered) does stop, the behavior above

> should be seen very often.

>

> Qian, do you see these small delivered performance values often or

> seldom?

>

> Thanks,

> Ionela.

>

> > > --

> > > viresh

Vincent Guittot June 24, 2021, 11:59 a.m. UTC | #8

On Thu, 24 Jun 2021 at 13:23, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> On Thursday 24 Jun 2021 at 13:15:04 (+0200), Vincent Guittot wrote:
> > On Thu, 24 Jun 2021 at 12:48, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> > >
> > > Hi guys,
> > >
> > > On Thursday 24 Jun 2021 at 11:49:53 (+0200), Vincent Guittot wrote:
> > > > On Thu, 24 Jun 2021 at 04:54, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > > > >
> > > > > On 23-06-21, 08:57, Qian Cai wrote:
> > > > > > Viresh, I am afraid I don't feel comfortable yet. I have a few new tests in
> > > > > > development, and will provide an update once ready.
> > > > >
> > > > > Oh sure, np.
> > > > >
> > > > > > Also, I noticed the delivered perf is even smaller than lowest_perf (100).
> > > > >
> > > > > > # cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs
> > > > > >  ref:103377547901 del:54540736873
> > > > > > # cat /sys/devices/system/cpu/cpu8/acpi_cppc/feedback_ctrs
> > > > > >  ref:103379170101 del:54541599117
> > > > > >
> > > > > > 100 * (54541599117 - 54540736873) / (103379170101 - 103377547901) = 53
> > > >
> > > > I'm not sure that I understand your point. The formula above says that
> > > > cpu8 run @ 53% of nominal performance
> > > >
> > >
> > > I think this is based on a previous example Qian had where:
> > >
> > > /sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf
> > > 300
> > > /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq
> > > 1000
> > > /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf
> > > 100
> > > /sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf
> > > 100
> > >
> > > ..so the 100 is not from obtaining percentage, is the reference
> > > performance.
> > >
> > > The logic of the formula is to obtain the delivered performance when
> > > knowing the number of ticks for each counter, so:
> > >
> > > So if one gets (103379170101 - 103377547901) ticks for the counter at
> > > running at 1GHz(perf 100), what is the frequency of the core, if its
> > > counter ticked (54541599117 - 54540736873) times in the same interval
> > > of time?
> > >
> > > The answer is 530MHz(perf 53), which is lower than the lowest frequency
> > > at 1GHz(perf 100).
> >
> > But the nominal_perf is 280 and not 100 if i'm not wrong so the perf
> > value is 148 > lowest_perf in this case
> >
>
> Nominal performance has no meaning here. The reference counter ticks
> with the frequency equivalent to reference performance.
>
> Nominal performance is the maximum performance when !boost. Highest
> performance is the maximum performance available including boost
> frequencies. So nominal performance has no impact in these translations
> from counter values to delivered performance.

my bad, nominal_perf == reference_perf on the systems that I have locally

>
> Hope it helps,
> Ionela.
>
> >
> > >
> > >
> > > > > >
> > > > > > My understanding is that the delivered perf should fail into the range between
> > > > > > lowest_perf and highest_perf. Is that assumption correct? This happens on
> > > > > > 5.4-based kernel, so I am in process running your series on that system to see
> > > > > > if there is any differences. In any case, if it is a bug it is pre-existing,
> > > > > > but I'd like to understand a bit better in that front first.
> > > > >
> > > > > Vincent:
> > > > >
> > > > > Can that happen because of CPU idle ?
> > > > >
> > >
> > > Not if the counters are implemented properly. The kernel considers that
> > > both reference and delivered performance counters should stop or reset
> > > during idle. The kernel would not account for idle itself.
> > >
> > > If the reference performance counter does not stop during idle, while
> > > the core performance counter (delivered) does stop, the behavior above
> > > should be seen very often.
> > >
> > > Qian, do you see these small delivered performance values often or
> > > seldom?
> > >
> > > Thanks,
> > > Ionela.
> > >
> > > > > --
> > > > > viresh

Qian Cai June 24, 2021, 3:17 p.m. UTC | #9

On 6/24/2021 6:48 AM, Ionela Voinescu wrote:
> Not if the counters are implemented properly. The kernel considers that

> both reference and delivered performance counters should stop or reset

> during idle. The kernel would not account for idle itself.

> 

> If the reference performance counter does not stop during idle, while

> the core performance counter (delivered) does stop, the behavior above

> should be seen very often.

> 

> Qian, do you see these small delivered performance values often or

> seldom?

Ionela, so I managed to upgrade the kernel on the system to today's linux-next which suppose to include this series. The delivered perf is now 280. However, scaling_min_freq (200 MHz) is not equal to lowest_perf (100).

scaling_driver: acpi_cppc
scaling_governor: schedutil

Is that normal because lowest_nonlinear_perf is 200? 

Also, on this pretty idle system, 158 of 160 CPUs are always running in max freq (280 MHz). The other 2 are running in 243 and 213 MHz according to scaling_cur_freq. Apparently, "schedutil" does not work proper on this system. I am going to try other governors to narrow down the issue a bit.

FYI, here is the acpi_cppc registers reading:

/sys/devices/system/cpu/cpu0/acpi_cppc/feedback_ctrs
ref:160705801 del:449594095
/sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf
300
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq
1000
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_nonlinear_perf
200
/sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf
100
/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq
2800
/sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf
280
/sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf
100
/sys/devices/system/cpu/cpu0/acpi_cppc/wraparound_time
18446744073709551615

Qian Cai June 24, 2021, 8:44 p.m. UTC | #10

On 6/24/2021 6:48 AM, Ionela Voinescu wrote: 
> Not if the counters are implemented properly. The kernel considers that

> both reference and delivered performance counters should stop or reset

> during idle. The kernel would not account for idle itself.

> 

> If the reference performance counter does not stop during idle, while

> the core performance counter (delivered) does stop, the behavior above

> should be seen very often.

> 

> Qian, do you see these small delivered performance values often or

> seldom?

FYI, the latest data point it that on the new kernel, the delivered performance does match the cpufreq_cur_freq. IOW, feedback_ctrs works fine. Also, "powersave" governor could bring down the scaling_cur_freq to scaling_min_freq. Right now, looks like the puzzles on this particular system as mentioned in the previous post are,

1) lowest_freq/lowest_perf != scaling_min_freq
2) CPPC + schedutil is not able to scale down CPUs.

Ionela Voinescu June 25, 2021, 10:21 a.m. UTC | #11

Hi Qian,

On Thursday 24 Jun 2021 at 11:17:55 (-0400), Qian Cai wrote:
> 

> 

> On 6/24/2021 6:48 AM, Ionela Voinescu wrote:

> > Not if the counters are implemented properly. The kernel considers that

> > both reference and delivered performance counters should stop or reset

> > during idle. The kernel would not account for idle itself.

> > 

> > If the reference performance counter does not stop during idle, while

> > the core performance counter (delivered) does stop, the behavior above

> > should be seen very often.

> > 

> > Qian, do you see these small delivered performance values often or

> > seldom?

> 

> Ionela, so I managed to upgrade the kernel on the system to today's

> linux-next which suppose to include this series. The delivered perf

> is now 280. However, scaling_min_freq (200 MHz) is not equal to

> lowest_perf (100).

> 

> scaling_driver: acpi_cppc

                  ^^^^^^^^^
I suppose you mean "cppc-cpufreq"?

"acpi_cppc" is not a scaling driver option.


> scaling_governor: schedutil

> 

> Is that normal because lowest_nonlinear_perf is 200? 

> 


Yes, that is right : [1]

> Also, on this pretty idle system, 158 of 160 CPUs are always running

> in max freq (280 MHz). The other 2 are running in 243 and 213 MHz

> according to scaling_cur_freq. Apparently, "schedutil" does not work

> proper on this system. I am going to try other governors to narrow

> down the issue a bit.


So your CPUs run at frequencies between 200MHz and 280MHz?
Based on your acpi_cppc information below I would have assumed 2GHz as
lowest nonlinear and 2.8GHz as nominal. The reason for this is that
according to the ACPI spec the frequency values in the _CPC objects are
supposed to be in MHz, so 2800 MHz for nominal frequency would be
2.8GHz.

When you try more governors, make sure to check out the difference
between scaling_cur_freq and cpuinfo_cur_freq at [2]. The first gives
you the frequency that the governor (schedutil) is asking for, while the
second is giving you the current frequency obtained from the counters.

So to check the actual frequency the cores are running at, please check
cpuinfo_cur_freq.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/cpufreq/cppc_cpufreq.c?h=v5.13-rc7#n296
[2] https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt

Thanks,
Ionela.

> 

> FYI, here is the acpi_cppc registers reading:

> 

> /sys/devices/system/cpu/cpu0/acpi_cppc/feedback_ctrs

> ref:160705801 del:449594095

> /sys/devices/system/cpu/cpu0/acpi_cppc/highest_perf

> 300

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq

> 1000

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_nonlinear_perf

> 200

> /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_perf

> 100

> /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq

> 2800

> /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_perf

> 280

> /sys/devices/system/cpu/cpu0/acpi_cppc/reference_perf

> 100

> /sys/devices/system/cpu/cpu0/acpi_cppc/wraparound_time

> 18446744073709551615

Qian Cai June 25, 2021, 1:31 p.m. UTC | #12

On 6/25/2021 6:21 AM, Ionela Voinescu wrote:
>> scaling_driver: acpi_cppc
>                   ^^^^^^^^^
> I suppose you mean "cppc-cpufreq"?
> 
> "acpi_cppc" is not a scaling driver option.

Ionela, yes. Sorry about that.

> So your CPUs run at frequencies between 200MHz and 280MHz?

2000 to 2800 MHz.

> Based on your acpi_cppc information below I would have assumed 2GHz as
> lowest nonlinear and 2.8GHz as nominal. The reason for this is that
> according to the ACPI spec the frequency values in the _CPC objects are
> supposed to be in MHz, so 2800 MHz for nominal frequency would be
> 2.8GHz.
> 
> When you try more governors, make sure to check out the difference
> between scaling_cur_freq and cpuinfo_cur_freq at [2]. The first gives
> you the frequency that the governor (schedutil) is asking for, while the
> second is giving you the current frequency obtained from the counters.
> 
> So to check the actual frequency the cores are running at, please check
> cpuinfo_cur_freq.

The problem is that all CPUs are never scaling down. "cpuinfo_cur_freq" and "scaling_cur_freq" are always the 2800 MHz on all CPUs on this idle system. This looks like a regression somewhere as in 5.4-based kernel, I can see "cpuinfo_cur_freq" can go down to 2000 MHz in the same scenario. I'll bisect a bit unless you have better ideas?

Ionela Voinescu June 25, 2021, 2:37 p.m. UTC | #13

Hey,

On Friday 25 Jun 2021 at 09:31:58 (-0400), Qian Cai wrote:
> 

> 

> On 6/25/2021 6:21 AM, Ionela Voinescu wrote:

> >> scaling_driver: acpi_cppc

> >                   ^^^^^^^^^

> > I suppose you mean "cppc-cpufreq"?

> > 

> > "acpi_cppc" is not a scaling driver option.

> 

> Ionela, yes. Sorry about that.

> 

> > So your CPUs run at frequencies between 200MHz and 280MHz?

> 

> 2000 to 2800 MHz.

> 

Thank you for the clarification.

> > Based on your acpi_cppc information below I would have assumed 2GHz as

> > lowest nonlinear and 2.8GHz as nominal. The reason for this is that

> > according to the ACPI spec the frequency values in the _CPC objects are

> > supposed to be in MHz, so 2800 MHz for nominal frequency would be

> > 2.8GHz.

> > 

> > When you try more governors, make sure to check out the difference

> > between scaling_cur_freq and cpuinfo_cur_freq at [2]. The first gives

> > you the frequency that the governor (schedutil) is asking for, while the

> > second is giving you the current frequency obtained from the counters.

> > 

> > So to check the actual frequency the cores are running at, please check

> > cpuinfo_cur_freq.

> 

> The problem is that all CPUs are never scaling down. "cpuinfo_cur_freq"

> and "scaling_cur_freq" are always the 2800 MHz on all CPUs on this idle

> system. This looks like a regression somewhere as in 5.4-based kernel,

> I can see "cpuinfo_cur_freq" can go down to 2000 MHz in the same

> scenario. I'll bisect a bit unless you have better ideas?

Quick questions for you:

1. When you say you tried a 5.4 kernel, did you try it with these
patches backported? They also have some dependencies with the recent
changes in the arch topology driver and cpufreq so they would not be
straight forward to backport.

If the 5.4 kernel you tried did not have these patches, it might be best
to try next/master that has these patches, but with
CONFIG_ACPI_CPPC_CPUFREQ_FIE=n, just to eliminate the possibility that
an incorrect frequency scale factor here would affect utilization that
would then affect the schedutil frequency selection. I would not expect
this behavior even if the scale factor was wrong, but it would be good
to rule out.

2. Is your platform booting with all CPUs? Are any hotplug operations
done in your scenario?

Thanks,
Ionela.

Qian Cai June 25, 2021, 4:56 p.m. UTC | #14

On 6/25/2021 10:37 AM, Ionela Voinescu wrote:
> Quick questions for you:

> 

> 1. When you say you tried a 5.4 kernel, did you try it with these

> patches backported? They also have some dependencies with the recent

> changes in the arch topology driver and cpufreq so they would not be

> straight forward to backport.

No. It turned out that this 5.4-based kernel has "ondemand" governor by default which works fine which could even scale down to the lowest_perf (1000 MHz). Once switched the governor to "schedutil", it could keep the freq to the lowest. While on the latest kernel, it also works fine by using "ondemand" first and then switch to "schedutil". Even though it can only scale down to lowest_nonlinear_perf (2000 MHz). It is more of that using "schedutil" by default would not work. Also, on the latest kernel, even "userspace" governor only allows to scale down to 2000 MHz.

> If the 5.4 kernel you tried did not have these patches, it might be best

> to try next/master that has these patches, but with

> CONFIG_ACPI_CPPC_CPUFREQ_FIE=n, just to eliminate the possibility that

> an incorrect frequency scale factor here would affect utilization that

> would then affect the schedutil frequency selection. I would not expect

> this behavior even if the scale factor was wrong, but it would be good

> to rule out.

I'll try that at least see if CONFIG_ACPI_CPPC_CPUFREQ_FIE=n would make the latest kernel to be able to scale down to 1000 MHz.

> 2. Is your platform booting with all CPUs? Are any hotplug operations

> done in your scenario?

Yes, booting with all CPUs. No additional hotplug there.

Qian Cai June 26, 2021, 2:29 a.m. UTC | #15

On 6/25/2021 10:37 AM, Ionela Voinescu wrote:
> Quick questions for you:
> 
> 1. When you say you tried a 5.4 kernel, did you try it with these
> patches backported? They also have some dependencies with the recent
> changes in the arch topology driver and cpufreq so they would not be
> straight forward to backport.
> 
> If the 5.4 kernel you tried did not have these patches, it might be best
> to try next/master that has these patches, but with
> CONFIG_ACPI_CPPC_CPUFREQ_FIE=n, just to eliminate the possibility that
> an incorrect frequency scale factor here would affect utilization that
> would then affect the schedutil frequency selection. I would not expect
> this behavior even if the scale factor was wrong, but it would be good
> to rule out.
> 
> 2. Is your platform booting with all CPUs? Are any hotplug operations
> done in your scenario?

Ionela, I found that set ACPI_PROCESSOR=y instead of ACPI_PROCESSOR=m will fix the previous mentioned issues here (any explanations of that?) even though the scaling down is not perfect. Now, we have the following on this idle system:

# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
 	79 1000000
  	1 1160000
 	73 1400000
  	1 2000000
  	4 2010000
  	1 2800000
  	1 860000

Even if I rerun a few times, there could still have a few CPUs running lower than lowest_perf (1GHz). Also, even though I set all CPUs to use "userspace" governor and set freq to the lowest. A few CPUs keep changing at will.

# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
	156 1000000
  	3 2000000
  	1 760000

Qian Cai June 26, 2021, 1:41 p.m. UTC | #16

On 6/25/2021 10:29 PM, Qian Cai wrote:
> Ionela, I found that set ACPI_PROCESSOR=y instead of ACPI_PROCESSOR=m will fix the previous mentioned issues here (any explanations of that?) even though the scaling down is not perfect. Now, we have the following on this idle system:
> 
> # cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
>  	79 1000000
>   	1 1160000
>  	73 1400000
>   	1 2000000
>   	4 2010000
>   	1 2800000
>   	1 860000
> 
> Even if I rerun a few times, there could still have a few CPUs running lower than lowest_perf (1GHz). Also, even though I set all CPUs to use "userspace" governor and set freq to the lowest. A few CPUs keep changing at will.
> 
> # cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
> 	156 1000000
>   	3 2000000
>   	1 760000

Another date point is that set ACPI_CPPC_CPUFREQ_FIE=n fixed the issue that any CPU could run below the lowest freq.

schedutil:
# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
     80 1000000
     78 1400000
      1 2010000
      1 2800000

userspace:
# cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
    158 1000000
      2 2000000

Ionela Voinescu June 28, 2021, 11:54 a.m. UTC | #17

Hi guys,

On Monday 21 Jun 2021 at 14:49:33 (+0530), Viresh Kumar wrote:
> Hello,
> 
> Changes since V2:
> 
> - We don't need start_cpu() and stop_cpu() callbacks anymore, we can make it
>   work using policy ->init() and exit() alone.
> 
> - Two new cleanup patches 1/4 and 2/4.
> 
> - Improved commit log of 3/4.
> 
> - Dropped WARN_ON(local_freq_scale > 1024), since this can occur on counter's
>   overlap (seen with Vincent's setup).
> 

If you happen to have the data around, I would like to know more about
your observations on ThunderX2.


I tried ThunderX2 as well, with the following observations:

Booting with userspace governor and all CPUs online, the CPPC frequency
scale factor was all over the place (even much larger than 1024).

My initial assumptions:
 - Counters do not behave properly in light of SMT
 - Firmware does not do a good job to keep the reference and core
   counters monotonic: save and restore at core off.

So I offlined all CPUs with the exception of 0, 32, 64, 96 - threads of
a single core (part of policy0). With this all works very well:

root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1056000 > scaling_setspeed
root@target:/sys/devices/system/cpu/cpufreq/policy0#
[ 1863.095370] CPU96: cppc scale: 697.
[ 1863.175370] CPU0: cppc scale: 492.
[ 1863.215367] CPU64: cppc scale: 492.
[ 1863.235366] CPU96: cppc scale: 492.
[ 1863.485368] CPU32: cppc scale: 492.

root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1936000 > scaling_setspeed
root@target:/sys/devices/system/cpu/cpufreq/policy0#
[ 1891.395363] CPU96: cppc scale: 558.
[ 1891.415362] CPU0: cppc scale: 595.
[ 1891.435362] CPU32: cppc scale: 615.
[ 1891.465363] CPU96: cppc scale: 635.
[ 1891.495361] CPU0: cppc scale: 673.
[ 1891.515360] CPU32: cppc scale: 703.
[ 1891.545360] CPU96: cppc scale: 738.
[ 1891.575360] CPU0: cppc scale: 779.
[ 1891.605360] CPU96: cppc scale: 829.
[ 1891.635360] CPU0: cppc scale: 879.

root@target:/sys/devices/system/cpu/cpufreq/policy0#
root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 2200000 > scaling_setspeed
root@target:/sys/devices/system/cpu/cpufreq/policy0#
[ 1896.585363] CPU32: cppc scale: 1004.
[ 1896.675359] CPU64: cppc scale: 973.
[ 1896.715359] CPU0: cppc scale: 1024.

I'm doing a rate limited printk only for increase/decrease values over
64 in the scale factor value.

This showed me that SMT is handled properly.

Then, as soon as I start onlining CPUs 1, 33, 65, 97, the scale factor
stops being even close to correct, for example:

[238394.770328] CPU96: cppc scale: 22328.
[238395.628846] CPU96: cppc scale: 245.
[238516.087115] CPU96: cppc scale: 930.
[238523.385009] CPU96: cppc scale: 245.
[238538.767473] CPU96: cppc scale: 936.
[238538.867546] CPU96: cppc scale: 245.
[238599.367932] CPU97: cppc scale: 2728.
[238599.859865] CPU97: cppc scale: 452.
[238647.786284] CPU96: cppc scale: 1438.
[238669.604684] CPU96: cppc scale: 27306.
[238676.805049] CPU96: cppc scale: 245.
[238737.642902] CPU97: cppc scale: 2035.
[238737.664995] CPU97: cppc scale: 452.
[238788.066193] CPU96: cppc scale: 2749.
[238788.110192] CPU96: cppc scale: 245.
[238817.231659] CPU96: cppc scale: 2698.
[238818.083687] CPU96: cppc scale: 245.
[238845.466850] CPU97: cppc scale: 2990.
[238847.477805] CPU97: cppc scale: 452.
[238936.984107] CPU97: cppc scale: 1590.
[238937.029079] CPU97: cppc scale: 452.
[238979.052464] CPU97: cppc scale: 911.
[238980.900668] CPU97: cppc scale: 452.
[239149.587889] CPU96: cppc scale: 803.
[239151.085516] CPU96: cppc scale: 245.
[239303.871373] CPU64: cppc scale: 956.
[239303.906837] CPU64: cppc scale: 245.
[239308.666786] CPU96: cppc scale: 821.
[239319.440634] CPU96: cppc scale: 245.
[239389.978395] CPU97: cppc scale: 4229.
[239391.969562] CPU97: cppc scale: 452.
[239415.894738] CPU96: cppc scale: 630.
[239417.875326] CPU96: cppc scale: 245.

The counter values shown by feedback_ctrs do not seem monotonic even
when only core 0 threads are online.

ref:2812420736 del:166051103
ref:3683620736 del:641578595
ref:1049653440 del:1548202980
ref:2099053440 del:2120997459
ref:3185853440 del:2714205997
ref:712486144  del:3708490753
ref:3658438336 del:3401357212
ref:1570998080 del:2279728438

For now I was just wondering if you have seen the same and whether you
have an opinion on this.

> This is tested on my Hikey platform (without the actual read/write to
> performance counters), with this script for over an hour:
> 
> while true; do
>     for i in `seq 1 7`;
>     do
>         echo 0 > /sys/devices/system/cpu/cpu$i/online;
>     done;
> 
>     for i in `seq 1 7`;
>     do
>         echo 1 > /sys/devices/system/cpu/cpu$i/online;
>     done;
> done
> 
> 
> The same is done by Vincent on ThunderX2 and no issues were seen.

Hotplug worked fine for me as well on both platforms I tested (Juno R2
and ThunderX2).

Thanks,
Ionela.

Vincent Guittot June 28, 2021, 12:17 p.m. UTC | #18

On Mon, 28 Jun 2021 at 14:14, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 28 Jun 2021 at 13:54, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> >
> > Hi guys,
> >
> > On Monday 21 Jun 2021 at 14:49:33 (+0530), Viresh Kumar wrote:
> > > Hello,
> > >
> > > Changes since V2:
> > >
> > > - We don't need start_cpu() and stop_cpu() callbacks anymore, we can make it
> > >   work using policy ->init() and exit() alone.
> > >
> > > - Two new cleanup patches 1/4 and 2/4.
> > >
> > > - Improved commit log of 3/4.
> > >
> > > - Dropped WARN_ON(local_freq_scale > 1024), since this can occur on counter's
> > >   overlap (seen with Vincent's setup).
> > >
> >
> > If you happen to have the data around, I would like to know more about
> > your observations on ThunderX2.
> >
> >
> > I tried ThunderX2 as well, with the following observations:
> >
> > Booting with userspace governor and all CPUs online, the CPPC frequency
> > scale factor was all over the place (even much larger than 1024).
> >
> > My initial assumptions:
> >  - Counters do not behave properly in light of SMT
> >  - Firmware does not do a good job to keep the reference and core
> >    counters monotonic: save and restore at core off.
> >
> > So I offlined all CPUs with the exception of 0, 32, 64, 96 - threads of
> > a single core (part of policy0). With this all works very well:
> >
> > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1056000 > scaling_setspeed
> > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > [ 1863.095370] CPU96: cppc scale: 697.
> > [ 1863.175370] CPU0: cppc scale: 492.
> > [ 1863.215367] CPU64: cppc scale: 492.
> > [ 1863.235366] CPU96: cppc scale: 492.
> > [ 1863.485368] CPU32: cppc scale: 492.
> >
> > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1936000 > scaling_setspeed
> > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > [ 1891.395363] CPU96: cppc scale: 558.
> > [ 1891.415362] CPU0: cppc scale: 595.
> > [ 1891.435362] CPU32: cppc scale: 615.
> > [ 1891.465363] CPU96: cppc scale: 635.
> > [ 1891.495361] CPU0: cppc scale: 673.
> > [ 1891.515360] CPU32: cppc scale: 703.
> > [ 1891.545360] CPU96: cppc scale: 738.
> > [ 1891.575360] CPU0: cppc scale: 779.
> > [ 1891.605360] CPU96: cppc scale: 829.
> > [ 1891.635360] CPU0: cppc scale: 879.
> >
> > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 2200000 > scaling_setspeed
> > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > [ 1896.585363] CPU32: cppc scale: 1004.
> > [ 1896.675359] CPU64: cppc scale: 973.
> > [ 1896.715359] CPU0: cppc scale: 1024.
> >
> > I'm doing a rate limited printk only for increase/decrease values over
> > 64 in the scale factor value.
> >
> > This showed me that SMT is handled properly.
> >
> > Then, as soon as I start onlining CPUs 1, 33, 65, 97, the scale factor
> > stops being even close to correct, for example:
> >
> > [238394.770328] CPU96: cppc scale: 22328.
> > [238395.628846] CPU96: cppc scale: 245.
> > [238516.087115] CPU96: cppc scale: 930.
> > [238523.385009] CPU96: cppc scale: 245.
> > [238538.767473] CPU96: cppc scale: 936.
> > [238538.867546] CPU96: cppc scale: 245.
> > [238599.367932] CPU97: cppc scale: 2728.
> > [238599.859865] CPU97: cppc scale: 452.
> > [238647.786284] CPU96: cppc scale: 1438.
> > [238669.604684] CPU96: cppc scale: 27306.
> > [238676.805049] CPU96: cppc scale: 245.
> > [238737.642902] CPU97: cppc scale: 2035.
> > [238737.664995] CPU97: cppc scale: 452.
> > [238788.066193] CPU96: cppc scale: 2749.
> > [238788.110192] CPU96: cppc scale: 245.
> > [238817.231659] CPU96: cppc scale: 2698.
> > [238818.083687] CPU96: cppc scale: 245.
> > [238845.466850] CPU97: cppc scale: 2990.
> > [238847.477805] CPU97: cppc scale: 452.
> > [238936.984107] CPU97: cppc scale: 1590.
> > [238937.029079] CPU97: cppc scale: 452.
> > [238979.052464] CPU97: cppc scale: 911.
> > [238980.900668] CPU97: cppc scale: 452.
> > [239149.587889] CPU96: cppc scale: 803.
> > [239151.085516] CPU96: cppc scale: 245.
> > [239303.871373] CPU64: cppc scale: 956.
> > [239303.906837] CPU64: cppc scale: 245.
> > [239308.666786] CPU96: cppc scale: 821.
> > [239319.440634] CPU96: cppc scale: 245.
> > [239389.978395] CPU97: cppc scale: 4229.
> > [239391.969562] CPU97: cppc scale: 452.
> > [239415.894738] CPU96: cppc scale: 630.
> > [239417.875326] CPU96: cppc scale: 245.
> >
>
> With the counter being 32bits and the freq scaling being update at
> tick, you can easily get a overflow on it in idle system. I can easily
> imagine that when you unplug CPUs there is enough activity on the CPU
> to update it regularly whereas with all CPUs, the idle time is longer
> that the counter overflow
>
> > The counter values shown by feedback_ctrs do not seem monotonic even
> > when only core 0 threads are online.
> >
> > ref:2812420736 del:166051103
> > ref:3683620736 del:641578595
> > ref:1049653440 del:1548202980
> > ref:2099053440 del:2120997459
> > ref:3185853440 del:2714205997
> > ref:712486144  del:3708490753
> > ref:3658438336 del:3401357212
> > ref:1570998080 del:2279728438

There are 32bits and the overflow need to be handled by cppc_cpufreq driver

> >
> > For now I was just wondering if you have seen the same and whether you
> > have an opinion on this.
> >
> > > This is tested on my Hikey platform (without the actual read/write to
> > > performance counters), with this script for over an hour:
> > >
> > > while true; do
> > >     for i in `seq 1 7`;
> > >     do
> > >         echo 0 > /sys/devices/system/cpu/cpu$i/online;
> > >     done;
> > >
> > >     for i in `seq 1 7`;
> > >     do
> > >         echo 1 > /sys/devices/system/cpu/cpu$i/online;
> > >     done;
> > > done
> > >
> > >
> > > The same is done by Vincent on ThunderX2 and no issues were seen.
> >
> > Hotplug worked fine for me as well on both platforms I tested (Juno R2
> > and ThunderX2).
> >
> > Thanks,
> > Ionela.

Ionela Voinescu June 28, 2021, 1:08 p.m. UTC | #19

On Monday 28 Jun 2021 at 14:14:14 (+0200), Vincent Guittot wrote:
> On Mon, 28 Jun 2021 at 13:54, Ionela Voinescu <ionela.voinescu@arm.com> wrote:

> >

> > Hi guys,

> >

> > On Monday 21 Jun 2021 at 14:49:33 (+0530), Viresh Kumar wrote:

> > > Hello,

> > >

> > > Changes since V2:

> > >

> > > - We don't need start_cpu() and stop_cpu() callbacks anymore, we can make it

> > >   work using policy ->init() and exit() alone.

> > >

> > > - Two new cleanup patches 1/4 and 2/4.

> > >

> > > - Improved commit log of 3/4.

> > >

> > > - Dropped WARN_ON(local_freq_scale > 1024), since this can occur on counter's

> > >   overlap (seen with Vincent's setup).

> > >

> >

> > If you happen to have the data around, I would like to know more about

> > your observations on ThunderX2.

> >

> >

> > I tried ThunderX2 as well, with the following observations:

> >

> > Booting with userspace governor and all CPUs online, the CPPC frequency

> > scale factor was all over the place (even much larger than 1024).

> >

> > My initial assumptions:

> >  - Counters do not behave properly in light of SMT

> >  - Firmware does not do a good job to keep the reference and core

> >    counters monotonic: save and restore at core off.

> >

> > So I offlined all CPUs with the exception of 0, 32, 64, 96 - threads of

> > a single core (part of policy0). With this all works very well:

> >

> > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1056000 > scaling_setspeed

> > root@target:/sys/devices/system/cpu/cpufreq/policy0#

> > [ 1863.095370] CPU96: cppc scale: 697.

> > [ 1863.175370] CPU0: cppc scale: 492.

> > [ 1863.215367] CPU64: cppc scale: 492.

> > [ 1863.235366] CPU96: cppc scale: 492.

> > [ 1863.485368] CPU32: cppc scale: 492.

> >

> > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1936000 > scaling_setspeed

> > root@target:/sys/devices/system/cpu/cpufreq/policy0#

> > [ 1891.395363] CPU96: cppc scale: 558.

> > [ 1891.415362] CPU0: cppc scale: 595.

> > [ 1891.435362] CPU32: cppc scale: 615.

> > [ 1891.465363] CPU96: cppc scale: 635.

> > [ 1891.495361] CPU0: cppc scale: 673.

> > [ 1891.515360] CPU32: cppc scale: 703.

> > [ 1891.545360] CPU96: cppc scale: 738.

> > [ 1891.575360] CPU0: cppc scale: 779.

> > [ 1891.605360] CPU96: cppc scale: 829.

> > [ 1891.635360] CPU0: cppc scale: 879.

> >

> > root@target:/sys/devices/system/cpu/cpufreq/policy0#

> > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 2200000 > scaling_setspeed

> > root@target:/sys/devices/system/cpu/cpufreq/policy0#

> > [ 1896.585363] CPU32: cppc scale: 1004.

> > [ 1896.675359] CPU64: cppc scale: 973.

> > [ 1896.715359] CPU0: cppc scale: 1024.

> >

> > I'm doing a rate limited printk only for increase/decrease values over

> > 64 in the scale factor value.

> >

> > This showed me that SMT is handled properly.

> >

> > Then, as soon as I start onlining CPUs 1, 33, 65, 97, the scale factor

> > stops being even close to correct, for example:

> >

> > [238394.770328] CPU96: cppc scale: 22328.

> > [238395.628846] CPU96: cppc scale: 245.

> > [238516.087115] CPU96: cppc scale: 930.

> > [238523.385009] CPU96: cppc scale: 245.

> > [238538.767473] CPU96: cppc scale: 936.

> > [238538.867546] CPU96: cppc scale: 245.

> > [238599.367932] CPU97: cppc scale: 2728.

> > [238599.859865] CPU97: cppc scale: 452.

> > [238647.786284] CPU96: cppc scale: 1438.

> > [238669.604684] CPU96: cppc scale: 27306.

> > [238676.805049] CPU96: cppc scale: 245.

> > [238737.642902] CPU97: cppc scale: 2035.

> > [238737.664995] CPU97: cppc scale: 452.

> > [238788.066193] CPU96: cppc scale: 2749.

> > [238788.110192] CPU96: cppc scale: 245.

> > [238817.231659] CPU96: cppc scale: 2698.

> > [238818.083687] CPU96: cppc scale: 245.

> > [238845.466850] CPU97: cppc scale: 2990.

> > [238847.477805] CPU97: cppc scale: 452.

> > [238936.984107] CPU97: cppc scale: 1590.

> > [238937.029079] CPU97: cppc scale: 452.

> > [238979.052464] CPU97: cppc scale: 911.

> > [238980.900668] CPU97: cppc scale: 452.

> > [239149.587889] CPU96: cppc scale: 803.

> > [239151.085516] CPU96: cppc scale: 245.

> > [239303.871373] CPU64: cppc scale: 956.

> > [239303.906837] CPU64: cppc scale: 245.

> > [239308.666786] CPU96: cppc scale: 821.

> > [239319.440634] CPU96: cppc scale: 245.

> > [239389.978395] CPU97: cppc scale: 4229.

> > [239391.969562] CPU97: cppc scale: 452.

> > [239415.894738] CPU96: cppc scale: 630.

> > [239417.875326] CPU96: cppc scale: 245.

> >

> 

> With the counter being 32bits and the freq scaling being update at

> tick, you can easily get a overflow on it in idle system. I can easily

> imagine that when you unplug CPUs there is enough activity on the CPU

> to update it regularly whereas with all CPUs, the idle time is longer

> that the counter overflow

> 


Thanks! Yes, given the high wraparound time I thought they were 64 bit.
All variables in software are 64 bit, but looking at bit width in the
_CPC entries, the platform counters are 32 bit counters.

> There are 32bits and the overflow need to be handled by cppc_cpufreq

> driver


I'm wondering if this would be best handled in the function that reads
the counters or in the cppc_cpufreq driver that uses them. Probably the
latter, as you say, as the read function should only return the raw
values, but it does complicate things.

Thanks,
Ionela.



> > The counter values shown by feedback_ctrs do not seem monotonic even

> > when only core 0 threads are online.

> >

> > ref:2812420736 del:166051103

> > ref:3683620736 del:641578595

> > ref:1049653440 del:1548202980

> > ref:2099053440 del:2120997459

> > ref:3185853440 del:2714205997

> > ref:712486144  del:3708490753

> > ref:3658438336 del:3401357212

> > ref:1570998080 del:2279728438

> >

> > For now I was just wondering if you have seen the same and whether you

> > have an opinion on this.

> >

> > > This is tested on my Hikey platform (without the actual read/write to

> > > performance counters), with this script for over an hour:

> > >

> > > while true; do

> > >     for i in `seq 1 7`;

> > >     do

> > >         echo 0 > /sys/devices/system/cpu/cpu$i/online;

> > >     done;

> > >

> > >     for i in `seq 1 7`;

> > >     do

> > >         echo 1 > /sys/devices/system/cpu/cpu$i/online;

> > >     done;

> > > done

> > >

> > >

> > > The same is done by Vincent on ThunderX2 and no issues were seen.

> >

> > Hotplug worked fine for me as well on both platforms I tested (Juno R2

> > and ThunderX2).

> >

> > Thanks,

> > Ionela.

Ionela Voinescu June 28, 2021, 9:37 p.m. UTC | #20

On Monday 28 Jun 2021 at 14:08:13 (+0100), Ionela Voinescu wrote:
> On Monday 28 Jun 2021 at 14:14:14 (+0200), Vincent Guittot wrote:

> > On Mon, 28 Jun 2021 at 13:54, Ionela Voinescu <ionela.voinescu@arm.com> wrote:

> > >

> > > Hi guys,

> > >

> > > On Monday 21 Jun 2021 at 14:49:33 (+0530), Viresh Kumar wrote:

> > > > Hello,

> > > >

> > > > Changes since V2:

> > > >

> > > > - We don't need start_cpu() and stop_cpu() callbacks anymore, we can make it

> > > >   work using policy ->init() and exit() alone.

> > > >

> > > > - Two new cleanup patches 1/4 and 2/4.

> > > >

> > > > - Improved commit log of 3/4.

> > > >

> > > > - Dropped WARN_ON(local_freq_scale > 1024), since this can occur on counter's

> > > >   overlap (seen with Vincent's setup).

> > > >

> > >

> > > If you happen to have the data around, I would like to know more about

> > > your observations on ThunderX2.

> > >

> > >

> > > I tried ThunderX2 as well, with the following observations:

> > >

> > > Booting with userspace governor and all CPUs online, the CPPC frequency

> > > scale factor was all over the place (even much larger than 1024).

> > >

> > > My initial assumptions:

> > >  - Counters do not behave properly in light of SMT

> > >  - Firmware does not do a good job to keep the reference and core

> > >    counters monotonic: save and restore at core off.

> > >

> > > So I offlined all CPUs with the exception of 0, 32, 64, 96 - threads of

> > > a single core (part of policy0). With this all works very well:

> > >

> > > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1056000 > scaling_setspeed

> > > root@target:/sys/devices/system/cpu/cpufreq/policy0#

> > > [ 1863.095370] CPU96: cppc scale: 697.

> > > [ 1863.175370] CPU0: cppc scale: 492.

> > > [ 1863.215367] CPU64: cppc scale: 492.

> > > [ 1863.235366] CPU96: cppc scale: 492.

> > > [ 1863.485368] CPU32: cppc scale: 492.

> > >

> > > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1936000 > scaling_setspeed

> > > root@target:/sys/devices/system/cpu/cpufreq/policy0#

> > > [ 1891.395363] CPU96: cppc scale: 558.

> > > [ 1891.415362] CPU0: cppc scale: 595.

> > > [ 1891.435362] CPU32: cppc scale: 615.

> > > [ 1891.465363] CPU96: cppc scale: 635.

> > > [ 1891.495361] CPU0: cppc scale: 673.

> > > [ 1891.515360] CPU32: cppc scale: 703.

> > > [ 1891.545360] CPU96: cppc scale: 738.

> > > [ 1891.575360] CPU0: cppc scale: 779.

> > > [ 1891.605360] CPU96: cppc scale: 829.

> > > [ 1891.635360] CPU0: cppc scale: 879.

> > >

> > > root@target:/sys/devices/system/cpu/cpufreq/policy0#

> > > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 2200000 > scaling_setspeed

> > > root@target:/sys/devices/system/cpu/cpufreq/policy0#

> > > [ 1896.585363] CPU32: cppc scale: 1004.

> > > [ 1896.675359] CPU64: cppc scale: 973.

> > > [ 1896.715359] CPU0: cppc scale: 1024.

> > >

> > > I'm doing a rate limited printk only for increase/decrease values over

> > > 64 in the scale factor value.

> > >

> > > This showed me that SMT is handled properly.

> > >

> > > Then, as soon as I start onlining CPUs 1, 33, 65, 97, the scale factor

> > > stops being even close to correct, for example:

> > >

> > > [238394.770328] CPU96: cppc scale: 22328.

> > > [238395.628846] CPU96: cppc scale: 245.

> > > [238516.087115] CPU96: cppc scale: 930.

> > > [238523.385009] CPU96: cppc scale: 245.

> > > [238538.767473] CPU96: cppc scale: 936.

> > > [238538.867546] CPU96: cppc scale: 245.

> > > [238599.367932] CPU97: cppc scale: 2728.

> > > [238599.859865] CPU97: cppc scale: 452.

> > > [238647.786284] CPU96: cppc scale: 1438.

> > > [238669.604684] CPU96: cppc scale: 27306.

> > > [238676.805049] CPU96: cppc scale: 245.

> > > [238737.642902] CPU97: cppc scale: 2035.

> > > [238737.664995] CPU97: cppc scale: 452.

> > > [238788.066193] CPU96: cppc scale: 2749.

> > > [238788.110192] CPU96: cppc scale: 245.

> > > [238817.231659] CPU96: cppc scale: 2698.

> > > [238818.083687] CPU96: cppc scale: 245.

> > > [238845.466850] CPU97: cppc scale: 2990.

> > > [238847.477805] CPU97: cppc scale: 452.

> > > [238936.984107] CPU97: cppc scale: 1590.

> > > [238937.029079] CPU97: cppc scale: 452.

> > > [238979.052464] CPU97: cppc scale: 911.

> > > [238980.900668] CPU97: cppc scale: 452.

> > > [239149.587889] CPU96: cppc scale: 803.

> > > [239151.085516] CPU96: cppc scale: 245.

> > > [239303.871373] CPU64: cppc scale: 956.

> > > [239303.906837] CPU64: cppc scale: 245.

> > > [239308.666786] CPU96: cppc scale: 821.

> > > [239319.440634] CPU96: cppc scale: 245.

> > > [239389.978395] CPU97: cppc scale: 4229.

> > > [239391.969562] CPU97: cppc scale: 452.

> > > [239415.894738] CPU96: cppc scale: 630.

> > > [239417.875326] CPU96: cppc scale: 245.

> > >

> > 

> > With the counter being 32bits and the freq scaling being update at

> > tick, you can easily get a overflow on it in idle system. I can easily

> > imagine that when you unplug CPUs there is enough activity on the CPU

> > to update it regularly whereas with all CPUs, the idle time is longer

> > that the counter overflow


For sane counters, how long the CPU is idle should not matter (please
see below my definition of sane counters).

> > 

> 

> Thanks! Yes, given the high wraparound time I thought they were 64 bit.

> All variables in software are 64 bit, but looking at bit width in the

> _CPC entries, the platform counters are 32 bit counters.

> 


I've looked a bit more over the code, and considering this particular
system (32 bit counters, maximum frequency of CPUs = 2.2GHz), I believe
the wraparound is considered, and this should not cause these strange
values in the scale factor.

I consider the counters sane if both stop during idle - either they stop
when CPU is clock gated, or some firmware does save/restore at core off.
Therefore, in all idle cases they seem to have stopped, from the point of
view of the OS. The ACPI spec mentions that both count "any time the
processor is active".

After the cores return from idle, the counters will wraparound at a
minimum of 1.95s. So with a tick every 4ms at most 1 wraparound would
have happened which allows the getDelta() function in cppc_cpufreq
driver to get the proper difference in values.

Let me know what you think.

Thanks,
Ionela.

> > There are 32bits and the overflow need to be handled by cppc_cpufreq

> > driver

> 

> I'm wondering if this would be best handled in the function that reads

> the counters or in the cppc_cpufreq driver that uses them. Probably the

> latter, as you say, as the read function should only return the raw

> values, but it does complicate things.

> 

> Thanks,

> Ionela.

> 

> 

> 

> > > The counter values shown by feedback_ctrs do not seem monotonic even

> > > when only core 0 threads are online.

> > >

> > > ref:2812420736 del:166051103

> > > ref:3683620736 del:641578595

> > > ref:1049653440 del:1548202980

> > > ref:2099053440 del:2120997459

> > > ref:3185853440 del:2714205997

> > > ref:712486144  del:3708490753

> > > ref:3658438336 del:3401357212

> > > ref:1570998080 del:2279728438

> > >

> > > For now I was just wondering if you have seen the same and whether you

> > > have an opinion on this.

> > >

> > > > This is tested on my Hikey platform (without the actual read/write to

> > > > performance counters), with this script for over an hour:

> > > >

> > > > while true; do

> > > >     for i in `seq 1 7`;

> > > >     do

> > > >         echo 0 > /sys/devices/system/cpu/cpu$i/online;

> > > >     done;

> > > >

> > > >     for i in `seq 1 7`;

> > > >     do

> > > >         echo 1 > /sys/devices/system/cpu/cpu$i/online;

> > > >     done;

> > > > done

> > > >

> > > >

> > > > The same is done by Vincent on ThunderX2 and no issues were seen.

> > >

> > > Hotplug worked fine for me as well on both platforms I tested (Juno R2

> > > and ThunderX2).

> > >

> > > Thanks,

> > > Ionela.

Viresh Kumar June 29, 2021, 4:45 a.m. UTC | #21

On 25-06-21, 09:31, Qian Cai wrote:
> The problem is that all CPUs are never scaling down.

> "cpuinfo_cur_freq" and "scaling_cur_freq" are always the 2800 MHz on

> all CPUs on this idle system. This looks like a regression somewhere

> as in 5.4-based kernel, I can see "cpuinfo_cur_freq" can go down to

> 2000 MHz in the same scenario. I'll bisect a bit unless you have

> better ideas?


Few things which may let us understand the readings properly.

- cpuinfo_cur_freq: eventually makes a call to cppc_cpufreq_get_rate()
  and returns the *actual* frequency hardware is running at (based on
  counter diff around 2 us delay).

- scaling_cur_freq: is the frequency the cpufreq core thinks the
  hardware is running at, it would more in sync with what schedutil
  (or other governors) wants the CPU to run at. This can be different
  from what the hardware is running at, i.e. given by
  cpuinfo_cur_freq.

-- 
viresh

Viresh Kumar June 29, 2021, 4:52 a.m. UTC | #22

On 25-06-21, 22:29, Qian Cai wrote:
> Ionela, I found that set ACPI_PROCESSOR=y instead of
> ACPI_PROCESSOR=m will fix the previous mentioned issues here (any
> explanations of that?) even though the scaling down is not perfect.

Not sure how this affects it.

> Now, we have the following on this idle system:
> 
> # cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
>  	79 1000000
>   	1 1160000
>  	73 1400000
>   	1 2000000
>   	4 2010000
>   	1 2800000
>   	1 860000
> 
> Even if I rerun a few times, there could still have a few CPUs
> running lower than lowest_perf (1GHz).

(Please wrap your lines at 80 columns, it makes it harder to read
otherwise).

I think only the counters stopping on idle can get us that.

> Also, even though I set all CPUs to use "userspace" governor and set
> freq to the lowest. A few CPUs keep changing at will.
> 
> # cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
> 	156 1000000
>   	3 2000000
>   	1 760000

I think this is expected since the hardware is in control of frequency
here. The software can only request it to run at X frequency, the
hardware may choose to do something else nevertheless.

Viresh Kumar June 29, 2021, 4:55 a.m. UTC | #23

On 26-06-21, 09:41, Qian Cai wrote:
> Another date point is that set ACPI_CPPC_CPUFREQ_FIE=n fixed the issue that any CPU could run below the lowest freq.
> 
> schedutil:
> # cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
>      80 1000000
>      78 1400000
>       1 2010000
>       1 2800000
> 
> userspace:
> # cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
>     158 1000000
>       2 2000000

ACPI_CPPC_CPUFREQ_FIE can play a role with schedutil, but not with
userspace governor. Userspace doesn't use the values being updated by
ACPI_CPPC_CPUFREQ_FIE. So I think the CPUs may not have been idle,
just for some reason.

Also, now that you are able run on latest kernel (linux-next), it
would be better if we can talk in terms of that only going forward.
5.4 adds more to the already unstable results :)

Viresh Kumar June 29, 2021, 5:20 a.m. UTC | #24

On 28-06-21, 12:54, Ionela Voinescu wrote:
> If you happen to have the data around, I would like to know more about

> your observations on ThunderX2.

> 

> 

> I tried ThunderX2 as well, with the following observations:

> 

> Booting with userspace governor and all CPUs online, the CPPC frequency

> scale factor was all over the place (even much larger than 1024).

> 

> My initial assumptions:

>  - Counters do not behave properly in light of SMT

>  - Firmware does not do a good job to keep the reference and core

>    counters monotonic: save and restore at core off.

> 

> So I offlined all CPUs with the exception of 0, 32, 64, 96 - threads of

> a single core (part of policy0). With this all works very well:


Interesting.

> root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1056000 > scaling_setspeed

> root@target:/sys/devices/system/cpu/cpufreq/policy0#

> [ 1863.095370] CPU96: cppc scale: 697.

> [ 1863.175370] CPU0: cppc scale: 492.

> [ 1863.215367] CPU64: cppc scale: 492.

> [ 1863.235366] CPU96: cppc scale: 492.

> [ 1863.485368] CPU32: cppc scale: 492.

> 

> root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1936000 > scaling_setspeed

> root@target:/sys/devices/system/cpu/cpufreq/policy0#

> [ 1891.395363] CPU96: cppc scale: 558.

> [ 1891.415362] CPU0: cppc scale: 595.

> [ 1891.435362] CPU32: cppc scale: 615.

> [ 1891.465363] CPU96: cppc scale: 635.

> [ 1891.495361] CPU0: cppc scale: 673.

> [ 1891.515360] CPU32: cppc scale: 703.

> [ 1891.545360] CPU96: cppc scale: 738.

> [ 1891.575360] CPU0: cppc scale: 779.

> [ 1891.605360] CPU96: cppc scale: 829.

> [ 1891.635360] CPU0: cppc scale: 879.

> 

> root@target:/sys/devices/system/cpu/cpufreq/policy0#

> root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 2200000 > scaling_setspeed

> root@target:/sys/devices/system/cpu/cpufreq/policy0#

> [ 1896.585363] CPU32: cppc scale: 1004.

> [ 1896.675359] CPU64: cppc scale: 973.

> [ 1896.715359] CPU0: cppc scale: 1024.

> 

> I'm doing a rate limited printk only for increase/decrease values over

> 64 in the scale factor value.

> 

> This showed me that SMT is handled properly.

> 

> Then, as soon as I start onlining CPUs 1, 33, 65, 97, the scale factor

> stops being even close to correct, for example:

> 

> [238394.770328] CPU96: cppc scale: 22328.

> [238395.628846] CPU96: cppc scale: 245.

> [238516.087115] CPU96: cppc scale: 930.

> [238523.385009] CPU96: cppc scale: 245.

> [238538.767473] CPU96: cppc scale: 936.

> [238538.867546] CPU96: cppc scale: 245.

> [238599.367932] CPU97: cppc scale: 2728.

> [238599.859865] CPU97: cppc scale: 452.

> [238647.786284] CPU96: cppc scale: 1438.

> [238669.604684] CPU96: cppc scale: 27306.

> [238676.805049] CPU96: cppc scale: 245.

> [238737.642902] CPU97: cppc scale: 2035.

> [238737.664995] CPU97: cppc scale: 452.

> [238788.066193] CPU96: cppc scale: 2749.

> [238788.110192] CPU96: cppc scale: 245.

> [238817.231659] CPU96: cppc scale: 2698.

> [238818.083687] CPU96: cppc scale: 245.

> [238845.466850] CPU97: cppc scale: 2990.

> [238847.477805] CPU97: cppc scale: 452.

> [238936.984107] CPU97: cppc scale: 1590.

> [238937.029079] CPU97: cppc scale: 452.

> [238979.052464] CPU97: cppc scale: 911.

> [238980.900668] CPU97: cppc scale: 452.

> [239149.587889] CPU96: cppc scale: 803.

> [239151.085516] CPU96: cppc scale: 245.

> [239303.871373] CPU64: cppc scale: 956.

> [239303.906837] CPU64: cppc scale: 245.

> [239308.666786] CPU96: cppc scale: 821.

> [239319.440634] CPU96: cppc scale: 245.

> [239389.978395] CPU97: cppc scale: 4229.

> [239391.969562] CPU97: cppc scale: 452.

> [239415.894738] CPU96: cppc scale: 630.

> [239417.875326] CPU96: cppc scale: 245.

> 

> The counter values shown by feedback_ctrs do not seem monotonic even

> when only core 0 threads are online.

> 

> ref:2812420736 del:166051103

> ref:3683620736 del:641578595

> ref:1049653440 del:1548202980

> ref:2099053440 del:2120997459

> ref:3185853440 del:2714205997

> ref:712486144  del:3708490753

> ref:3658438336 del:3401357212

> ref:1570998080 del:2279728438

> 

> For now I was just wondering if you have seen the same and whether you

> have an opinion on this.


I think we also saw numbers like this, which didn't explain a lot on
ThunderX2. We thought they may be due to rounding issues, but the
offlining stuff adds an interesting factor to that.

-- 
viresh

Vincent Guittot June 29, 2021, 8:45 a.m. UTC | #25

On Mon, 28 Jun 2021 at 23:37, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
>
> On Monday 28 Jun 2021 at 14:08:13 (+0100), Ionela Voinescu wrote:
> > On Monday 28 Jun 2021 at 14:14:14 (+0200), Vincent Guittot wrote:
> > > On Mon, 28 Jun 2021 at 13:54, Ionela Voinescu <ionela.voinescu@arm.com> wrote:
> > > >
> > > > Hi guys,
> > > >
> > > > On Monday 21 Jun 2021 at 14:49:33 (+0530), Viresh Kumar wrote:
> > > > > Hello,
> > > > >
> > > > > Changes since V2:
> > > > >
> > > > > - We don't need start_cpu() and stop_cpu() callbacks anymore, we can make it
> > > > >   work using policy ->init() and exit() alone.
> > > > >
> > > > > - Two new cleanup patches 1/4 and 2/4.
> > > > >
> > > > > - Improved commit log of 3/4.
> > > > >
> > > > > - Dropped WARN_ON(local_freq_scale > 1024), since this can occur on counter's
> > > > >   overlap (seen with Vincent's setup).
> > > > >
> > > >
> > > > If you happen to have the data around, I would like to know more about
> > > > your observations on ThunderX2.
> > > >
> > > >
> > > > I tried ThunderX2 as well, with the following observations:
> > > >
> > > > Booting with userspace governor and all CPUs online, the CPPC frequency
> > > > scale factor was all over the place (even much larger than 1024).
> > > >
> > > > My initial assumptions:
> > > >  - Counters do not behave properly in light of SMT
> > > >  - Firmware does not do a good job to keep the reference and core
> > > >    counters monotonic: save and restore at core off.
> > > >
> > > > So I offlined all CPUs with the exception of 0, 32, 64, 96 - threads of
> > > > a single core (part of policy0). With this all works very well:
> > > >
> > > > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1056000 > scaling_setspeed
> > > > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > > > [ 1863.095370] CPU96: cppc scale: 697.
> > > > [ 1863.175370] CPU0: cppc scale: 492.
> > > > [ 1863.215367] CPU64: cppc scale: 492.
> > > > [ 1863.235366] CPU96: cppc scale: 492.
> > > > [ 1863.485368] CPU32: cppc scale: 492.
> > > >
> > > > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1936000 > scaling_setspeed
> > > > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > > > [ 1891.395363] CPU96: cppc scale: 558.
> > > > [ 1891.415362] CPU0: cppc scale: 595.
> > > > [ 1891.435362] CPU32: cppc scale: 615.
> > > > [ 1891.465363] CPU96: cppc scale: 635.
> > > > [ 1891.495361] CPU0: cppc scale: 673.
> > > > [ 1891.515360] CPU32: cppc scale: 703.
> > > > [ 1891.545360] CPU96: cppc scale: 738.
> > > > [ 1891.575360] CPU0: cppc scale: 779.
> > > > [ 1891.605360] CPU96: cppc scale: 829.
> > > > [ 1891.635360] CPU0: cppc scale: 879.
> > > >
> > > > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > > > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 2200000 > scaling_setspeed
> > > > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > > > [ 1896.585363] CPU32: cppc scale: 1004.
> > > > [ 1896.675359] CPU64: cppc scale: 973.
> > > > [ 1896.715359] CPU0: cppc scale: 1024.
> > > >
> > > > I'm doing a rate limited printk only for increase/decrease values over
> > > > 64 in the scale factor value.
> > > >
> > > > This showed me that SMT is handled properly.
> > > >
> > > > Then, as soon as I start onlining CPUs 1, 33, 65, 97, the scale factor
> > > > stops being even close to correct, for example:
> > > >
> > > > [238394.770328] CPU96: cppc scale: 22328.
> > > > [238395.628846] CPU96: cppc scale: 245.
> > > > [238516.087115] CPU96: cppc scale: 930.
> > > > [238523.385009] CPU96: cppc scale: 245.
> > > > [238538.767473] CPU96: cppc scale: 936.
> > > > [238538.867546] CPU96: cppc scale: 245.
> > > > [238599.367932] CPU97: cppc scale: 2728.
> > > > [238599.859865] CPU97: cppc scale: 452.
> > > > [238647.786284] CPU96: cppc scale: 1438.
> > > > [238669.604684] CPU96: cppc scale: 27306.
> > > > [238676.805049] CPU96: cppc scale: 245.
> > > > [238737.642902] CPU97: cppc scale: 2035.
> > > > [238737.664995] CPU97: cppc scale: 452.
> > > > [238788.066193] CPU96: cppc scale: 2749.
> > > > [238788.110192] CPU96: cppc scale: 245.
> > > > [238817.231659] CPU96: cppc scale: 2698.
> > > > [238818.083687] CPU96: cppc scale: 245.
> > > > [238845.466850] CPU97: cppc scale: 2990.
> > > > [238847.477805] CPU97: cppc scale: 452.
> > > > [238936.984107] CPU97: cppc scale: 1590.
> > > > [238937.029079] CPU97: cppc scale: 452.
> > > > [238979.052464] CPU97: cppc scale: 911.
> > > > [238980.900668] CPU97: cppc scale: 452.
> > > > [239149.587889] CPU96: cppc scale: 803.
> > > > [239151.085516] CPU96: cppc scale: 245.
> > > > [239303.871373] CPU64: cppc scale: 956.
> > > > [239303.906837] CPU64: cppc scale: 245.
> > > > [239308.666786] CPU96: cppc scale: 821.
> > > > [239319.440634] CPU96: cppc scale: 245.
> > > > [239389.978395] CPU97: cppc scale: 4229.
> > > > [239391.969562] CPU97: cppc scale: 452.
> > > > [239415.894738] CPU96: cppc scale: 630.
> > > > [239417.875326] CPU96: cppc scale: 245.
> > > >
> > >
> > > With the counter being 32bits and the freq scaling being update at
> > > tick, you can easily get a overflow on it in idle system. I can easily
> > > imagine that when you unplug CPUs there is enough activity on the CPU
> > > to update it regularly whereas with all CPUs, the idle time is longer
> > > that the counter overflow
>
> For sane counters, how long the CPU is idle should not matter (please
> see below my definition of sane counters).
>
> > >
> >
> > Thanks! Yes, given the high wraparound time I thought they were 64 bit.
> > All variables in software are 64 bit, but looking at bit width in the
> > _CPC entries, the platform counters are 32 bit counters.
> >
>
> I've looked a bit more over the code, and considering this particular
> system (32 bit counters, maximum frequency of CPUs = 2.2GHz), I believe
> the wraparound is considered, and this should not cause these strange
> values in the scale factor.
>
> I consider the counters sane if both stop during idle - either they stop
> when CPU is clock gated, or some firmware does save/restore at core off.
> Therefore, in all idle cases they seem to have stopped, from the point of
> view of the OS. The ACPI spec mentions that both count "any time the
> processor is active".
>
> After the cores return from idle, the counters will wraparound at a
> minimum of 1.95s. So with a tick every 4ms at most 1 wraparound would
> have happened which allows the getDelta() function in cppc_cpufreq
> driver to get the proper difference in values.
>
> Let me know what you think.

This is not what I can see on my THX2. Counter are always increasing
at the same pace even if the system is idle (nothing running except
background activity)

As long as you read counter often enough, the results is coherent with
the freq that has been set by the governor

>
> Thanks,
> Ionela.
>
> > > There are 32bits and the overflow need to be handled by cppc_cpufreq
> > > driver
> >
> > I'm wondering if this would be best handled in the function that reads
> > the counters or in the cppc_cpufreq driver that uses them. Probably the
> > latter, as you say, as the read function should only return the raw
> > values, but it does complicate things.
> >
> > Thanks,
> > Ionela.
> >
> >
> >
> > > > The counter values shown by feedback_ctrs do not seem monotonic even
> > > > when only core 0 threads are online.
> > > >
> > > > ref:2812420736 del:166051103
> > > > ref:3683620736 del:641578595
> > > > ref:1049653440 del:1548202980
> > > > ref:2099053440 del:2120997459
> > > > ref:3185853440 del:2714205997
> > > > ref:712486144  del:3708490753
> > > > ref:3658438336 del:3401357212
> > > > ref:1570998080 del:2279728438
> > > >
> > > > For now I was just wondering if you have seen the same and whether you
> > > > have an opinion on this.
> > > >
> > > > > This is tested on my Hikey platform (without the actual read/write to
> > > > > performance counters), with this script for over an hour:
> > > > >
> > > > > while true; do
> > > > >     for i in `seq 1 7`;
> > > > >     do
> > > > >         echo 0 > /sys/devices/system/cpu/cpu$i/online;
> > > > >     done;
> > > > >
> > > > >     for i in `seq 1 7`;
> > > > >     do
> > > > >         echo 1 > /sys/devices/system/cpu/cpu$i/online;
> > > > >     done;
> > > > > done
> > > > >
> > > > >
> > > > > The same is done by Vincent on ThunderX2 and no issues were seen.
> > > >
> > > > Hotplug worked fine for me as well on both platforms I tested (Juno R2
> > > > and ThunderX2).
> > > >
> > > > Thanks,
> > > > Ionela.

Ionela Voinescu June 29, 2021, 8:46 a.m. UTC | #26

Hi,

On Tuesday 29 Jun 2021 at 10:50:28 (+0530), Viresh Kumar wrote:
> On 28-06-21, 12:54, Ionela Voinescu wrote:
> > If you happen to have the data around, I would like to know more about
> > your observations on ThunderX2.
> > 
> > 
> > I tried ThunderX2 as well, with the following observations:
> > 
> > Booting with userspace governor and all CPUs online, the CPPC frequency
> > scale factor was all over the place (even much larger than 1024).
> > 
> > My initial assumptions:
> >  - Counters do not behave properly in light of SMT
> >  - Firmware does not do a good job to keep the reference and core
> >    counters monotonic: save and restore at core off.
> > 
> > So I offlined all CPUs with the exception of 0, 32, 64, 96 - threads of
> > a single core (part of policy0). With this all works very well:
> 
> Interesting.
> 
> > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1056000 > scaling_setspeed
> > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > [ 1863.095370] CPU96: cppc scale: 697.
> > [ 1863.175370] CPU0: cppc scale: 492.
> > [ 1863.215367] CPU64: cppc scale: 492.
> > [ 1863.235366] CPU96: cppc scale: 492.
> > [ 1863.485368] CPU32: cppc scale: 492.
> > 
> > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 1936000 > scaling_setspeed
> > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > [ 1891.395363] CPU96: cppc scale: 558.
> > [ 1891.415362] CPU0: cppc scale: 595.
> > [ 1891.435362] CPU32: cppc scale: 615.
> > [ 1891.465363] CPU96: cppc scale: 635.
> > [ 1891.495361] CPU0: cppc scale: 673.
> > [ 1891.515360] CPU32: cppc scale: 703.
> > [ 1891.545360] CPU96: cppc scale: 738.
> > [ 1891.575360] CPU0: cppc scale: 779.
> > [ 1891.605360] CPU96: cppc scale: 829.
> > [ 1891.635360] CPU0: cppc scale: 879.
> > 
> > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > root@target:/sys/devices/system/cpu/cpufreq/policy0# echo 2200000 > scaling_setspeed
> > root@target:/sys/devices/system/cpu/cpufreq/policy0#
> > [ 1896.585363] CPU32: cppc scale: 1004.
> > [ 1896.675359] CPU64: cppc scale: 973.
> > [ 1896.715359] CPU0: cppc scale: 1024.
> > 
> > I'm doing a rate limited printk only for increase/decrease values over
> > 64 in the scale factor value.
> > 
> > This showed me that SMT is handled properly.
> > 
> > Then, as soon as I start onlining CPUs 1, 33, 65, 97, the scale factor
> > stops being even close to correct, for example:
> > 
> > [238394.770328] CPU96: cppc scale: 22328.
> > [238395.628846] CPU96: cppc scale: 245.
> > [238516.087115] CPU96: cppc scale: 930.
> > [238523.385009] CPU96: cppc scale: 245.
> > [238538.767473] CPU96: cppc scale: 936.
> > [238538.867546] CPU96: cppc scale: 245.
> > [238599.367932] CPU97: cppc scale: 2728.
> > [238599.859865] CPU97: cppc scale: 452.
> > [238647.786284] CPU96: cppc scale: 1438.
> > [238669.604684] CPU96: cppc scale: 27306.
> > [238676.805049] CPU96: cppc scale: 245.
> > [238737.642902] CPU97: cppc scale: 2035.
> > [238737.664995] CPU97: cppc scale: 452.
> > [238788.066193] CPU96: cppc scale: 2749.
> > [238788.110192] CPU96: cppc scale: 245.
> > [238817.231659] CPU96: cppc scale: 2698.
> > [238818.083687] CPU96: cppc scale: 245.
> > [238845.466850] CPU97: cppc scale: 2990.
> > [238847.477805] CPU97: cppc scale: 452.
> > [238936.984107] CPU97: cppc scale: 1590.
> > [238937.029079] CPU97: cppc scale: 452.
> > [238979.052464] CPU97: cppc scale: 911.
> > [238980.900668] CPU97: cppc scale: 452.
> > [239149.587889] CPU96: cppc scale: 803.
> > [239151.085516] CPU96: cppc scale: 245.
> > [239303.871373] CPU64: cppc scale: 956.
> > [239303.906837] CPU64: cppc scale: 245.
> > [239308.666786] CPU96: cppc scale: 821.
> > [239319.440634] CPU96: cppc scale: 245.
> > [239389.978395] CPU97: cppc scale: 4229.
> > [239391.969562] CPU97: cppc scale: 452.
> > [239415.894738] CPU96: cppc scale: 630.
> > [239417.875326] CPU96: cppc scale: 245.
> > 
> > The counter values shown by feedback_ctrs do not seem monotonic even
> > when only core 0 threads are online.
> > 
> > ref:2812420736 del:166051103
> > ref:3683620736 del:641578595
> > ref:1049653440 del:1548202980
> > ref:2099053440 del:2120997459
> > ref:3185853440 del:2714205997
> > ref:712486144  del:3708490753
> > ref:3658438336 del:3401357212
> > ref:1570998080 del:2279728438
> > 
> > For now I was just wondering if you have seen the same and whether you
> > have an opinion on this.
> 
> I think we also saw numbers like this, which didn't explain a lot on
> ThunderX2. We thought they may be due to rounding issues, but the
> offlining stuff adds an interesting factor to that.
> 

More testing last night showed that having 1 core (with all 4 threads)
online per socket works fine, while having 2 cores online per socket
gives these bad values. My assumption is that the counters reset to 0
at core off which introduces the behavior. When there is a single core
per socket, this single core cannot be turned off.

I tried to boot with less CPUs and cpuidle.off=1 but it did not make a
difference. I expect much of the idle control to be in
hardware/firmware, so possibly cores were still turned off.

I'll do more research on idle state management for SMT and debugging
to see if it explains more, but it will take longer as I've ignored a
few other responsibilities in the meantime.

Thanks,
Ionela.

> -- 
> viresh

Ionela Voinescu June 29, 2021, 9:06 a.m. UTC | #27

Hi Qian,

Sorry for the delay. I was trying to run tests on ThunderX2 as well, as
it can give me more control over testing, to get more insight on this.

On Friday 25 Jun 2021 at 22:29:26 (-0400), Qian Cai wrote:
> 
> 
> On 6/25/2021 10:37 AM, Ionela Voinescu wrote:
> > Quick questions for you:
> > 
> > 1. When you say you tried a 5.4 kernel, did you try it with these
> > patches backported? They also have some dependencies with the recent
> > changes in the arch topology driver and cpufreq so they would not be
> > straight forward to backport.
> > 
> > If the 5.4 kernel you tried did not have these patches, it might be best
> > to try next/master that has these patches, but with
> > CONFIG_ACPI_CPPC_CPUFREQ_FIE=n, just to eliminate the possibility that
> > an incorrect frequency scale factor here would affect utilization that
> > would then affect the schedutil frequency selection. I would not expect
> > this behavior even if the scale factor was wrong, but it would be good
> > to rule out.
> > 
> > 2. Is your platform booting with all CPUs? Are any hotplug operations
> > done in your scenario?
> 
> Ionela, I found that set ACPI_PROCESSOR=y instead of ACPI_PROCESSOR=m
> will fix the previous mentioned issues here (any explanations of that?)
> even though the scaling down is not perfect. Now, we have the following
> on this idle system:
> 

I don't see how this would have played a role. The cppc cpufreq driver
depends on functionality gated by CONFIG_ACPI_CPPC_LIB which in turn
needs CONFIG_ACPI_PROCESSOR to trigger the parsing of the _CPC objects.
If CONFIG_ACPI_PROCESSOR functionality is not built in or loaded, the
cppc cpufreq driver could not be used at all.

> # cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
>  	79 1000000
>   	1 1160000
>  	73 1400000
>   	1 2000000
>   	4 2010000
>   	1 2800000
>   	1 860000
> 
> Even if I rerun a few times, there could still have a few CPUs running
> lower than lowest_perf (1GHz). Also, even though I set all CPUs to use
> "userspace" governor and set freq to the lowest. A few CPUs keep changing
> at will.
> 

I do not believe getting values lower than lowest is worrying as long as
they are not much much lower and they don't happen very often. First of
all firmware has control over CPU frequencies and it can decide to
select a lower frequency if it wishes.

Looking at the fact that it only happens rarely in your tests, it's also
possible that this is introduced by the fact that the CPU only spends
only a few cycles in active state. That would reduce the resolution of
the computed current performance level, which results in a lower value
when converted to frequency.

Hope it helps,
Ionela.

> # cat /sys/devices/system/cpu/*/cpufreq/cpuinfo_cur_freq | sort | uniq  -c
> 	156 1000000
>   	3 2000000
>   	1 760000

Qian Cai June 29, 2021, 1:38 p.m. UTC | #28

On 6/29/2021 5:06 AM, Ionela Voinescu wrote:
>> Ionela, I found that set ACPI_PROCESSOR=y instead of ACPI_PROCESSOR=m

>> will fix the previous mentioned issues here (any explanations of that?)

>> even though the scaling down is not perfect. Now, we have the following

>> on this idle system:

>>

> 

> I don't see how this would have played a role. The cppc cpufreq driver

> depends on functionality gated by CONFIG_ACPI_CPPC_LIB which in turn

> needs CONFIG_ACPI_PROCESSOR to trigger the parsing of the _CPC objects.

> If CONFIG_ACPI_PROCESSOR functionality is not built in or loaded, the

> cppc cpufreq driver could not be used at all.

Ionela, that is fine. I can live with setting ACPI_PROCESSOR=y to workaround. I
was more of just curious about why manually loading processor.ko would set the
lowest to 2GHz instead of 1GHz.

Anyway, if running below the lowest is not the a concern, then I have concluded
my testing here. Feel free to include:

Tested-by: Qian Cai <quic_qiancai@quicinc.com>

[V3,0/4] cpufreq: cppc: Add support for frequency invariance

Message

Comments