diff mbox series

cpufreq: stats: Switch to ktime and msec instead of jiffies and usertime

Message ID 0e0fb542b6f6b26944cb2cf356041348aeac95f6.1605006378.git.viresh.kumar@linaro.org
State New
Headers show
Series cpufreq: stats: Switch to ktime and msec instead of jiffies and usertime | expand

Commit Message

Viresh Kumar Nov. 10, 2020, 11:07 a.m. UTC
The cpufreq and thermal core, both provide sysfs statistics to help
userspace learn about the behavior of frequencies and cooling states.

This is how they look:

/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:208000 11
/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:432000 147
/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:729000 1600
/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:960000 879
/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 399

/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 4097
/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state1 8932
/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state2 15868
/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state3 1384
/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state4 103

Here, state0 of thermal corresponds to the highest frequency of the CPU,
i.e. 1200000 and state4 to the lowest one.

While both of these try to show similar kind of data (which can still be
very much different from each other), the values looked different (by a
factor of 10, i.e. thermal's time_in_state is almost 10 times that of
cpufreq time_in_state).

This comes from the fact that cpufreq core displays the time in usertime
units (10 ms).

It would be better if both the frameworks displayed times in the same
unit as the users may need to correlate between them and different
scales just make it awkward. And the choice of thermal core for that
(msec) seems to be a better choice as it is easier to read.

The thermal core also does the stats calculations using ktime, which is
much more accurate as compared to jiffies used by cpufreq core.

This patch updates the cpufreq core to use ktime for the internal
calculations and changes the units of time_in_state to msec.

The results look like this after this commit:

/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:208000 13
/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:432000 790
/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:729000 12492
/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:960000 13259
/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 3830

/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 3888
/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state1 13432
/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state2 12336
/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state3 740
/sys/class/thermal/cooling_device0/stats/time_in_state_ms:state4 0

FWIW, tools/power/cpupower/ does consume the time_in_state values from
the sysfs files but it is independent of the unit of the time and didn't
require an update.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
 Documentation/cpu-freq/cpufreq-stats.rst |  5 +--
 drivers/cpufreq/cpufreq_stats.c          | 47 +++++++++++++-----------
 2 files changed, 28 insertions(+), 24 deletions(-)

-- 
2.25.0.rc1.19.g042ed3e048af

Comments

Lukasz Luba Nov. 10, 2020, 11:36 a.m. UTC | #1
On 11/10/20 11:07 AM, Viresh Kumar wrote:
> The cpufreq and thermal core, both provide sysfs statistics to help

> userspace learn about the behavior of frequencies and cooling states.

> 

> This is how they look:

> 

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:208000 11

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:432000 147

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:729000 1600

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:960000 879

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 399

> 

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 4097

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state1 8932

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state2 15868

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state3 1384

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state4 103

> 

> Here, state0 of thermal corresponds to the highest frequency of the CPU,

> i.e. 1200000 and state4 to the lowest one.

> 

> While both of these try to show similar kind of data (which can still be

> very much different from each other), the values looked different (by a

> factor of 10, i.e. thermal's time_in_state is almost 10 times that of

> cpufreq time_in_state).

> 

> This comes from the fact that cpufreq core displays the time in usertime

> units (10 ms).

> 

> It would be better if both the frameworks displayed times in the same

> unit as the users may need to correlate between them and different

> scales just make it awkward. And the choice of thermal core for that

> (msec) seems to be a better choice as it is easier to read.

> 

> The thermal core also does the stats calculations using ktime, which is

> much more accurate as compared to jiffies used by cpufreq core.

> 

> This patch updates the cpufreq core to use ktime for the internal

> calculations and changes the units of time_in_state to msec.

> 

> The results look like this after this commit:

> 

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:208000 13

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:432000 790

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:729000 12492

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:960000 13259

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 3830

> 

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 3888

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state1 13432

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state2 12336

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state3 740

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state4 0

> 

> FWIW, tools/power/cpupower/ does consume the time_in_state values from

> the sysfs files but it is independent of the unit of the time and didn't

> require an update.

> 

> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

> ---

>   Documentation/cpu-freq/cpufreq-stats.rst |  5 +--

>   drivers/cpufreq/cpufreq_stats.c          | 47 +++++++++++++-----------

>   2 files changed, 28 insertions(+), 24 deletions(-)

> 

> diff --git a/Documentation/cpu-freq/cpufreq-stats.rst b/Documentation/cpu-freq/cpufreq-stats.rst

> index 9ad695b1c7db..9f94012a882f 100644

> --- a/Documentation/cpu-freq/cpufreq-stats.rst

> +++ b/Documentation/cpu-freq/cpufreq-stats.rst

> @@ -64,9 +64,8 @@ need for a reboot.

>   

>   This gives the amount of time spent in each of the frequencies supported by

>   this CPU. The cat output will have "<frequency> <time>" pair in each line, which

> -will mean this CPU spent <time> usertime units of time at <frequency>. Output

> -will have one line for each of the supported frequencies. usertime units here

> -is 10mS (similar to other time exported in /proc).

> +will mean this CPU spent <time> msec of time at <frequency>. Output will have

> +one line for each of the supported frequencies.

>   

>   ::

>   

> diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c

> index 6cd5c8ab5d49..e054ada291e7 100644

> --- a/drivers/cpufreq/cpufreq_stats.c

> +++ b/drivers/cpufreq/cpufreq_stats.c

> @@ -14,35 +14,38 @@

>   

>   struct cpufreq_stats {

>   	unsigned int total_trans;

> -	unsigned long long last_time;

> +	ktime_t last_time;

>   	unsigned int max_state;

>   	unsigned int state_num;

>   	unsigned int last_index;

> -	u64 *time_in_state;

> +	ktime_t *time_in_state;

>   	unsigned int *freq_table;

>   	unsigned int *trans_table;

>   

>   	/* Deferred reset */

>   	unsigned int reset_pending;

> -	unsigned long long reset_time;

> +	ktime_t reset_time;

>   };

>   

> -static void cpufreq_stats_update(struct cpufreq_stats *stats,

> -				 unsigned long long time)

> +static void cpufreq_stats_update(struct cpufreq_stats *stats, ktime_t time)

>   {

> -	unsigned long long cur_time = get_jiffies_64();

> +	ktime_t cur_time = ktime_get(), delta;

>   

> -	stats->time_in_state[stats->last_index] += cur_time - time;

> +	delta = ktime_sub(cur_time, time);

> +	stats->time_in_state[stats->last_index] =

> +		ktime_add(stats->time_in_state[stats->last_index], delta);

>   	stats->last_time = cur_time;

>   }

>   

>   static void cpufreq_stats_reset_table(struct cpufreq_stats *stats)

>   {

> -	unsigned int count = stats->max_state;

> +	unsigned int count = stats->max_state, i;

> +

> +	for (i = 0; i < count; i++)

> +		stats->time_in_state[i] = ktime_set(0, 0);

>   

> -	memset(stats->time_in_state, 0, count * sizeof(u64));

>   	memset(stats->trans_table, 0, count * count * sizeof(int));

> -	stats->last_time = get_jiffies_64();

> +	stats->last_time = ktime_get();

>   	stats->total_trans = 0;

>   

>   	/* Adjust for the time elapsed since reset was requested */

> @@ -70,7 +73,7 @@ static ssize_t show_time_in_state(struct cpufreq_policy *policy, char *buf)

>   {

>   	struct cpufreq_stats *stats = policy->stats;

>   	bool pending = READ_ONCE(stats->reset_pending);

> -	unsigned long long time;

> +	ktime_t time, now = ktime_get(), delta;

>   	ssize_t len = 0;

>   	int i;

>   

> @@ -82,18 +85,20 @@ static ssize_t show_time_in_state(struct cpufreq_policy *policy, char *buf)

>   				 * before the reset_pending read above.

>   				 */

>   				smp_rmb();

> -				time = get_jiffies_64() - READ_ONCE(stats->reset_time);

> +				time = ktime_sub(now, READ_ONCE(stats->reset_time));

>   			} else {

> -				time = 0;

> +				time = ktime_set(0, 0);;

>   			}

>   		} else {

>   			time = stats->time_in_state[i];

> -			if (i == stats->last_index)

> -				time += get_jiffies_64() - stats->last_time;

> +			if (i == stats->last_index) {

> +				delta = ktime_sub(now, stats->last_time);

> +				time = ktime_add(delta, time);

> +			}

>   		}

>   

>   		len += sprintf(buf + len, "%u %llu\n", stats->freq_table[i],

> -			       jiffies_64_to_clock_t(time));

> +			       ktime_to_ms(time));

>   	}

>   	return len;

>   }

> @@ -109,7 +114,7 @@ static ssize_t store_reset(struct cpufreq_policy *policy, const char *buf,

>   	 * Defer resetting of stats to cpufreq_stats_record_transition() to

>   	 * avoid races.

>   	 */

> -	WRITE_ONCE(stats->reset_time, get_jiffies_64());

> +	WRITE_ONCE(stats->reset_time, ktime_get());

>   	/*

>   	 * The memory barrier below is to prevent the readers of reset_time from

>   	 * seeing a stale or partially updated value.

> @@ -228,9 +233,9 @@ void cpufreq_stats_create_table(struct cpufreq_policy *policy)

>   	if (!stats)

>   		return;

>   

> -	alloc_size = count * sizeof(int) + count * sizeof(u64);

> -

> -	alloc_size += count * count * sizeof(int);

> +	alloc_size = count * sizeof(*stats->time_in_state);

> +	alloc_size += count * sizeof(*stats->freq_table);

> +	alloc_size += count * count * sizeof(*stats->trans_table);

>   

>   	/* Allocate memory for time_in_state/freq_table/trans_table in one go */

>   	stats->time_in_state = kzalloc(alloc_size, GFP_KERNEL);

> @@ -249,7 +254,7 @@ void cpufreq_stats_create_table(struct cpufreq_policy *policy)

>   			stats->freq_table[i++] = pos->frequency;

>   

>   	stats->state_num = i;

> -	stats->last_time = get_jiffies_64();

> +	stats->last_time = ktime_get();

>   	stats->last_index = freq_table_get_index(stats, policy->cur);

>   

>   	policy->stats = stats;

> 


I am not sure if these ktime_get() are not too heavy in the code path
visited by the scheduler.

How about local_clock()?
It's used in ./drivers/cpuidle/cpuidle.c to do similar accounting.

Regards,
Lukasz
Thomas Renninger Nov. 10, 2020, 12:53 p.m. UTC | #2
Am Dienstag, 10. November 2020, 12:07:37 CET schrieb Viresh Kumar:
> The cpufreq and thermal core, both provide sysfs statistics to help

> userspace learn about the behavior of frequencies and cooling states.

> 

> This is how they look:

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 399

 
> The results look like this after this commit:

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 3830


How would userspace know whether it's ms or 10ms?

whatabout a new file with the same convention as cooling devices (adding ms):
 
> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 3888

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state_ms:1200000 3830


Somewhat off-topic, some ideas:

I wonder how useful these stats still are.
CPU_FREQ_STAT is off on my system:

config CPU_FREQ_STAT
        bool "CPU frequency transition statistics"
        help
          Export CPU frequency statistics information through sysfs.

          If in doubt, say N.

Iirc this was a module at former times?

commit 1aefc75b2449eb68a6fc3ca932e2a4ee353b748d
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Tue May 31 22:14:44 2016 +0200

    cpufreq: stats: Make the stats code non-modular

outlined 2 problems with cpufreq_stats being non-modular, but
also seem to fix them up:
... and drop the notifiers from it
Make the stats sysfs attributes appear empty if fast frequency
switching is enabled...

   Thomas
Rafael J. Wysocki Nov. 10, 2020, 12:59 p.m. UTC | #3
On Tue, Nov 10, 2020 at 12:07 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>

> The cpufreq and thermal core, both provide sysfs statistics to help

> userspace learn about the behavior of frequencies and cooling states.

>

> This is how they look:

>

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:208000 11

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:432000 147

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:729000 1600

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:960000 879

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 399

>

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 4097

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state1 8932

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state2 15868

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state3 1384

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state4 103

>

> Here, state0 of thermal corresponds to the highest frequency of the CPU,

> i.e. 1200000 and state4 to the lowest one.

>

> While both of these try to show similar kind of data (which can still be

> very much different from each other), the values looked different (by a

> factor of 10, i.e. thermal's time_in_state is almost 10 times that of

> cpufreq time_in_state).

>

> This comes from the fact that cpufreq core displays the time in usertime

> units (10 ms).

>

> It would be better if both the frameworks displayed times in the same

> unit as the users may need to correlate between them and different

> scales just make it awkward. And the choice of thermal core for that

> (msec) seems to be a better choice as it is easier to read.

>

> The thermal core also does the stats calculations using ktime, which is

> much more accurate as compared to jiffies used by cpufreq core.

>

> This patch updates the cpufreq core to use ktime for the internal

> calculations and changes the units of time_in_state to msec.


Well, this may confuse user space using the stats today.

>

> The results look like this after this commit:

>

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:208000 13

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:432000 790

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:729000 12492

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:960000 13259

> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 3830

>

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 3888

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state1 13432

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state2 12336

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state3 740

> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state4 0

>

> FWIW, tools/power/cpupower/ does consume the time_in_state values from

> the sysfs files but it is independent of the unit of the time and didn't

> require an update.


But whoever uses cpupower may be confused.
Viresh Kumar Nov. 11, 2020, 5:13 a.m. UTC | #4
On 10-11-20, 13:53, Thomas Renninger wrote:
> Am Dienstag, 10. November 2020, 12:07:37 CET schrieb Viresh Kumar:

> > The cpufreq and thermal core, both provide sysfs statistics to help

> > userspace learn about the behavior of frequencies and cooling states.

> > 

> > This is how they look:

> > /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 399

>  

> > The results look like this after this commit:

> > /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 3830

> 

> How would userspace know whether it's ms or 10ms?

> 

> whatabout a new file with the same convention as cooling devices (adding ms):


Keeping two files for same stuff is not great, and renaming the file
breaks userspace ABI. I am not sure what's the right thing to do here.

> > /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 3888

> > /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state_ms:1200000 3830

> 

> Somewhat off-topic, some ideas:

> 

> I wonder how useful these stats still are.

> CPU_FREQ_STAT is off on my system:


I still use it.

> config CPU_FREQ_STAT

>         bool "CPU frequency transition statistics"

>         help

>           Export CPU frequency statistics information through sysfs.

> 

>           If in doubt, say N.

> 

> Iirc this was a module at former times?

> 

> commit 1aefc75b2449eb68a6fc3ca932e2a4ee353b748d

> Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> Date:   Tue May 31 22:14:44 2016 +0200

> 

>     cpufreq: stats: Make the stats code non-modular

> 

> outlined 2 problems with cpufreq_stats being non-modular, but

> also seem to fix them up:

> ... and drop the notifiers from it

> Make the stats sysfs attributes appear empty if fast frequency

> switching is enabled...


I already fixed this recently and stats don't appear empty for fast
switch anymore.

-- 
viresh
Viresh Kumar Nov. 11, 2020, 5:14 a.m. UTC | #5
On 10-11-20, 11:36, Lukasz Luba wrote:
> I am not sure if these ktime_get() are not too heavy in the code path

> visited by the scheduler.


Ahh Right. I missed that.

> How about local_clock()?

> It's used in ./drivers/cpuidle/cpuidle.c to do similar accounting.


Will have a look.

-- 
viresh
Viresh Kumar Nov. 11, 2020, 5:28 a.m. UTC | #6
On 10-11-20, 13:59, Rafael J. Wysocki wrote:
> Well, this may confuse user space using the stats today.

> 

> But whoever uses cpupower may be confused.


Yes, it will confuse them for once and they will probably learn of the
change, not sure how many of them would be there though who look at
these stats. I find them helpful during testing of my stuff sometimes
and they already look a bit confusing.

-- 
viresh
Thomas Renninger Nov. 11, 2020, 8:13 a.m. UTC | #7
Am Mittwoch, 11. November 2020, 06:13:50 CET schrieb Viresh Kumar:
> On 10-11-20, 13:53, Thomas Renninger wrote:

> > Am Dienstag, 10. November 2020, 12:07:37 CET schrieb Viresh Kumar:

> > > The cpufreq and thermal core, both provide sysfs statistics to help

> > > userspace learn about the behavior of frequencies and cooling states.

> > > 

> > > This is how they look:

> > > /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 399

> > > 

> > > The results look like this after this commit:

> > > /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 3830

> > 

> > How would userspace know whether it's ms or 10ms?


Again:
How would userspace know whether it's ms or 10ms?

> > whatabout a new file with the same convention as cooling devices (adding ms):

> Keeping two files for same stuff is not great, and renaming the file

> breaks userspace ABI.


No exactly the other way around:
- Renaming, breaks the userspace ABI.
- Two files would be the super correct way to go:
  - Deprecate the old file and keep the 10ms around for some years
    ./Documentation/ABI/obsolete
  - Add the new interface and document it in:
   ./Documentation/ABI/testing

As this is about a minor cpufreq_stat debug file, it is enough if
you rename to:
> /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state_ms

Ideally document it in ./Documentation/ABI/testing
But I guess for this one this is also not mandatory.

Then userspace can do:
MS_FILE="/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state_ms"
10MS_FILE="/sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state"

if [ -r "$MS_FILE" ]; then
    POLICY0_MS=$(<"$MS_FILE")
    echo "Found ms stats file"
elif [ -r "$10MS_FILE" ]; then
    echo "Found 10ms stats file, converting..."
    POLICY0_MS=$(<"$10MS_FILE")
    POLICY0_MS=$(echo "$POLICY0_MS 10 /" |dc)
else
    echo "cpufreq stat not compiled in?"
fi

> I am not sure what's the right thing to do here.


Use a new *_ms name.
If you are unsure how many people this still might use, keep the old file and mark
it deprecated and remove it in some years.
You could also add:
pr_info("%s userspace process accessed deprecated sysfs file %s", task->comm, OLD_SYSFS_FILE_PATH);
To find other userspace apps making use of it.

...
 
> I already fixed this recently and stats don't appear empty for fast

> switch anymore.


Then cpufreq_stats could be a module again?

      Thomas
Thomas Renninger Nov. 11, 2020, 8:42 a.m. UTC | #8
Hi,

sorry for high-jacking this thread, it is at least related and afaik you are
deeper involved in this:

(cutting CC list)

Am Dienstag, 10. November 2020, 12:07:37 CET schrieb Viresh Kumar:
> The cpufreq and thermal core, both provide sysfs statistics to help
> userspace learn about the behavior of frequencies and cooling states.
...
> /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 4097
There is the trans_table for cooling devices in the same dir:
/sys/devices/virtual/thermal/cooling_device*/stats/trans_table

I recently stumbled over this in a bug report and realized that it it seem
to overflow rather quickly due  to static memory usage.
Fixing it seem to be rather complex, not sure it's worth it.

for device 0-3 I get:
cat /sys/devices/virtual/thermal/cooling_device0/stats/trans_table
 From  :    To
       : state 0  
state 0:       0

and when it seem to get interesting (device 4 and 5), I get:
cat /sys/devices/virtual/thermal/cooling_device4/stats/trans_table
cat: /sys/devices/virtual/thermal/cooling_device4/stats/trans_table: File too large


Just a heads up.
Maybe it's worth to touch this as well if sysfs is changed in this area anyway.
Afaik sysfs forbids such data like whole transition tables in one file and dynamic
mem alloc.
Either it gets split up into
../cooling_device0/stats/trans_to_1
../cooling_device0/stats/trans_to_2
or maybe this should better live in debugfs?
or it could get reverted/deprecated?
or there might be another better solution to show this info...

No need to continue this off-topic thread.
Just a heads up in case someone has a neat idea...

           Thomas
Viresh Kumar Nov. 11, 2020, 9:51 a.m. UTC | #9
On 11-11-20, 09:13, Thomas Renninger wrote:
> Am Mittwoch, 11. November 2020, 06:13:50 CET schrieb Viresh Kumar:

> > On 10-11-20, 13:53, Thomas Renninger wrote:

> > > Am Dienstag, 10. November 2020, 12:07:37 CET schrieb Viresh Kumar:

> > > > The cpufreq and thermal core, both provide sysfs statistics to help

> > > > userspace learn about the behavior of frequencies and cooling states.

> > > > 

> > > > This is how they look:

> > > > /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 399

> > > > 

> > > > The results look like this after this commit:

> > > > /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state:1200000 3830

> > > 

> > > How would userspace know whether it's ms or 10ms?

> 

> Again:

> How would userspace know whether it's ms or 10ms?


Yeah, I understand the problem you are pointing at.

> > > whatabout a new file with the same convention as cooling devices (adding ms):

> > Keeping two files for same stuff is not great, and renaming the file

> > breaks userspace ABI.

> 

> No exactly the other way around:

> - Renaming, breaks the userspace ABI.

> - Two files would be the super correct way to go:


Yes, but then this is just some stats which a very limited number of
people should be using and so ...

>   - Deprecate the old file and keep the 10ms around for some years

>     ./Documentation/ABI/obsolete

>   - Add the new interface and document it in:

>    ./Documentation/ABI/testing

> 

> As this is about a minor cpufreq_stat debug file, it is enough if

> you rename to:

> > /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state_ms


... I agree about this. Just rename the file accordingly. Which will
also make sure that everyone follows that something got changed in the
kernel.

> > I already fixed this recently and stats don't appear empty for fast

> > switch anymore.

> 

> Then cpufreq_stats could be a module again?


No, not really. This is some code that needs to get called from
cpufreq core, without any notifiers and as fast as possible as we may
be in scheduler's hot path. So the module thing isn't going to work
now.

-- 
viresh
Viresh Kumar Nov. 11, 2020, 10:30 a.m. UTC | #10
On 11-11-20, 09:42, Thomas Renninger wrote:
> Hi,

> 

> sorry for high-jacking this thread, it is at least related and afaik you are

> deeper involved in this:

> 

> (cutting CC list)

> 

> Am Dienstag, 10. November 2020, 12:07:37 CET schrieb Viresh Kumar:

> > The cpufreq and thermal core, both provide sysfs statistics to help

> > userspace learn about the behavior of frequencies and cooling states.

> ...

> > /sys/class/thermal/cooling_device0/stats/time_in_state_ms:state0 4097

> There is the trans_table for cooling devices in the same dir:

> /sys/devices/virtual/thermal/cooling_device*/stats/trans_table

> 

> I recently stumbled over this in a bug report and realized that it it seem

> to overflow rather quickly due  to static memory usage.

> Fixing it seem to be rather complex, not sure it's worth it.

> 

> for device 0-3 I get:

> cat /sys/devices/virtual/thermal/cooling_device0/stats/trans_table

>  From  :    To

>        : state 0  

> state 0:       0

> 

> and when it seem to get interesting (device 4 and 5), I get:


How many total states are there ? Must be really large.

> cat /sys/devices/virtual/thermal/cooling_device4/stats/trans_table

> cat: /sys/devices/virtual/thermal/cooling_device4/stats/trans_table: File too large

> 

> 

> Just a heads up.

> Maybe it's worth to touch this as well if sysfs is changed in this area anyway.

> Afaik sysfs forbids such data like whole transition tables in one file and dynamic

> mem alloc.


Yes, but cpufreq was already in since a long time and this went in
with the same philosophy.

> Either it gets split up into

> ../cooling_device0/stats/trans_to_1

> ../cooling_device0/stats/trans_to_2


I think yes.

> or maybe this should better live in debugfs?


I think I tried the debugfs way earlier but the maintainers of thermal
asked me to do it sysfs way.

-- 
viresh
Thomas Renninger Nov. 11, 2020, 11:13 p.m. UTC | #11
Am Mittwoch, 11. November 2020, 11:30:18 CET schrieb Viresh Kumar:
> On 11-11-20, 09:42, Thomas Renninger wrote:

> > Hi,

> > 

> > sorry for high-jacking this thread, it is at least related and afaik you

> > are deeper involved in this:

..
> > and when it seem to get interesting (device 4 and 5), I get:

> How many total states are there ? Must be really large.

..

Ah, yes. The first 4 devices which have 3 thermal states seem to be connected to each CPU
via ACPI:
/sys/devices/virtual/thermal/cooling_device0/device -> ../../../LNXSYSTM:00/LNXCPU:00

The latter two, device 4+5 do not have this "device" sysfs link at all and cooling_device4 has 50,
cooling_device5 20 states.

All cooling devices max_state:
cat /sys/devices/virtual/thermal/cooling_device*/max_state 
3
3
3
3
50
20

cur_state is 0 by default for all.
Doing something like (not sure it was a 2):
echo 2 >/sys/devices/virtual/thermal/cooling_device4/cur_state
ends up in:
Nov 11 23:33:27 c100 kernel: NOHZ: local_softirq_pending 80
Nov 11 23:33:27 c100 kernel: NOHZ: local_softirq_pending 80
Nov 11 23:33:21 c100 kernel: intel_powerclamp: Start idle injection to reduce power
in kernel logs.

But echoing this static cur_state number in, seem to trigger dynamic adjusting
and the number gets up and down by itself:

while true;do cat /sys/devices/virtual/thermal/cooling_device4/cur_state;sleep 1;done
86
83
89
90
90
92
91
^C
c100:/home/trenn # cat /sys/devices/virtual/thermal/cooling_device4/max_state
50

Disabling works with setting cur_state to 0, but cur_state is -1 afterwards ?!?:
echo 0 >/sys/devices/virtual/thermal/cooling_device4/cur_state
cat /sys/devices/virtual/thermal/cooling_device4/cur_state 
-1

echo 1 >/sys/devices/virtual/thermal/cooling_device4/cur_state
c100:/home/trenn # cat /sys/devices/virtual/thermal/cooling_device4/cur_state 
86

It looks like any number echoed into the cur_state device, even above max_state enables
dynamic power adjusting:
echo 10100000 >/sys/devices/virtual/thermal/cooling_device4/cur_state
-> no error.

ACPI related CPU cooling devices 0-3 work as one would expect:
3 states (max_state) one can set them and trying to write something into cur_state
which is above max_state ends up in:
echo 4 >/sys/devices/virtual/thermal/cooling_device2/cur_state
bash: echo: Schreibfehler: Invalid argument.

Hardware:
Dell, Latitude E7470
cpu family      : 6
model           : 78
model name      : Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
stepping        : 3
microcode       : 0xd6


Adding Jacob who seem to be involved into powerclamp driver.
Maybe he can shed some light on this...

Thanks,

    Thomas
diff mbox series

Patch

diff --git a/Documentation/cpu-freq/cpufreq-stats.rst b/Documentation/cpu-freq/cpufreq-stats.rst
index 9ad695b1c7db..9f94012a882f 100644
--- a/Documentation/cpu-freq/cpufreq-stats.rst
+++ b/Documentation/cpu-freq/cpufreq-stats.rst
@@ -64,9 +64,8 @@  need for a reboot.
 
 This gives the amount of time spent in each of the frequencies supported by
 this CPU. The cat output will have "<frequency> <time>" pair in each line, which
-will mean this CPU spent <time> usertime units of time at <frequency>. Output
-will have one line for each of the supported frequencies. usertime units here
-is 10mS (similar to other time exported in /proc).
+will mean this CPU spent <time> msec of time at <frequency>. Output will have
+one line for each of the supported frequencies.
 
 ::
 
diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c
index 6cd5c8ab5d49..e054ada291e7 100644
--- a/drivers/cpufreq/cpufreq_stats.c
+++ b/drivers/cpufreq/cpufreq_stats.c
@@ -14,35 +14,38 @@ 
 
 struct cpufreq_stats {
 	unsigned int total_trans;
-	unsigned long long last_time;
+	ktime_t last_time;
 	unsigned int max_state;
 	unsigned int state_num;
 	unsigned int last_index;
-	u64 *time_in_state;
+	ktime_t *time_in_state;
 	unsigned int *freq_table;
 	unsigned int *trans_table;
 
 	/* Deferred reset */
 	unsigned int reset_pending;
-	unsigned long long reset_time;
+	ktime_t reset_time;
 };
 
-static void cpufreq_stats_update(struct cpufreq_stats *stats,
-				 unsigned long long time)
+static void cpufreq_stats_update(struct cpufreq_stats *stats, ktime_t time)
 {
-	unsigned long long cur_time = get_jiffies_64();
+	ktime_t cur_time = ktime_get(), delta;
 
-	stats->time_in_state[stats->last_index] += cur_time - time;
+	delta = ktime_sub(cur_time, time);
+	stats->time_in_state[stats->last_index] =
+		ktime_add(stats->time_in_state[stats->last_index], delta);
 	stats->last_time = cur_time;
 }
 
 static void cpufreq_stats_reset_table(struct cpufreq_stats *stats)
 {
-	unsigned int count = stats->max_state;
+	unsigned int count = stats->max_state, i;
+
+	for (i = 0; i < count; i++)
+		stats->time_in_state[i] = ktime_set(0, 0);
 
-	memset(stats->time_in_state, 0, count * sizeof(u64));
 	memset(stats->trans_table, 0, count * count * sizeof(int));
-	stats->last_time = get_jiffies_64();
+	stats->last_time = ktime_get();
 	stats->total_trans = 0;
 
 	/* Adjust for the time elapsed since reset was requested */
@@ -70,7 +73,7 @@  static ssize_t show_time_in_state(struct cpufreq_policy *policy, char *buf)
 {
 	struct cpufreq_stats *stats = policy->stats;
 	bool pending = READ_ONCE(stats->reset_pending);
-	unsigned long long time;
+	ktime_t time, now = ktime_get(), delta;
 	ssize_t len = 0;
 	int i;
 
@@ -82,18 +85,20 @@  static ssize_t show_time_in_state(struct cpufreq_policy *policy, char *buf)
 				 * before the reset_pending read above.
 				 */
 				smp_rmb();
-				time = get_jiffies_64() - READ_ONCE(stats->reset_time);
+				time = ktime_sub(now, READ_ONCE(stats->reset_time));
 			} else {
-				time = 0;
+				time = ktime_set(0, 0);;
 			}
 		} else {
 			time = stats->time_in_state[i];
-			if (i == stats->last_index)
-				time += get_jiffies_64() - stats->last_time;
+			if (i == stats->last_index) {
+				delta = ktime_sub(now, stats->last_time);
+				time = ktime_add(delta, time);
+			}
 		}
 
 		len += sprintf(buf + len, "%u %llu\n", stats->freq_table[i],
-			       jiffies_64_to_clock_t(time));
+			       ktime_to_ms(time));
 	}
 	return len;
 }
@@ -109,7 +114,7 @@  static ssize_t store_reset(struct cpufreq_policy *policy, const char *buf,
 	 * Defer resetting of stats to cpufreq_stats_record_transition() to
 	 * avoid races.
 	 */
-	WRITE_ONCE(stats->reset_time, get_jiffies_64());
+	WRITE_ONCE(stats->reset_time, ktime_get());
 	/*
 	 * The memory barrier below is to prevent the readers of reset_time from
 	 * seeing a stale or partially updated value.
@@ -228,9 +233,9 @@  void cpufreq_stats_create_table(struct cpufreq_policy *policy)
 	if (!stats)
 		return;
 
-	alloc_size = count * sizeof(int) + count * sizeof(u64);
-
-	alloc_size += count * count * sizeof(int);
+	alloc_size = count * sizeof(*stats->time_in_state);
+	alloc_size += count * sizeof(*stats->freq_table);
+	alloc_size += count * count * sizeof(*stats->trans_table);
 
 	/* Allocate memory for time_in_state/freq_table/trans_table in one go */
 	stats->time_in_state = kzalloc(alloc_size, GFP_KERNEL);
@@ -249,7 +254,7 @@  void cpufreq_stats_create_table(struct cpufreq_policy *policy)
 			stats->freq_table[i++] = pos->frequency;
 
 	stats->state_num = i;
-	stats->last_time = get_jiffies_64();
+	stats->last_time = ktime_get();
 	stats->last_index = freq_table_get_index(stats, policy->cur);
 
 	policy->stats = stats;