[v2,0/3] cpufreq: Allow drivers to receive more information from the governor

Message ID 3827230.0GnL3RTcl1@kreacher

Message

Rafael J. Wysocki Dec. 14, 2020, 8:01 p.m. UTC
Hi,

The timing of this is not perfect (sorry about that), but here's a refresh
of this series.

The majority of the previous cover letter still applies:

On Monday, December 7, 2020 5:25:38 PM CET Rafael J. Wysocki wrote:
> 
> This is based on the RFC posted a few days ago:
> 
> https://lore.kernel.org/linux-pm/1817571.2o5Kk4Ohv2@kreacher/
> 
>  Using intel_pstate in the passive mode with HWP enabled, in particular under
>  the schedutil governor, is still kind of problematic, because it has to assume
>  that it should not allow the frequency to fall below the one requested by the
>  governor.  For this reason, it translates the target frequency into HWP.REQ.MIN
>  which generally causes the processor to run a bit too fast.
> 
>  Moreover, this allows the HWP algorithm to use any frequency between the target
>  one and HWP.REQ.MAX that corresponds to the policy max limit and some workloads
>  cause it to go for the max turbo frequency prematurely which hurts energy-
>  efficiency without improving performance, even though the schedutil governor
>  itself would not allow the frequency to ramp up so fast.
> 
>  This patch series attempts to improve the situation by introducing a new driver
>  callback allowing the driver to receive more information from the governor.  In
>  particular, this allows the min (required) and target (desired) performance
>  levels to be passed to it and those can be used to give better hints to the
>  hardware.

In this second revision there are three patches: a preparatory patch for
schedutil that hasn't changed since v1, the introduction of the new callback
along with the schedutil changes to use it in patch [2/3], and the intel_pstate
changes in patch [3/3], which are the same as before.
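
To make the shape of the new interface concrete, here is a minimal sketch of
the kind of callback being discussed (illustrative only; see patch [2/3] for
the actual interface):

/*
 * Illustrative only: a driver-side hook that receives performance levels
 * from the governor instead of a single target frequency.
 */
struct cpufreq_driver_example {
        /* ... existing callbacks such as ->target_index()/->fast_switch() ... */
        void (*adjust_perf)(unsigned int cpu,
                            unsigned long min_perf,    /* required level */
                            unsigned long target_perf, /* desired level */
                            unsigned long capacity);   /* scale of both values */
};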

Please see patch changelogs for details.

Thanks!

Comments

Doug Smythies Dec. 17, 2020, 3:26 p.m. UTC | #1
On 2020.12.14 12:02 Rafael J. Wysocki wrote:
> Hi,

Hi Rafael,

V2 test results below are new, other results are partially re-stated:

For readers who do not want to read on: I didn't find anything different from
the other versions. This was mostly due diligence.

Legend:

hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq
rfc (or rjw): Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpufreq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpufreq
acpi (or acpi-cpufreq): Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + V1 patch set, HWP enabled; intel_cpufreq; schedutil
v2: Kernel 5.10-rc7 + V2 patch set, HWP enabled; intel_cpufreq; schedutil

Fixed work packet, fixed period, periodic workflow, load sweep up/down:

load work/sleep frequency: 73 Hertz:

hwp: Average: 12.00822 watts
rjw: Average: 10.18089 watts
no-hwp: Average: 10.21947 watts
acpi-cpufreq: Average:  9.06585 watts
patch: Average: 10.26060 watts
v2: Average: 10.50444

load work/sleep frequency: 113 Hertz:

hwp: Average: 12.01056
rjw: Average: 10.12303
no-hwp: Average: 10.08228
acpi-cpufreq: Average:  9.02215
patch: Average: 10.27055
v2: Average: 10.31097

load work/sleep frequency: 211 Hertz:

hwp: Average: 12.16067
rjw: Average: 10.24413
no-hwp: Average: 10.12463
acpi-cpufreq: Average:  9.19175
patch: Average: 10.33000
v2: Average: 10.39811

load work/sleep frequency: 347 Hertz:

hwp: Average: 12.34169
rjw: Average: 10.79980
no-hwp: Average: 10.57296
acpi-cpufreq: Average:  9.84709
patch: Average: 10.67029
v2: Average: 10.93143

load work/sleep frequency: 401 Hertz:

hwp: Average: 12.42562
rjw: Average: 11.12465
no-hwp: Average: 11.24203
acpi-cpufreq: Average: 10.78670
patch: Average: 10.94514
v2: Average: 11.50324


Serialized single-threaded workload via the PIDs-per-second method,
a.k.a. fixed work packet, variable period.
Results:

Execution times (seconds. Less is better):

no-hwp:

performance: Samples: 382  ; Average: 10.54450  ; Stand Deviation:  0.01564 ; Maximum: 10.61000 ; Minimum: 10.50000

schedutil: Samples: 293  ; Average: 13.73416  ; Stand Deviation:  0.73395 ; Maximum: 15.46000 ; Minimum: 11.68000
acpi: Samples: 253  ; Average: 15.94889  ; Stand Deviation:  1.28219 ; Maximum: 18.66000 ; Minimum: 12.04000

hwp:

schedutil: Samples: 380  ; Average: 10.58287  ; Stand Deviation:  0.01864 ; Maximum: 10.64000 ; Minimum: 10.54000
patch: Samples: 276  ; Average: 14.57029 ; Stand Deviation:  0.89771 ; Maximum: 16.04000 ; Minimum: 11.68000
rfc: Samples: 271  ; Average: 14.86037  ; Stand Deviation:  0.84164 ; Maximum: 16.04000 ; Minimum: 12.21000
v2: Samples: 274  ; Average: 14.67978  ; Stand Deviation:  1.03378 ; Maximum: 16.07000 ; Minimum: 11.43000

Power (watts. More indicates higher CPU frequency and better performance. Sample time = 1 second.):

no-hwp:

performance: Samples: 4000  ; Average: 25.41355  ; Stand Deviation:  0.22156 ; Maximum: 26.01996 ; Minimum: 24.08807

schedutil: Samples: 4000  ; Average: 12.58863  ; Stand Deviation:  5.48600 ; Maximum: 25.50934 ; Minimum:  7.54559
acpi: Samples: 4000  ; Average:  9.57924  ; Stand Deviation:  5.41157 ; Maximum: 25.06366 ; Minimum:  5.51129

hwp:

schedutil: Samples: 4000  ; Average: 25.24245  ; Stand Deviation:  0.19539 ; Maximum: 25.93671 ; Minimum: 24.14746
patch: Samples: 4000  ; Average: 11.07225  ; Stand Deviation:  5.63142 ; Maximum: 24.99493 ; Minimum:  3.67548
rfc: Samples: 4000  ; Average: 10.35842  ; Stand Deviation:  4.77915 ; Maximum: 24.95953 ; Minimum:  7.26202
v2: Samples: 4000  ; Average: 10.98284  ; Stand Deviation:  5.48859 ; Maximum: 25.76331 ; Minimum:  7.53790
Giovanni Gherdovich Dec. 18, 2020, 4:11 p.m. UTC | #2
On Mon, 2020-12-14 at 21:01 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> The timing of this is not perfect (sorry about that), but here's a refresh
> of this series.
>
> The majority of the previous cover letter still applies:
> [...]

Hello,

the series is tested using

-> tbench (packets processing with loopback networking, measures throughput)
-> dbench (filesystem operations, measures average latency)
-> kernbench (kernel compilation, elapsed time)
-> and gitsource (long-running shell script, elapsed time)

These are chosen because none of them is compute-bound and all are
sensitive to frequency scaling decisions. The machines are a Cascade Lake based
server, a Skylake client, and a Coffee Lake laptop.

What's being compared:

sugov-HWP.desired : the present series;  intel_pstate=passive,  governor=schedutil
sugov-HWP.min     : mainline;            intel_pstate=passive,  governor=schedutil
powersave-HWP     : mainline;            intel_pstate=active,   governor=powersave
perfgov-HWP       : mainline;            intel_pstate=active,   governor=performance
sugov-no-HWP      : HWP disabled;        intel_pstate=passive,  governor=schedutil

Dbench and Kernbench have neutral results, but Tbench has sugov-HWP.desired
losing in both performance and performance-per-watt, while Gitsource shows the
series as faster in raw performance but again worse than the competition in
efficiency.

1. SUMMARY BY BENCHMARK
   1.1. TBENCH
   1.2. DBENCH
   1.3. KERNBENCH
   1.4. GITSOURCE
2. SUMMARY BY USER PROFILE
   2.1. PERFORMANCE USER: what if I switch perfgov -> schedutil?
   2.2. DEFAULT USER: what if I switch powersave -> schedutil?
   2.3. DEVELOPER: what if I switch sugov-HWP.min -> sugov-HWP.desired?
3. RESULTS TABLES
   PERFORMANCE RATIOS
   PERFORMANCE-PER-WATT RATIOS


1. SUMMARY BY BENCHMARK
~~~~~~~~~~~~~~~~~~~~~~~

Tbench: sugov-HWP.desired has the worst performance on all three
    machines. sugov-HWP.min is between 20% and 90% better. The baseline
    sugov-HWP.desired offers a lower throughput, but does it increase
    efficiency? It actually doesn't: on two out of three machines the
    incumbent code (current sugov, or intel_pstate=active) has 10% to 35%
    better efficiency. In other words, the status quo is both faster and more
    efficient than the proposed series on this benchmark.
    The absolute power consumption is lower, but the delivered performance
    drops even more, which is why performance-per-watt shows a net loss.

Dbench: generally neutral, in both performance and efficiency. Powersave is
    occasionally behind the pack in performance, 5% to 15%. A 15% performance
    loss on the Coffee Lake is compensated by an 80% improved efficiency. Note
    that on the same Coffee Lake sugov-no-HWP is 20% ahead of the pack
    in efficiency.

Kernbench: neutral, in both performance and efficiency. Powersave loses 14%
    to the pack in performance on the Cascade Lake.

Gitsource: this test shows the most compelling case against the
    sugov-HWP.desired series: on the Cascade Lake, sugov-HWP.desired is 10%
    faster than sugov-HWP.min (it was expected to be slower!) and 35% less
    efficient (we expected more performance-per-watt, not less).


2. SUMMARY BY USER PROFILE
~~~~~~~~~~~~~~~~~~~~~~~~~~

If I were a perfgov-HWP user, I would be 20%-90% faster than with other governors
on tbench and gitsource. This speed gap comes with an unexpected efficiency
bonus on both tests. Since dbench and kernbench have a flat profile across the
board, there is no incentive to try another governor.

If I were a powersave-HWP user, I'd be the slowest of the bunch. The lost
performance is not, in general, balanced by better efficiency. This only
happens on the Coffee Lake, which is a CPU for the mobile market where HWP
possibly has efficiency-oriented tuning. Any flavor of schedutil would be an
improvement.

From a developer perspective, the obstacles to moving from HWP.min to
HWP.desired are tbench, where HWP.desired is worse than having no HWP support
at all, and gitsource, where HWP.desired has the opposite properties to
those advertised (it's actually faster but less efficient).


3. RESULTS TABLES
~~~~~~~~~~~~~~~~~

Tilde (~) means the result is the same as baseline (or, the ratio is close to 1).
The double asterisk (**) is a visual aid and means the result is better than
baseline (higher or lower depending on the case).


| 80x_CASCADELAKE_NUMA: Intel Cascade Lake, 40 cores / 80 threads, NUMA, SATA SSD storage
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|            sugov-HWP.des  sugov-HWP.min  powersave-HWP  perfgov-HWP  sugov-no-HWP   better if
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|                                         PERFORMANCE RATIOS
| tbench         1.00           1.89**         1.88**        1.89**        1.17**       higher
| dbench         1.00           ~              1.06          ~             ~            lower 
| kernbench      1.00           ~              1.14          ~             ~            lower 
| gitsource      1.00           1.11           2.70          0.80**        ~            lower 
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|                                    PERFORMANCE-PER-WATT RATIOS
| tbench         1.00           1.36**         1.38**        1.33**        1.04**       higher
| dbench         1.00           ~              ~             ~             ~            higher
| kernbench      1.00           ~              ~             ~             ~            higher
| gitsource      1.00           1.36**         0.63          1.22**        1.02**       higher


| 8x_COFFEELAKE_UMA: Intel Coffee Lake, 4 cores / 8 threads, UMA, NVMe SSD storage
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|            sugov-HWP.des  sugov-HWP.min  powersave-HWP  perfgov-HWP  sugov-no-HWP   better if
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|                                         PERFORMANCE RATIOS
| tbench         1.00           1.27**         1.30**        1.30**        1.31**       higher
| dbench         1.00           ~              1.15          ~             ~            lower 
| kernbench      1.00           ~              ~             ~             ~            lower 
| gitsource      1.00           ~              2.09          ~             ~            lower 
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|                                    PERFORMANCE-PER-WATT RATIOS
| tbench         1.00           ~              ~             ~             ~            higher
| dbench         1.00           ~              1.82**        ~             1.22**       higher
| kernbench      1.00           ~              ~             ~             ~            higher
| gitsource      1.00           ~              1.56**        ~             1.17**       higher


| 8x_SKYLAKE_UMA: Intel Skylake (client), 4 cores / 8 threads, UMA, SATA SSD storage
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|            sugov-HWP.des  sugov-HWP.min  powersave-HWP  perfgov-HWP  sugov-no-HWP   better if
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|                                         PERFORMANCE RATIOS
| tbench         1.00           1.21**         1.22**        1.20**        1.06**       higher
| dbench         1.00           ~              ~             ~             ~            lower 
| kernbench      1.00           ~              ~             ~             ~            lower 
| gitsource      1.00           ~              1.71          0.96**        ~            lower 
+ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|                                    PERFORMANCE-PER-WATT RATIOS
| tbench         1.00           1.11**         1.12**        1.10**        1.03**       higher
| dbench         1.00           ~              ~             ~             ~            higher
| kernbench      1.00           ~              ~             ~             ~            higher
| gitsource      1.00           ~              0.75          ~             ~            higher



Giovanni
Rafael J. Wysocki Dec. 21, 2020, 10:41 a.m. UTC | #3
On Thu, Dec 17, 2020 at 4:27 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> On 2020.12.14 12:02 Rafael J. Wysocki wrote:
> > Hi,
>
> Hi Rafael,
>
> V2 test results below are new, other results are partially re-stated:
>
> For readers who do not want to read on: I didn't find anything different from
> the other versions. This was mostly due diligence.

Thanks a lot for the data, much appreciated as always!

Rafael J. Wysocki Dec. 21, 2020, 4:11 p.m. UTC | #4
Hi,

On Fri, Dec 18, 2020 at 5:22 PM Giovanni Gherdovich
<ggherdovich@suse.com> wrote:
>
> On Mon, 2020-12-14 at 21:01 +0100, Rafael J. Wysocki wrote:
> > Hi,
> >
> > The timing of this is not perfect (sorry about that), but here's a refresh
> > of this series.
> >
> > The majority of the previous cover letter still applies:
> > [...]
>
> Hello,
>
> the series is tested using
>
> -> tbench (packets processing with loopback networking, measures throughput)
> -> dbench (filesystem operations, measures average latency)
> -> kernbench (kernel compilation, elapsed time)
> -> and gitsource (long-running shell script, elapsed time)
>
> These are chosen because none of them is compute-bound and all are
> sensitive to frequency scaling decisions. The machines are a Cascade Lake based
> server, a Skylake client, and a Coffee Lake laptop.

First of all, many thanks for the results!

Any test results input is always much appreciated for all of the
changes under consideration.

> What's being compared:
>
> sugov-HWP.desired : the present series;  intel_pstate=passive,  governor=schedutil
> sugov-HWP.min     : mainline;            intel_pstate=passive,  governor=schedutil
> powersave-HWP     : mainline;            intel_pstate=active,   governor=powersave
> perfgov-HWP       : mainline;            intel_pstate=active,   governor=performance
> sugov-no-HWP      : HWP disabled;        intel_pstate=passive,  governor=schedutil
>
> Dbench and Kernbench have neutral results, but Tbench has sugov-HWP.desired
> losing in both performance and performance-per-watt, while Gitsource shows the
> series as faster in raw performance but again worse than the competition in
> efficiency.

Well, AFAICS tbench "likes" high turbo and is sensitive to the
response time (as indicated by the fact that it is also sensitive to
the polling limit value in cpuidle).

Using the target perf to set HWP_REQ.DESIRED (instead of using it to
set HWP_REQ.MIN) generally causes the turbo to be less aggressive and
the response time to go up, so the tbench result is not a surprise at
all.  This case represents the tradeoff being made here (as noted by
Doug in one of his previous messages).
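
For reference, a rough sketch of how the two strategies populate the relevant
HWP request fields (field layout per the SDM; the helper and variable names
below are made up for illustration):

#include <stdint.h>

/* IA32_HWP_REQUEST fields used here: bits 7:0 minimum, 15:8 maximum,
 * 23:16 desired, 31:24 energy/performance preference (EPP). */
static uint64_t hwp_req_pack(uint8_t min, uint8_t max, uint8_t desired,
                             uint8_t epp)
{
        return (uint64_t)min | ((uint64_t)max << 8) |
               ((uint64_t)desired << 16) | ((uint64_t)epp << 24);
}

/* Mainline approach: floor the range at the governor's target and leave the
 * desired field at 0, so the HWP algorithm may pick anything up to the
 * policy maximum. */
static uint64_t hwp_req_min_style(uint8_t target, uint8_t max, uint8_t epp)
{
        return hwp_req_pack(target, max, 0, epp);
}

/* This series (roughly): keep a wide min..max range, take the floor from the
 * governor's "required" level and hint the "sufficient" level through the
 * desired field. */
static uint64_t hwp_req_desired_style(uint8_t required, uint8_t target,
                                      uint8_t max, uint8_t epp)
{
        return hwp_req_pack(required, max, target, epp);
}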

The gitsource result is a bit counter-intuitive, but my conclusions
drawn from it are quite different from yours (more on that below).

> 1. SUMMARY BY BENCHMARK
>    1.1. TBENCH
>    1.2. DBENCH
>    1.3. KERNBENCH
>    1.4. GITSOURCE
> 2. SUMMARY BY USER PROFILE
>    2.1. PERFORMANCE USER: what if I switch perfgov -> schedutil?
>    2.2. DEFAULT USER: what if I switch powersave -> schedutil?
>    2.3. DEVELOPER: what if I switch sugov-HWP.min -> sugov-HWP.desired?
> 3. RESULTS TABLES
>    PERFORMANCE RATIOS
>    PERFORMANCE-PER-WATT RATIOS
>
>
> 1. SUMMARY BY BENCHMARK
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> Tbench: sugov-HWP.desired has the worst performance on all three
>     machines. sugov-HWP.min is between 20% and 90% better. The baseline
>     sugov-HWP.desired offers a lower throughput, but does it increase
>     efficiency? It actually doesn't: on two out of three machines the
>     incumbent code (current sugov, or intel_pstate=active) has 10% to 35%
>     better efficiency. In other words, the status quo is both faster and more
>     efficient than the proposed series on this benchmark.
>     The absolute power consumption is lower, but the delivered performance
>     drops even more, which is why performance-per-watt shows a net loss.

This benchmark is best off when run under the performance governor, and the
observation that sugov-HWP.min is almost as good as the performance governor
for it is a consequence of a bias towards performance in the former (which
need not be regarded as a good thing).

The drop in energy-efficiency is somewhat disappointing, but not entirely
unexpected either.

> Dbench: generally neutral, in both performance and efficiency. Powersave is
>     occasionally behind the pack in performance, 5% to 15%. A 15% performance
>     loss on the Coffee Lake is compensated by an 80% improved efficiency. Note
>     that on the same Coffee Lake sugov-no-HWP is 20% ahead of the pack
>     in efficiency.
>
> Kernbench: neutral, in both performance and efficiency. Powersave loses 14%
>     to the pack in performance on the Cascade Lake.
>
> Gitsource: this test shows the most compelling case against the
>     sugov-HWP.desired series: on the Cascade Lake, sugov-HWP.desired is 10%
>     faster than sugov-HWP.min (it was expected to be slower!) and 35% less
>     efficient (we expected more performance-per-watt, not less).


This is a bit counter-intuitive, so it is good to try to understand
what's going on instead of drawing conclusions right away from pure
numbers.

My interpretation of the available data is that gitsource benefits
from the "race-to-idle" effect in terms of energy-efficiency which
also causes it to suffer in terms of performance.  Namely, completing
the given piece of work faster causes some CPU idle time to become
available and that effectively reduces power, but it also increases
the response time (by the idle state exit latency) which causes
performance to drop. Whether or not this effect can be present depends
on what CPU idle states are available etc. and it may be a pure
coincidence.

What sugov-HWP.desired really does is to bias the frequency towards
whatever is perceived by schedutil as sufficient to run the workload
(which is a key property of it - see below) and it appears to do the
job here quite well, but it eliminates the "race-to-idle" effect that
the workload benefited from originally and, like it or not, schedutil
cannot know about that effect.

That effect can only be present if the frequencies used for running
the workload are too high and by a considerable margin (sufficient for
a deep enough idle state to be entered).  In some cases running the
workload too fast helps (like in this one, although this time it
happens to hurt performance), but in some cases it really hurts
energy-efficiency and the question is whether or not this should be
always done.

There is a whole broad category of workloads involving periodic tasks
that do the same amount of work in every period regardless of the
frequency they run at (as long as the frequency is sufficient to avoid
"overrunning" the period) and they almost never benefit from
"race-to-idle".There is zero benefit from running them too fast and
the energy-efficiency goes down the sink when that happens.

Now the problem is that with sugov-HWP.min the users who care about
these workloads don't even have an option to use the task utilization
history recorded by the scheduler to bias the frequency towards the
"sufficient" level, because sugov-HWP.min only sets a lower bound on
the frequency selection to improve the situation, so the choice
between it and sugov-HWP.desired boils down to whether or not to give
that option to them and my clear preference is for that option to
exist.  Sorry about that.  [Note that it really is an option, though,
because "pure" HWP is still the default for HWP-enabled systems.]

It may be possible to restore some "race-to-idle" benefits by tweaking
HWP_REQ.EPP in the future, but that needs to be investigated.

BTW, what EPP value was there on the system where you saw better
performance under sugov-HWP.desired?  If it was greater than zero, it
would be useful to decrease EPP (by adjusting the
energy_performance_preference attributes in sysfs for all CPUs) and
see what happens to the performance difference then.


Giovanni Gherdovich Dec. 23, 2020, 1:06 p.m. UTC | #5
On Mon, 2020-12-21 at 17:11 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> On Fri, Dec 18, 2020 at 5:22 PM Giovanni Gherdovich wrote:
> >
> > Gitsource: this test shows the most compelling case against the
> >     sugov-HWP.desired series: on the Cascade Lake, sugov-HWP.desired is 10%
> >     faster than sugov-HWP.min (it was expected to be slower!) and 35% less
> >     efficient (we expected more performance-per-watt, not less).
>
> This is a bit counter-intuitive, so it is good to try to understand
> what's going on instead of drawing conclusions right away from pure
> numbers.
>
> My interpretation of the available data is that gitsource benefits
> from the "race-to-idle" effect in terms of energy-efficiency which
> also causes it to suffer in terms of performance.  Namely, completing
> the given piece of work faster causes some CPU idle time to become
> available and that effectively reduces power, but it also increases
> the response time (by the idle state exit latency) which causes
> performance to drop. Whether or not this effect can be present depends
> on what CPU idle states are available etc. and it may be a pure
> coincidence.
>
> [snip]

Right, race-to-idle might explain the increased efficiency of HWP.MIN.
As you note, increased exit latencies from idle can also explain the overall
performance difference.

> There is a whole broad category of workloads involving periodic tasks
> that do the same amount of work in every period regardless of the
> frequency they run at (as long as the frequency is sufficient to avoid
> "overrunning" the period) and they almost never benefit from
> "race-to-idle". There is zero benefit from running them too fast and
> the energy-efficiency goes down the sink when that happens.
>
> Now the problem is that with sugov-HWP.min the users who care about
> these workloads don't even have an option to use the task utilization
> history recorded by the scheduler to bias the frequency towards the
> "sufficient" level, because sugov-HWP.min only sets a lower bound on
> the frequency selection to improve the situation, so the choice
> between it and sugov-HWP.desired boils down to whether or not to give
> that option to them and my clear preference is for that option to
> exist.  Sorry about that.  [Note that it really is an option, though,
> because "pure" HWP is still the default for HWP-enabled systems.]

Sure, the periodic workloads benefit from this patch, Doug's test shows that.

I guess I'm still confused by the difference between setting HWP.DESIRED and
disabling HWP completely. The Intel manual says that a non-zero HWP.DESIRED
"effectively disabl[es] HW autonomous selection", but then continues with "The
Desired_Performance input is non-constraining in terms of Performance and
Energy optimizations, which are independently controlled". The first
statement sounds like HWP is out of the picture (no more autonomous
frequency selections) but the latter part implies there are other
optimizations still available. I'm not sure how to reason about that.

> It may be possible to restore some "race-to-idle" benefits by tweaking
> HWP_REQ.EPP in the future, but that needs to be investigated.
>
> BTW, what EPP value was there on the system where you saw better
> performance under sugov-HWP.desired?  If it was greater than zero, it
> would be useful to decrease EPP (by adjusting the
> energy_performance_preference attributes in sysfs for all CPUs) and
> see what happens to the performance difference then.

For sugov-HWP.desired the EPP was 0x80 (the default value).


Giovanni
Rafael J. Wysocki Dec. 28, 2020, 7:11 p.m. UTC | #6
On Wed, Dec 23, 2020 at 2:08 PM Giovanni Gherdovich
<ggherdovich@suse.com> wrote:
>
> On Mon, 2020-12-21 at 17:11 +0100, Rafael J. Wysocki wrote:
> > Hi,
> >
> > On Fri, Dec 18, 2020 at 5:22 PM Giovanni Gherdovich wrote:
> > >
> > > Gitsource: this test shows the most compelling case against the
> > >     sugov-HWP.desired series: on the Cascade Lake, sugov-HWP.desired is 10%
> > >     faster than sugov-HWP.min (it was expected to be slower!) and 35% less
> > >     efficient (we expected more performance-per-watt, not less).
> >
> > This is a bit counter-intuitive, so it is good to try to understand
> > what's going on instead of drawing conclusions right away from pure
> > numbers.
> >
> > My interpretation of the available data is that gitsource benefits
> > from the "race-to-idle" effect in terms of energy-efficiency which
> > also causes it to suffer in terms of performance.  Namely, completing
> > the given piece of work faster causes some CPU idle time to become
> > available and that effectively reduces power, but it also increases
> > the response time (by the idle state exit latency) which causes
> > performance to drop. Whether or not this effect can be present depends
> > on what CPU idle states are available etc. and it may be a pure
> > coincidence.
> >
> > [snip]
>
> Right, race-to-idle might explain the increased efficiency of HWP.MIN.
> As you note, increased exit latencies from idle can also explain the overall
> performance difference.
>
> > There is a whole broad category of workloads involving periodic tasks
> > that do the same amount of work in every period regardless of the
> > frequency they run at (as long as the frequency is sufficient to avoid
> > "overrunning" the period) and they almost never benefit from
> > "race-to-idle". There is zero benefit from running them too fast and
> > the energy-efficiency goes down the sink when that happens.
> >
> > Now the problem is that with sugov-HWP.min the users who care about
> > these workloads don't even have an option to use the task utilization
> > history recorded by the scheduler to bias the frequency towards the
> > "sufficient" level, because sugov-HWP.min only sets a lower bound on
> > the frequency selection to improve the situation, so the choice
> > between it and sugov-HWP.desired boils down to whether or not to give
> > that option to them and my clear preference is for that option to
> > exist.  Sorry about that.  [Note that it really is an option, though,
> > because "pure" HWP is still the default for HWP-enabled systems.]
>
> Sure, the periodic workloads benefit from this patch, Doug's test shows that.
>
> I guess I'm still confused by the difference between setting HWP.DESIRED and
> disabling HWP completely. The Intel manual says that a non-zero HWP.DESIRED
> "effectively disabl[es] HW autonomous selection", but then continues with "The
> Desired_Performance input is non-constraining in terms of Performance and
> Energy optimizations, which are independently controlled". The first
> statement sounds like HWP is out of the picture (no more autonomous
> frequency selections) but the latter part implies there are other
> optimizations still available. I'm not sure how to reason about that.

For example, if HWP_REQ.DESIRED is set below the point of maximum
energy-efficiency that is known to the processor, it is allowed to go
for the max energy-efficiency instead of following the hint.
Likewise, if the hint is above the P-state corresponding to the max
performance in the given conditions (i.e. increasing the frequency is
not likely to result in better performance due to some limitations
known to the processor), the processor is allowed to set that P-state
instead of following the hint.

Generally speaking, the processor may not follow the hint if better
results can be achieved by putting the given CPU into a P-state
different from the requested one.
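
One way to picture that, purely as a mental model and not a description of
the actual hardware algorithm:

/* Mental model only: the desired value is a hint that the processor may
 * override when it knows better. */
static unsigned int effective_pstate(unsigned int desired,
                                     unsigned int most_efficient,
                                     unsigned int max_useful)
{
        if (desired < most_efficient)
                return most_efficient;  /* not below the known efficient point */
        if (desired > max_useful)
                return max_useful;      /* no gain beyond this right now */
        return desired;
}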

> > It may be possible to restore some "race-to-idle" benefits by tweaking
> > HWP_REQ.EPP in the future, but that needs to be investigated.
> >
> > BTW, what EPP value was there on the system where you saw better
> > performance under sugov-HWP.desired?  If it was greater than zero, it
> > would be useful to decrease EPP (by adjusting the
> > energy_performance_preference attributes in sysfs for all CPUs) and
> > see what happens to the performance difference then.
>
> For sugov-HWP.desired the EPP was 0x80 (the default value).

So it would be worth testing with EPP=0x20 (or even lower).

Lowering the EPP should cause the processor to ramp up turbo
frequencies faster and it may also allow higher turbo frequencies to
be used.
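
For completeness, a rough user-space sketch of one way to do that across all
CPUs, assuming the intel_pstate energy_performance_preference attributes are
present (as in the configurations above); whether a raw numeric value such as
32 is accepted may depend on the kernel version, so one of the named presets
is the fallback:

#include <glob.h>
#include <stdio.h>

int main(void)
{
        const char *pattern =
                "/sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference";
        glob_t g;
        size_t i;

        if (glob(pattern, 0, NULL, &g)) {
                fprintf(stderr, "no energy_performance_preference attributes found\n");
                return 1;
        }

        for (i = 0; i < g.gl_pathc; i++) {
                FILE *f = fopen(g.gl_pathv[i], "w");

                if (!f) {
                        perror(g.gl_pathv[i]);
                        continue;
                }
                /* EPP = 0x20; use a preset such as "balance_performance"
                 * if raw numeric values are rejected */
                fputs("32", f);
                fclose(f);
        }
        globfree(&g);
        return 0;
}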