[v2,0/4] Utilization estimation (util_est) for FAIR tasks

Message ID 20171205171018.9203-1-patrick.bellasi@arm.com

Message

Patrick Bellasi Dec. 5, 2017, 5:10 p.m. UTC
This is a respin of:
   https://lkml.org/lkml/2017/11/9/546
which has been rebased on v4.15-rc2 to have util_est now working on top
of the recent PeterZ's:
   [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul

The aim of this series is to improve some PELT behaviors to make it a
better fit for the scheduling of tasks common in embedded mobile
use-cases, without affecting other classes of workloads.

A complete description of these behaviors has been presented in the
previous RFC [1] and further discussed during the last OSPM Summit [2]
as well as during the last two LPCs.

This series presents an implementation which improves the initial RFC's
prototype. Specifically, this new implementation has been verified not
to impact in any noticeable way the performance of:

    perf bench sched messaging --pipe --thread --group 8 --loop 50000

when running 30 iterations on a dual socket, 10 cores (20 threads) per
socket Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, with the
sched_feat(UTIL_EST) set to False.
With this feature enabled, the measured overhead is in the range of ~1%
on the same HW/SW test configuration.

That's the main reason why this sched feature is disabled by default.
A possible improvement would be the addition of a Kconfig option to toggle
the sched_feat default value on systems where a 1% overhead on hackbench
is not a concern, e.g. mobile systems, especially considering the
benefits coming from estimated utilization on workloads of interest.
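
For reference, such a flag would live in kernel/sched/features.h; a
minimal sketch of what the toggle could look like (only the UTIL_EST
flag name comes from this series, the Kconfig symbol is hypothetical):

   /* Sketch only: util_est behind a sched_feat() flag, default
    * disabled, with a hypothetical Kconfig option flipping the
    * default on systems where the hackbench overhead is acceptable. */
   #ifdef CONFIG_SCHED_UTIL_EST_DEFAULT    /* hypothetical option */
   SCHED_FEAT(UTIL_EST, true)
   #else
   SCHED_FEAT(UTIL_EST, false)
   #endif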

From a functional standpoint, this implementation shows a more stable
utilization signal, compared to mainline, when running synthetic
benchmarks describing a set of interesting target use-cases.
This allows for a better selection of the target CPU as well as a
faster selection of the most appropriate OPP.
A detailed description of the functional tests used has already been
covered in the previous RFC [1].

This series is based on v4.15-rc2 and is composed of four patches:
 1) a small refactoring preparing the ground
 2) introducing the required data structures to track util_est of both
    TASKs and CPUs
 3) make use of util_est in the wakeup and load balance paths
 4) make use of util_est in schedutil for frequency selection
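
As a rough sketch of the mechanism (only the util_est_runnable field
below appears in the patches; the EWMA weighting and the other names
are purely illustrative):

   /* Sketch only: names marked hypothetical are not from the series.
    * Each task carries an estimate of its utilization, refreshed when
    * it is dequeued, i.e. when its PELT signal stops tracking runtime.
    * The CPU-side signal (util_est_runnable) is then maintained as the
    * sum of the estimates of the currently RUNNABLE tasks, and
    * consumers take max(util_avg, util_est_runnable). */
   struct task_est { unsigned long util_est; };    /* hypothetical */

   static void task_est_update(struct task_est *te, unsigned long util_last)
   {
   	/* Illustrative low-pass filter: est <- 3/4 est + 1/4 last */
   	te->util_est = (3 * te->util_est + util_last) / 4;
   }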

Cheers Patrick

.:: References
==============
[1] https://lkml.org/lkml/2017/8/25/195
[2] slides: http://retis.sssup.it/ospm-summit/Downloads/OSPM_PELT_DecayClampingVsUtilEst.pdf
     video: http://youtu.be/adnSHPBGS-w

Changes v1->v2:
 - rebase on top of v4.15-rc2
 - tested that overhauled PELT code does not affect the util_est

Patrick Bellasi (4):
  sched/fair: always use unsigned long for utilization
  sched/fair: add util_est on top of PELT
  sched/fair: use util_est in LB and WU paths
  sched/cpufreq_schedutil: use util_est for OPP selection

 include/linux/sched.h            |  21 +++++
 kernel/sched/cpufreq_schedutil.c |   6 +-
 kernel/sched/debug.c             |   4 +
 kernel/sched/fair.c              | 184 ++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h          |   5 ++
 kernel/sched/sched.h             |   1 +
 6 files changed, 209 insertions(+), 12 deletions(-)

-- 
2.14.1

Comments

Peter Zijlstra Dec. 13, 2017, 4:03 p.m. UTC | #1
On Tue, Dec 05, 2017 at 05:10:14PM +0000, Patrick Bellasi wrote:
> With this feature enabled, the measured overhead is in the range of ~1%
> on the same HW/SW test configuration.

That's quite a lot; did you look where that comes from?
Mike Galbraith Dec. 13, 2017, 5:56 p.m. UTC | #2
On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> This is a respin of:
>    https://lkml.org/lkml/2017/11/9/546
> which has been rebased on v4.15-rc2 to have util_est now working on top
> of the recent PeterZ's:
>    [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> 
> The aim of this series is to improve some PELT behaviors to make it a
> better fit for the scheduling of tasks common in embedded mobile
> use-cases, without affecting other classes of workloads.

I thought perhaps this patch set would improve the below behavior, but
alas it does not.  That's 3 instances of firefox playing youtube clips
being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
serializes the threaded desktop, making that threading kinda pointless,
and CFS not all that fair.

 6569 root      20   0    4048    704    628 R 100.0 0.004   5:10.48 7 cpuhog
 6573 root      20   0    4048    712    636 R 100.0 0.004   5:07.47 5 cpuhog
 6581 root      20   0    4048    696    620 R 100.0 0.004   5:07.36 1 cpuhog
 6585 root      20   0    4048    812    736 R 100.0 0.005   5:08.14 4 cpuhog
 6589 root      20   0    4048    712    636 R 100.0 0.004   5:06.42 6 cpuhog
 6577 root      20   0    4048    720    644 R 99.80 0.005   5:06.52 3 cpuhog
 6593 root      20   0    4048    728    652 R 99.60 0.005   5:04.25 0 cpuhog
 6755 mikeg     20   0 2714788 885324 179196 S 19.96 5.544   2:14.36 2 Web Content
 6620 mikeg     20   0 2318348 312336 145044 S 8.383 1.956   0:51.51 2 firefox
 3190 root      20   0  323944  71704  42368 S 3.194 0.449   0:11.90 2 Xorg
 3718 root      20   0 3009580  67112  49256 S 0.599 0.420   0:02.89 2 kwin_x11
 3761 root      20   0  769760  90740  62048 S 0.399 0.568   0:03.46 2 konsole
 3845 root       9 -11  791224  20132  14236 S 0.399 0.126   0:03.00 2 pulseaudio
 3722 root      20   0 3722308 172568  88088 S 0.200 1.081   0:04.35 2 plasmashel

 ------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum     delay ms | Maximum delay at      |
 ------------------------------------------------------------------------------------------------------------------------------------
  Web Content:6755      |   2864.862 ms |     7314 | avg:    0.299 ms | max:   40.374 ms | sum: 2189.472 ms | max at:    375.769240 |
  Compositor:6680       |   1889.847 ms |     4672 | avg:    0.531 ms | max:   29.092 ms | sum: 2478.559 ms | max at:    375.759405 |
  MediaPl~back #3:(13)  |   3269.777 ms |     7853 | avg:    0.218 ms | max:   19.451 ms | sum: 1711.635 ms | max at:    391.123970 |
  MediaPl~back #4:(10)  |   1472.986 ms |     8189 | avg:    0.236 ms | max:   18.653 ms | sum: 1933.886 ms | max at:    376.124211 |
  MediaPl~back #1:(9)   |    601.788 ms |     6598 | avg:    0.247 ms | max:   17.823 ms | sum: 1627.852 ms | max at:    401.122567 |
  firefox:6620          |    303.181 ms |     6232 | avg:    0.111 ms | max:   15.602 ms | sum:  691.865 ms | max at:    385.078558 |
  Socket Thread:6639    |    667.537 ms |     4806 | avg:    0.069 ms | max:   12.638 ms | sum:  329.387 ms | max at:    380.827323 |
  MediaPD~oder #1:6835  |    154.737 ms |     1592 | avg:    0.700 ms | max:   10.139 ms | sum: 1113.688 ms | max at:    392.575370 |
  MediaTimer #1:6828    |     42.660 ms |     5250 | avg:    0.575 ms | max:    9.845 ms | sum: 3018.994 ms | max at:    380.823677 |
  MediaPD~oder #2:6840  |    150.822 ms |     1583 | avg:    0.703 ms | max:    9.639 ms | sum: 1112.962 ms | max at:    380.823741 |
...
Patrick Bellasi Dec. 15, 2017, 4:13 p.m. UTC | #3
Hi Mike,

On 13-Dec 18:56, Mike Galbraith wrote:
> On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> > This is a respin of:
> >    https://lkml.org/lkml/2017/11/9/546
> > which has been rebased on v4.15-rc2 to have util_est now working on top
> > of the recent PeterZ's:
> >    [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> > 
> > The aim of this series is to improve some PELT behaviors to make it a
> > better fit for the scheduling of tasks common in embedded mobile
> > use-cases, without affecting other classes of workloads.
> 
> I thought perhaps this patch set would improve the below behavior, but
> alas it does not.  That's 3 instances of firefox playing youtube clips
> being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
> serializes the threaded desktop, making that threading kinda pointless,
> and CFS not all that fair.

Perhaps I don't completely get your use-case.
Are the cpuhog threads pinned to a CPU, or do they just happen to
always run on the same CPU?

I guess you would expect the three Firefox instances to be spread on
different CPUs. But whether this is possible also depends on the
specific task composition generated by Firefox, doesn't it?

Being a video playback pipeline I would not be surprised to see that
most of the time we actually have only 1 or 2 tasks RUNNABLE, while
the others are sleeping... and if an HW decoder is involved, even if
you have three instances running you likely get only one pipeline
active at each time...

If that's the case, why should CFS move Firefox tasks around?


>  6569 root      20   0    4048    704    628 R 100.0 0.004   5:10.48 7 cpuhog
>  6573 root      20   0    4048    712    636 R 100.0 0.004   5:07.47 5 cpuhog
>  6581 root      20   0    4048    696    620 R 100.0 0.004   5:07.36 1 cpuhog
>  6585 root      20   0    4048    812    736 R 100.0 0.005   5:08.14 4 cpuhog
>  6589 root      20   0    4048    712    636 R 100.0 0.004   5:06.42 6 cpuhog
>  6577 root      20   0    4048    720    644 R 99.80 0.005   5:06.52 3 cpuhog
>  6593 root      20   0    4048    728    652 R 99.60 0.005   5:04.25 0 cpuhog
>  6755 mikeg     20   0 2714788 885324 179196 S 19.96 5.544   2:14.36 2 Web Content
>  6620 mikeg     20   0 2318348 312336 145044 S 8.383 1.956   0:51.51 2 firefox
>  3190 root      20   0  323944  71704  42368 S 3.194 0.449   0:11.90 2 Xorg
>  3718 root      20   0 3009580  67112  49256 S 0.599 0.420   0:02.89 2 kwin_x11
>  3761 root      20   0  769760  90740  62048 S 0.399 0.568   0:03.46 2 konsole
>  3845 root       9 -11  791224  20132  14236 S 0.399 0.126   0:03.00 2 pulseaudio
>  3722 root      20   0 3722308 172568  88088 S 0.200 1.081   0:04.35 2 plasmashel

Is this always happening... or do Firefox tasks sometimes get a chance
to run on CPUs other than CPU2?

Could it be that, looking at the htop output, we just don't see these small opportunities?


>  ------------------------------------------------------------------------------------------------------------------------------------
>   Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum     delay ms | Maximum delay at      |
>  ------------------------------------------------------------------------------------------------------------------------------------
>   Web Content:6755      |   2864.862 ms |     7314 | avg:    0.299 ms | max:   40.374 ms | sum: 2189.472 ms | max at:    375.769240 |
>   Compositor:6680       |   1889.847 ms |     4672 | avg:    0.531 ms | max:   29.092 ms | sum: 2478.559 ms | max at:    375.759405 |
>   MediaPl~back #3:(13)  |   3269.777 ms |     7853 | avg:    0.218 ms | max:   19.451 ms | sum: 1711.635 ms | max at:    391.123970 |
>   MediaPl~back #4:(10)  |   1472.986 ms |     8189 | avg:    0.236 ms | max:   18.653 ms | sum: 1933.886 ms | max at:    376.124211 |
>   MediaPl~back #1:(9)   |    601.788 ms |     6598 | avg:    0.247 ms | max:   17.823 ms | sum: 1627.852 ms | max at:    401.122567 |
>   firefox:6620          |    303.181 ms |     6232 | avg:    0.111 ms | max:   15.602 ms | sum:  691.865 ms | max at:    385.078558 |
>   Socket Thread:6639    |    667.537 ms |     4806 | avg:    0.069 ms | max:   12.638 ms | sum:  329.387 ms | max at:    380.827323 |
>   MediaPD~oder #1:6835  |    154.737 ms |     1592 | avg:    0.700 ms | max:   10.139 ms | sum: 1113.688 ms | max at:    392.575370 |
>   MediaTimer #1:6828    |     42.660 ms |     5250 | avg:    0.575 ms | max:    9.845 ms | sum: 3018.994 ms | max at:    380.823677 |
>   MediaPD~oder #2:6840  |    150.822 ms |     1583 | avg:    0.703 ms | max:    9.639 ms | sum: 1112.962 ms | max at:    380.823741 |

How do you get these stats?

It's definitely an interesting use-case; however, I think it's out of
the scope of util_est.

Regarding the specific statement "CFS not all that fair", I would say
that the fairness of CFS is defined and has to be evaluated within a
single CPU and on a temporal (not clock cycles) basis.

AFAIK, vruntime progresses based on elapsed time, thus you can have
two tasks which get the same slice time but consume it at different
frequencies. In this case too we are not that fair, are we?
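
For reference, this is roughly what update_curr() does in
kernel/sched/fair.c (fragment, details trimmed):

   /* vruntime advances with weighted wall-clock execution time,
    * not with the amount of work (cycles) actually retired. */
   delta_exec = now - curr->exec_start;
   curr->vruntime += calc_delta_fair(delta_exec, curr);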

In the end, it all boils down to some (as much as possible)
low-overhead heuristics. Thus, a proper description of a
reproducible use-case can help in improving them.

Can we model your use-case using a simple rt-app configuration?

This would likely help to have a simple and reproducible testing
scenario to better understand where the issue actually is...
maybe by looking at an execution trace.

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Mike Galbraith Dec. 15, 2017, 8:23 p.m. UTC | #4
On Fri, 2017-12-15 at 16:13 +0000, Patrick Bellasi wrote:
> Hi Mike,
> 
> On 13-Dec 18:56, Mike Galbraith wrote:
> > On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> > > This is a respin of:
> > >    https://lkml.org/lkml/2017/11/9/546
> > > which has been rebased on v4.15-rc2 to have util_est now working on top
> > > of the recent PeterZ's:
> > >    [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> > > 
> > > The aim of this series is to improve some PELT behaviors to make it a
> > > better fit for the scheduling of tasks common in embedded mobile
> > > use-cases, without affecting other classes of workloads.
> > 
> > I thought perhaps this patch set would improve the below behavior, but
> > alas it does not.  That's 3 instances of firefox playing youtube clips
> > being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
> > serializes the threaded desktop, making that threading kinda pointless,
> > and CFS not all that fair.
> 
> Perhaps I don't completely get your use-case.
> Are the cpuhog threads pinned to a CPU, or do they just happen to
> always run on the same CPU?

Nothing is pinned.

> I guess you would expect the three Firefox instances to be spread on
> different CPUs. But whether this is possible also depends on the
> specific task composition generated by Firefox, doesn't it?

It depends on load balancing.  We're letting firefox threads stack up
to 6 deep while single hogs dominate the box.

> Being a video playback pipeline I would not be surprised to see that
> most of the time we actually have only 1 or 2 tasks RUNNABLE, while
> the others are sleeping... and if an HW decoder is involved, even if
> you have three instances running you likely get only one pipeline
> active at each time...
> 
> If that's the case, why should CFS move Firefox tasks around?

No, while they are indeed ~fairly synchronous, there is overlap.  If
there were not, there would be no wait time being accumulated. The load
wants to consume roughly one full core worth, but to achieve that, it
needs access to more than one runqueue, which we are not facilitating.

> Is this always happening... or do Firefox tasks sometimes get a chance
> to run on CPUs other than CPU2?

There is some escape going on, but not enough for the load to get its
fair share.  I have it sort of fixed up locally, but while the patch
keeps changing, it's not getting any prettier, nor is it particularly
interested in letting me keep some performance gains I want, so...

> How do you get these stats?

perf sched record/perf sched lat.  I twiddled it to output accumulated
wait times as well for convenience; stock only shows max.  See below.
If you play with perf sched, you'll notice some... oddities about it.

> It's definitely an interesting use-case; however, I think it's out of
> the scope of util_est.

Yeah.  If I had been less busy and read the whole thing, I wouldn't
have taken it out for a spin.

> Regarding the specific statement "CFS not all that fair", I would say
> that the fairness of CFS is defined and has to be evaluated within a
> single CPU and on a temporal (not clock cycles) basis.

No, that doesn't really fly.  In fact, in the group scheduling code, we
actively pursue box-wide fairness.  PELT is going a bit too far ATM.

Point: if you think it's OK to serialize these firefox threads, would
you still think so if those were kernel threads instead?  Serializing
your kernel is a clear fail, but unpinned kthreads can be stacked up
just as effectively as those browser threads are, eat needless wakeup
latency and pass it on.

> AFAIK, vruntime progresses based on elapsed time, thus you can have
> two tasks which get the same slice time but consume it at different
> frequencies. In this case too we are not that fair, are we?

Time slices don't really exist as a concrete quantum in CFS.  There's
vruntime equalization, and that's it.

> In the end, it all boils down to some (as much as possible)
> low-overhead heuristics. Thus, a proper description of a
> reproducible use-case can help in improving them.

Nah, heuristics are fickle beasts, they WILL knife you in the back,
it's just a question of how often, and how deep.

> Can we model your use-case using a simple rt-app configuration?

No idea.

> This would likely help to have a simple and reproducible testing
> scenario to better understand where the issue actually is...
> maybe by looking at an execution trace.

It should be reproducible by anyone, just fire up NR_CPUS-1 pure hogs,
point firefox at youtube, open three clips in tabs, watch tasks stack.
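
(For completeness, a hog here can be as dumb as this sketch:)

   /* hog.c: pure userspace spinner; run NR_CPUS-1 instances. */
   int main(void)
   {
           for (;;)
                   ;
   }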

Root cause IMHO is PELT having grown too aggressive.  SIS was made more
aggressive to compensate, but when you slam that door you get the full
PELT impact, and it stings, as does too aggressive bouncing when you
leave the escape hatch open.  Sticky wicket that.  Both of those want a
gentle wrap upside the head, as they're both acting a bit nutty.

	-Mike

---
 tools/perf/builtin-sched.c |   34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -212,6 +212,7 @@ struct perf_sched {
 	u64		 run_avg;
 	u64		 all_runtime;
 	u64		 all_count;
+	u64		 all_lat;
 	u64		 cpu_last_switched[MAX_CPUS];
 	struct rb_root	 atom_root, sorted_atom_root, merged_atom_root;
 	struct list_head sort_list, cmp_pid;
@@ -1286,6 +1287,7 @@ static void output_lat_thread(struct per
 
 	sched->all_runtime += work_list->total_runtime;
 	sched->all_count   += work_list->nb_atoms;
+	sched->all_lat += work_list->total_lat;
 
 	if (work_list->num_merged > 1)
 		ret = printf("  %s:(%d) ", thread__comm_str(work_list->thread), work_list->num_merged);
@@ -1298,10 +1300,11 @@ static void output_lat_thread(struct per
 	avg = work_list->total_lat / work_list->nb_atoms;
 	timestamp__scnprintf_usec(work_list->max_lat_at, max_lat_at, sizeof(max_lat_at));
 
-	printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: %13s s\n",
+	printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | sum:%9.3f ms | max at: %13s s\n",
 	      (double)work_list->total_runtime / NSEC_PER_MSEC,
 		 work_list->nb_atoms, (double)avg / NSEC_PER_MSEC,
 		 (double)work_list->max_lat / NSEC_PER_MSEC,
+		 (double)work_list->total_lat / NSEC_PER_MSEC,
 		 max_lat_at);
 }
 
@@ -1347,6 +1350,16 @@ static int max_cmp(struct work_atoms *l,
 	return 0;
 }
 
+static int sum_cmp(struct work_atoms *l, struct work_atoms *r)
+{
+	if (l->total_lat < r->total_lat)
+		return -1;
+	if (l->total_lat > r->total_lat)
+		return 1;
+
+	return 0;
+}
+
 static int switch_cmp(struct work_atoms *l, struct work_atoms *r)
 {
 	if (l->nb_atoms < r->nb_atoms)
@@ -1378,6 +1391,10 @@ static int sort_dimension__add(const cha
 		.name = "max",
 		.cmp  = max_cmp,
 	};
+	static struct sort_dimension sum_sort_dimension = {
+		.name = "sum",
+		.cmp  = sum_cmp,
+	};
 	static struct sort_dimension pid_sort_dimension = {
 		.name = "pid",
 		.cmp  = pid_cmp,
@@ -1394,6 +1411,7 @@ static int sort_dimension__add(const cha
 		&pid_sort_dimension,
 		&avg_sort_dimension,
 		&max_sort_dimension,
+		&sum_sort_dimension,
 		&switch_sort_dimension,
 		&runtime_sort_dimension,
 	};
@@ -3090,9 +3108,9 @@ static int perf_sched__lat(struct perf_s
 	perf_sched__merge_lat(sched);
 	perf_sched__sort_lat(sched);
 
-	printf("\n -----------------------------------------------------------------------------------------------------------------\n");
-	printf("  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |\n");
-	printf(" -----------------------------------------------------------------------------------------------------------------\n");
+	printf("\n ------------------------------------------------------------------------------------------------------------------------------------\n");
+	printf("  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum     delay ms | Maximum delay at       |\n");
+	printf(" ------------------------------------------------------------------------------------------------------------------------------------\n");
 
 	next = rb_first(&sched->sorted_atom_root);
 
@@ -3105,11 +3123,11 @@ static int perf_sched__lat(struct perf_s
 		thread__zput(work_list->thread);
 	}
 
-	printf(" -----------------------------------------------------------------------------------------------------------------\n");
-	printf("  TOTAL:                |%11.3f ms |%9" PRIu64 " |\n",
-		(double)sched->all_runtime / NSEC_PER_MSEC, sched->all_count);
+	printf(" ------------------------------------------------------------------------------------------------------------\n");
+	printf("  TOTAL:                |%11.3f ms |%9" PRIu64 " |                                     |%14.3f ms |\n",
+		(double)sched->all_runtime / NSEC_PER_MSEC, sched->all_count, (double)sched->all_lat / NSEC_PER_MSEC);
 
-	printf(" ---------------------------------------------------\n");
+	printf(" ------------------------------------------------------------------------------------------------------------\n");
 
 	print_bad_events(sched);
 	printf("\n");
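
(With this applied, sorting by the new column should be possible via the
existing -s option, e.g. "perf sched latency -s sum" after a "perf sched
record" -- the "sum" key being what the patch registers in
sort_dimension__add; the invocation is shown for illustration.)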
Rafael J. Wysocki Dec. 16, 2017, 2:35 a.m. UTC | #5
On Tuesday, December 5, 2017 6:10:18 PM CET Patrick Bellasi wrote:
> When schedutil looks at the CPU utilization, the current PELT value for
> that CPU is returned straight away. In certain scenarios this can have
> undesired side effects and delays on frequency selection.
> 
> For example, since the task utilization is decayed at wakeup time, a
> long sleeping big task newly enqueued does not add immediately a
> significant contribution to the target CPU. This introduces some latency
> before schedutil will be able to detect the best frequency required by
> that task.
> 
> Moreover, the PELT signal build-up time is a function of the current
> frequency, because of the scale invariant load tracking support. Thus,
> starting from a lower frequency, the utilization build-up time will
> increase even more and further delay the selection of the actual
> frequency which better serves the task requirements.
> 
> In order to reduce these kinds of latencies, this patch integrates the
> usage of the CPU's estimated utilization in the sugov_get_util function.
> 
> The estimated utilization of a CPU is defined to be the maximum between
> its PELT's utilization and the sum of the estimated utilization of each
> currently RUNNABLE task on that CPU.
> This allows us to properly represent the expected utilization of a CPU which,
> for example, has just got a big task running after a long sleep period,
> and ultimately it allows us to select the best frequency to run a task
> right after it wakes up.
> 
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> Cc: Paul Turner <pjt@google.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
> 
> ---
> Changes v1->v2:
>  - rebase on top of v4.15-rc2
>  - tested that overhauled PELT code does not affect the util_est
> ---
>  kernel/sched/cpufreq_schedutil.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 2f52ec0f1539..465430d99440 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -183,7 +183,11 @@ static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
>  
>  	cfs_max = arch_scale_cpu_capacity(NULL, cpu);
>  
> -	*util = min(rq->cfs.avg.util_avg, cfs_max);
> +	*util = rq->cfs.avg.util_avg;

I would use a local variable here.

That *util everywhere looks a bit dirtyish.

> +	if (sched_feat(UTIL_EST))
> +		*util = max(*util, rq->cfs.util_est_runnable);
> +	*util = min(*util, cfs_max);
> +
>  	*max = cfs_max;
>  }
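
For illustration, the suggested local-variable variant could look
something like this (a sketch based on the hunk above, not a tested
patch):

   static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
   {
   	struct rq *rq = cpu_rq(cpu);
   	unsigned long cfs_max = arch_scale_cpu_capacity(NULL, cpu);
   	unsigned long util_cfs = rq->cfs.avg.util_avg;

   	/* Consider the estimated utilization, then clamp to capacity */
   	if (sched_feat(UTIL_EST))
   		util_cfs = max(util_cfs, rq->cfs.util_est_runnable);

   	*util = min(util_cfs, cfs_max);
   	*max = cfs_max;
   }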
Mike Galbraith Dec. 16, 2017, 6:37 a.m. UTC | #6
On Fri, 2017-12-15 at 21:23 +0100, Mike Galbraith wrote:
> 
> Point: if you think it's OK to serialize these firefox threads, would
> you still think so if those were kernel threads instead?  Serializing
> your kernel is a clear fail, but unpinned kthreads can be stacked up
> just as effectively as those browser threads are, eat needless wakeup
> latency and pass it on.

FWIW, somewhat cheezy example of that below.

(later, /me returns to [apparently endless] squabble w. PELT/SIS;)

bonnie in nfs mount of own box competing with 7 hogs:
 ------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Sum     delay ms | Maximum delay at      |
 ------------------------------------------------------------------------------------------------------------------------------------
  kworker/3:0:29        |    630.078 ms |    89669 | avg:    0.011 ms | max:  102.340 ms | sum:  962.919 ms | max at:    310.501277 |
  kworker/3:1H:464      |   1179.868 ms |   101944 | avg:    0.005 ms | max:  102.232 ms | sum:  480.915 ms | max at:    310.501273 |
  kswapd0:78            |   2662.230 ms |     1661 | avg:    0.128 ms | max:   93.935 ms | sum:  213.258 ms | max at:    310.503419 |
  nfsd:2039             |   3257.143 ms |    78448 | avg:    0.112 ms | max:   86.039 ms | sum: 8795.767 ms | max at:    258.847140 |
  nfsd:2038             |   3185.730 ms |    76253 | avg:    0.113 ms | max:   78.348 ms | sum: 8580.676 ms | max at:    258.831370 |
  nfsd:2042             |   3256.554 ms |    81423 | avg:    0.110 ms | max:   74.941 ms | sum: 8929.015 ms | max at:    288.397203 |
  nfsd:2040             |   3314.826 ms |    80396 | avg:    0.105 ms | max:   51.039 ms | sum: 8471.816 ms | max at:    363.870078 |
  nfsd:2036             |   3058.867 ms |    70460 | avg:    0.115 ms | max:   44.629 ms | sum: 8092.319 ms | max at:    250.074253 |
  nfsd:2037             |   3113.592 ms |    74276 | avg:    0.115 ms | max:   43.294 ms | sum: 8556.110 ms | max at:    310.443722 |
  konsole:4013          |    402.509 ms |      894 | avg:    0.148 ms | max:   38.129 ms | sum:  132.050 ms | max at:    332.156495 |
  haveged:497           |     11.831 ms |     1224 | avg:    0.104 ms | max:   37.575 ms | sum:  127.706 ms | max at:    350.669645 |
  nfsd:2043             |   3316.033 ms |    78303 | avg:    0.115 ms | max:   36.511 ms | sum: 8995.138 ms | max at:    248.576108 |
  nfsd:2035             |   3064.108 ms |    67413 | avg:    0.115 ms | max:   28.221 ms | sum: 7746.306 ms | max at:    313.785682 |
  bash:7022             |      0.342 ms |        1 | avg:   22.959 ms | max:   22.959 ms | sum:   22.959 ms | max at:    262.258960 |
  kworker/u16:4:354     |   2073.383 ms |     1550 | avg:    0.050 ms | max:   21.203 ms | sum:   77.185 ms | max at:    332.220678 |
  kworker/4:3:6975      |   1189.868 ms |   115776 | avg:    0.018 ms | max:   20.856 ms | sum: 2071.894 ms | max at:    348.142757 |
  kworker/2:4:6981      |    335.895 ms |    26617 | avg:    0.023 ms | max:   20.726 ms | sum:  625.102 ms | max at:    248.522083 |
  bash:7021             |      0.517 ms |        2 | avg:   10.363 ms | max:   20.726 ms | sum:   20.727 ms | max at:    262.235708 |
  ksoftirqd/2:22        |     65.718 ms |      998 | avg:    0.138 ms | max:   19.072 ms | sum:  137.827 ms | max at:    332.221676 |
  kworker/7:3:6969      |    625.724 ms |    84153 | avg:    0.010 ms | max:   18.838 ms | sum:  876.603 ms | max at:    264.188983 |
  bonnie:6965           |  79637.998 ms |    35434 | avg:    0.007 ms | max:   18.719 ms | sum:  256.748 ms | max at:    331.299867 |
Patrick Bellasi Dec. 18, 2017, 10:48 a.m. UTC | #7
Hi Rafael,

On 16-Dec 03:35, Rafael J. Wysocki wrote:
> On Tuesday, December 5, 2017 6:10:18 PM CET Patrick Bellasi wrote:

[...]

> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 2f52ec0f1539..465430d99440 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -183,7 +183,11 @@ static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
> >  
> >  	cfs_max = arch_scale_cpu_capacity(NULL, cpu);
> >  
> > -	*util = min(rq->cfs.avg.util_avg, cfs_max);
> > +	*util = rq->cfs.avg.util_avg;
> 
> I would use a local variable here.
> 
> That *util everywhere looks a bit dirtyish.

Yes, right... will update for the next respin.

> 
> > +	if (sched_feat(UTIL_EST))
> > +		*util = max(*util, rq->cfs.util_est_runnable);
> > +	*util = min(*util, cfs_max);
> > +
> >  	*max = cfs_max;
> >  }

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi