Message ID | 5eba2fb4af9ebc7396101bb9bd6c8aa9c8af0710.1571899508.git.viresh.kumar@linaro.org
---|---
State | Superseded
Series | sched/fair: Make sched-idle cpu selection consistent throughout
Hi Viresh,

On 10/24/19 12:15 PM, Viresh Kumar wrote:
> There are instances where we keep searching for an idle CPU despite
> having a sched-idle cpu already (in find_idlest_group_cpu(),
> select_idle_smt() and select_idle_cpu()) and then there are places
> where we don't necessarily do that and return a sched-idle cpu as soon
> as we find one (in select_idle_sibling()). This looks a bit
> inconsistent and it may be worth having the same policy everywhere.
>
> On the other hand, choosing a sched-idle cpu over an idle one should
> be beneficial from a performance point of view as well, as we don't
> need to bring the cpu back from a deep idle state, which is quite a
> time consuming process and delays the scheduling of the newly woken
> task.
>
> This patch tries to simplify the code around sched-idle cpu selection
> and make it consistent throughout.
>
> FWIW, tests were done with the help of rt-app (8 SCHED_OTHER and 5
> SCHED_IDLE tasks, not bound to any cpu) on an ARM platform
> (octa-core), and no significant difference in scheduling latency of
> SCHED_OTHER tasks was found.
>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---

[...]

> @@ -5755,13 +5749,11 @@ static int select_idle_smt(struct task_struct *p, int target)
>      for_each_cpu(cpu, cpu_smt_mask(target)) {
>          if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>              continue;
> -        if (available_idle_cpu(cpu))
> +        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>              return cpu;

I guess this is the correct approach, but I am just wondering: what if
we still keep searching for a sched_idle CPU even after we have found
an available_idle CPU?

[...]

Thanks,
Parth
On 25-10-19, 12:13, Parth Shah wrote:
> Hi Viresh,
>
> On 10/24/19 12:15 PM, Viresh Kumar wrote:

[...]

> > @@ -5755,13 +5749,11 @@ static int select_idle_smt(struct task_struct *p, int target)
> >      for_each_cpu(cpu, cpu_smt_mask(target)) {
> >          if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> >              continue;
> > -        if (available_idle_cpu(cpu))
> > +        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> >              return cpu;
>
> I guess this is the correct approach, but I am just wondering: what if
> we still keep searching for a sched_idle CPU even after we have found
> an available_idle CPU?

I do believe selecting a sched-idle CPU should almost always be better
(performance wise), unless we have a strong argument against it. And
anyway, the load balancer will get triggered at a later point of time
and will pull these newly woken tasks away to idle CPUs. The advantage
we get out of it is that the tasks get serviced a bit earlier when they
first get queued.

It is really up to the maintainers to see what kind of policy we want
to adopt here; it is not a choice I can make :)

--
viresh
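To make the two policies under discussion concrete, here is a minimal
userspace sketch, not kernel code: cpu_fully_idle() and cpu_sched_idle()
are invented stand-ins for available_idle_cpu() and sched_idle_cpu().
It contrasts "return the first sched-idle or idle CPU" with "prefer a
fully idle CPU and remember a sched-idle fallback":

#include <stdbool.h>

/* Invented stand-ins for the kernel predicates; assumptions, not real APIs. */
extern bool cpu_fully_idle(int cpu);   /* plays the role of available_idle_cpu() */
extern bool cpu_sched_idle(int cpu);   /* plays the role of sched_idle_cpu() */

/* Policy after the patch: a sched-idle CPU is as good as an idle one,
 * so return the first hit and stop scanning.
 */
static int pick_cpu_first_hit(const int *cpus, int n)
{
    for (int i = 0; i < n; i++)
        if (cpu_fully_idle(cpus[i]) || cpu_sched_idle(cpus[i]))
            return cpus[i];
    return -1;
}

/* Policy before the patch (select_idle_smt()/select_idle_cpu()): prefer
 * a fully idle CPU, but remember the first sched-idle one as a fallback.
 */
static int pick_cpu_prefer_idle(const int *cpus, int n)
{
    int si_cpu = -1;

    for (int i = 0; i < n; i++) {
        if (cpu_fully_idle(cpus[i]))
            return cpus[i];
        if (si_cpu == -1 && cpu_sched_idle(cpus[i]))
            si_cpu = cpus[i];
    }
    return si_cpu;  /* may still be -1 if nothing suitable was found */
}

The patch replaces the second pattern with the first at all call sites,
which is what shortens the search loop.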
On 10/25/19 1:41 PM, Viresh Kumar wrote:
> On 25-10-19, 12:13, Parth Shah wrote:
>> I guess this is the correct approach, but I am just wondering: what
>> if we still keep searching for a sched_idle CPU even after we have
>> found an available_idle CPU?
>
> I do believe selecting a sched-idle CPU should almost always be better
> (performance wise), unless we have a strong argument against it. And
> anyway, the load balancer will get triggered at a later point of time
> and will pull these newly woken tasks away to idle CPUs. The advantage
> we get out of it is that the tasks get serviced a bit earlier when
> they first get queued.
>
> It is really up to the maintainers to see what kind of policy we want
> to adopt here; it is not a choice I can make :)

Yeah, I agree. I would favor selecting a sched-idle CPU first for
smaller domains like SMT, but will leave that to the experts. BTW, if
sched-idle is given priority, then maybe...

> @@ -5818,13 +5810,11 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>
>      for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
>          if (!--nr)
> -            return si_cpu;
> +            return -1;
>          if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>              continue;
> -        if (available_idle_cpu(cpu))
> +        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>              break;

...this spot could be optimized in the same way, I guess.

Thanks,
Parth
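For completeness, the inverted preference Parth hints at (return a
sched-idle CPU immediately and only fall back to a fully idle one)
would look roughly like the following sketch. As in the earlier sketch,
cpu_fully_idle() and cpu_sched_idle() are invented stand-ins, and this
variant is hypothetical, not part of the patch:

#include <stdbool.h>

extern bool cpu_fully_idle(int cpu);   /* stand-in for available_idle_cpu() */
extern bool cpu_sched_idle(int cpu);   /* stand-in for sched_idle_cpu() */

/* Hypothetical policy: prefer preempting SCHED_IDLE tasks over waking
 * an idle CPU, remembering the first fully idle CPU as a fallback.
 */
static int pick_cpu_prefer_sched_idle(const int *cpus, int n)
{
    int idle_cpu = -1;

    for (int i = 0; i < n; i++) {
        if (cpu_sched_idle(cpus[i]))
            return cpus[i];
        if (idle_cpu == -1 && cpu_fully_idle(cpus[i]))
            idle_cpu = cpus[i];
    }
    return idle_cpu;
}

Note that this variant scans the whole mask whenever no sched-idle CPU
exists, which is exactly the extra cost the "first hit" policy avoids.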
On Thu, Oct 24, 2019 at 12:15:27PM +0530, Viresh Kumar wrote:
> There are instances where we keep searching for an idle CPU despite
> having a sched-idle cpu already (in find_idlest_group_cpu(),
> select_idle_smt() and select_idle_cpu()) and then there are places
> where we don't necessarily do that and return a sched-idle cpu as soon
> as we find one (in select_idle_sibling()). This looks a bit
> inconsistent and it may be worth having the same policy everywhere.
>

This needs supporting data. find_idlest_group_cpu() is generally called
from a fork() context, where it's not particularly performance
critical. select_idle_sibling() and the helpers it uses run in wakeup
context, where it is often much more critical to wake quickly than to
find the best CPU. The biggest challenge of select_idle_sibling() is
making a "good enough" decision quickly without disrupting cache, but a
fork-intensive workload making quick decisions can overload local
domains, requiring fixing by the load balancer.

> On the other hand, choosing a sched-idle cpu over an idle one should
> be beneficial from a performance point of view as well, as we don't
> need to bring the cpu back from a deep idle state, which is quite a
> time consuming process and delays the scheduling of the newly woken
> task.
>
> This patch tries to simplify the code around sched-idle cpu selection
> and make it consistent throughout.
>
> FWIW, tests were done with the help of rt-app (8 SCHED_OTHER and 5
> SCHED_IDLE tasks, not bound to any cpu) on an ARM platform
> (octa-core), and no significant difference in scheduling latency of
> SCHED_OTHER tasks was found.
>

As the patch stands, I think a fork-intensive workload where each
process is doing small amounts of work will suffer from overloading
domains and have variable performance depending on how quickly the
load balancer reacts.

--
Mel Gorman
SUSE Labs
On Wed, 30 Oct 2019 at 22:17, Mel Gorman <mgorman@suse.de> wrote:
> As the patch stands, I think a fork-intensive workload where each
> process is doing small amounts of work will suffer from overloading
> domains and have variable performance depending on how quickly the
> load balancer reacts.

Just wanted to clarify this slightly in case it is confusing. Once a
newly forked (non SCHED_IDLE) task gets placed on a sched-idle CPU,
that CPU won't remain sched-idle anymore and we will again start
looking for a fully idle CPU. So we won't put everything on a small set
of CPUs, but rather just one SCHED_NORMAL task per CPU, unless we are
out of idle CPUs.

Do you have some specific test in mind which I can run to test this?

--
Viresh
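For background on why the CPU stops being sched-idle as soon as one
normal task lands on it: at the time of this series, sched_idle_cpu()
in kernel/sched/fair.c was defined along these lines (a close
paraphrase rather than a guaranteed verbatim quote):

/* A CPU is "sched-idle" when it is running something, but everything
 * runnable on it is SCHED_IDLE. Enqueueing a single SCHED_NORMAL task
 * makes nr_running != idle_h_nr_running, so the predicate turns false
 * and the CPU is no longer treated as a sched-idle placement target.
 */
static int sched_idle_cpu(int cpu)
{
    struct rq *rq = cpu_rq(cpu);

    return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
                    rq->nr_running);
}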
On Thu, Oct 31, 2019 at 02:42:03PM +0530, Viresh Kumar wrote:
> Just wanted to clarify this slightly in case it is confusing. Once a
> newly forked (non SCHED_IDLE) task gets placed on a sched-idle CPU,
> that CPU won't remain sched-idle anymore and we will again start
> looking for a fully idle CPU. So we won't put everything on a small
> set of CPUs, but rather just one SCHED_NORMAL task per CPU, unless we
> are out of idle CPUs.
>
> Do you have some specific test in mind which I can run to test this?
>

Nothing in particular. The git test suite for the basic fork-intensive
case (mmtests config workload-shellscripts), something fork-intensive
but relatively short-lived like a kernel build scaling the number of
build jobs (mmtests config config-workload-kerndevel), and something
fairly basic that scales the number of running jobs and is relatively
long-lived like tbench (mmtests config config-network-tbench). The
ideal, of course, is that you wrote the patch based on an observed
problem that you decided to fix.

--
Mel Gorman
SUSE Labs
On 30-10-19, 16:47, Mel Gorman wrote:
> On Thu, Oct 24, 2019 at 12:15:27PM +0530, Viresh Kumar wrote:
> > There are instances where we keep searching for an idle CPU despite
> > having a sched-idle cpu already (in find_idlest_group_cpu(),
> > select_idle_smt() and select_idle_cpu()) and then there are places
> > where we don't necessarily do that and return a sched-idle cpu as
> > soon as we find one (in select_idle_sibling()). This looks a bit
> > inconsistent and it may be worth having the same policy everywhere.
>
> This needs supporting data.

I did some more interesting tests with rt-app. It was getting difficult
to generate meaningful numbers with normal use cases, as most of the
time the prev/target/etc. CPUs were found to be completely idle and the
task was getting placed there in all cases, so there was no difference
with the sched-idle changes.

To prove the point I was making (that we can reduce task latency with
SCHED_IDLE), I created 3 different tests on my hikey board (octa-core,
2 clusters, CPUs 0-3 and 4-7). The cpufreq governor was set to
performance to avoid any side effects from CPU frequency scaling.

Test 1: 1-cfs-task:

A single SCHED_NORMAL task is pinned to CPU5 and runs for 2333 us out
of every 7777 us (which gives the cluster time to go into a deep idle
state).

Test 2: 1-cfs-1-idle-task:

A single SCHED_NORMAL task is pinned on CPU5 and a single SCHED_IDLE
task is pinned on CPU6 (to make sure cluster 1 doesn't go into a deep
idle state).

Test 3: 1-cfs-8-idle-task:

A single SCHED_NORMAL task is pinned on CPU5 and eight SCHED_IDLE tasks
are created which run forever (not pinned anywhere, so they run on all
CPUs). I checked with kernelshark that as soon as the NORMAL task
sleeps, a SCHED_IDLE task starts running on CPU5.

And here are the results on mean latency (in us), using the "st" tool:

$ st 1-cfs-task/rt-app-cfs_thread-0.log
N       min     max     sum     mean     stddev
642     90      592     197180  307.134  109.906

$ st 1-cfs-1-idle-task/rt-app-cfs_thread-0.log
N       min     max     sum     mean     stddev
642     67      311     113850  177.336  41.4251

$ st 1-cfs-8-idle-task/rt-app-cfs_thread-0.log
N       min     max     sum     mean     stddev
643     29      173     41364   64.3297  13.2344

The mean latency when:
- we need to wake up from a deep idle state is 307 us
- we need to wake up from a shallow idle state is 177 us
- we need to preempt a SCHED_IDLE task is 64 us

So the theory looks correct: we should probably prefer SCHED_IDLE CPUs
both for power and for performance :)

> find_idlest_group_cpu() is generally called from a fork() context,
> where it's not particularly performance critical.
> select_idle_sibling() and the helpers it uses run in wakeup context,
> where it is often much more critical to wake quickly than to find the
> best CPU.

I agree. We must find the best CPU here. But won't a SCHED_IDLE cpu be
the best? After all, that is the one in the shallowest idle state and
so better for power :)

--
viresh
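For anyone wanting to approximate these tests without rt-app, here is a
minimal userspace sketch of the Test 1 workload (busy for 2333 us out
of every 7777 us, pinned to CPU5). It is a simplification of what
rt-app does and only illustrates the setup; for the Test 2 companion
task, one would additionally switch to SCHED_IDLE via
sched_setscheduler(0, SCHED_IDLE, &(struct sched_param){ 0 }).

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    cpu_set_t set;
    struct timespec next;

    /* Pin the task to CPU5, as in the tests above. */
    CPU_ZERO(&set);
    CPU_SET(5, &set);
    if (sched_setaffinity(0, sizeof(set), &set))
        perror("sched_setaffinity");

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (;;) {
        int64_t end = now_ns() + 2333 * 1000LL;

        while (now_ns() < end)
            ;   /* ~2333 us of busy work */

        /* Sleep until the next 7777 us period boundary, giving the
         * cluster a chance to enter an idle state. */
        next.tv_nsec += 7777 * 1000L;
        while (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}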
On Fri, 8 Nov 2019 at 12:32, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> [...]
>
> The mean latency when:
> - we need to wake up from a deep idle state is 307 us
> - we need to wake up from a shallow idle state is 177 us
> - we need to preempt a SCHED_IDLE task is 64 us
>
> So the theory looks correct: we should probably prefer SCHED_IDLE CPUs
> both for power and for performance :)

It makes sense to me to consider a CPU that runs only SCHED_IDLE tasks
as an idle CPU with the shortest wakeup latency and the most recently
idled timestamp. This seems to be confirmed by the data above.

The SCHED_IDLE tasks would be somewhat penalized, because they can now
be preempted even when a really idle CPU exists, but such SCHED_IDLE
tasks don't have any requirement other than not delaying the wakeup of
NORMAL tasks.

Also, this simplifies and shortens the search loop.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a81c36472822..bb367f48c1ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5545,7 +5545,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
     unsigned int min_exit_latency = UINT_MAX;
     u64 latest_idle_timestamp = 0;
     int least_loaded_cpu = this_cpu;
-    int shallowest_idle_cpu = -1, si_cpu = -1;
+    int shallowest_idle_cpu = -1;
     int i;

     /* Check if we have any choice: */
@@ -5554,6 +5554,9 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this

     /* Traverse only the allowed CPUs */
     for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+        if (sched_idle_cpu(i))
+            return i;
+
         if (available_idle_cpu(i)) {
             struct rq *rq = cpu_rq(i);
             struct cpuidle_state *idle = idle_get_state(rq);
@@ -5576,12 +5579,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
                 latest_idle_timestamp = rq->idle_stamp;
                 shallowest_idle_cpu = i;
             }
-        } else if (shallowest_idle_cpu == -1 && si_cpu == -1) {
-            if (sched_idle_cpu(i)) {
-                si_cpu = i;
-                continue;
-            }
-
+        } else if (shallowest_idle_cpu == -1) {
             load = cpu_load(cpu_rq(i));
             if (load < min_load) {
                 min_load = load;
@@ -5590,11 +5588,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
         }
     }

-    if (shallowest_idle_cpu != -1)
-        return shallowest_idle_cpu;
-    if (si_cpu != -1)
-        return si_cpu;
-    return least_loaded_cpu;
+    return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
 }

 static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
@@ -5747,7 +5741,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, int target)
 {
-    int cpu, si_cpu = -1;
+    int cpu;

     if (!static_branch_likely(&sched_smt_present))
         return -1;
@@ -5755,13 +5749,11 @@ static int select_idle_smt(struct task_struct *p, int target)
     for_each_cpu(cpu, cpu_smt_mask(target)) {
         if (!cpumask_test_cpu(cpu, p->cpus_ptr))
             continue;
-        if (available_idle_cpu(cpu))
+        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
             return cpu;
-        if (si_cpu == -1 && sched_idle_cpu(cpu))
-            si_cpu = cpu;
     }

-    return si_cpu;
+    return -1;
 }

 #else /* CONFIG_SCHED_SMT */
@@ -5790,7 +5782,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
     u64 time, cost;
     s64 delta;
     int this = smp_processor_id();
-    int cpu, nr = INT_MAX, si_cpu = -1;
+    int cpu, nr = INT_MAX;

     this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
     if (!this_sd)
@@ -5818,13 +5810,11 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

     for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
         if (!--nr)
-            return si_cpu;
+            return -1;
         if (!cpumask_test_cpu(cpu, p->cpus_ptr))
             continue;
-        if (available_idle_cpu(cpu))
+        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
             break;
-        if (si_cpu == -1 && sched_idle_cpu(cpu))
-            si_cpu = cpu;
     }

     time = cpu_clock(this) - time;
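One readability observation about the diff: all three rewritten call
sites now repeat the available_idle_cpu(cpu) || sched_idle_cpu(cpu)
predicate. A hypothetical follow-up (not part of this patch, helper
name invented) could factor it out:

/* Hypothetical helper, not in the patch: the CPU is a valid immediate
 * placement target if it is fully idle or runs only SCHED_IDLE tasks.
 */
static inline bool available_or_sched_idle_cpu(int cpu)
{
    return available_idle_cpu(cpu) || sched_idle_cpu(cpu);
}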
There are instances where we keep searching for an idle CPU despite
having a sched-idle cpu already (in find_idlest_group_cpu(),
select_idle_smt() and select_idle_cpu()) and then there are places
where we don't necessarily do that and return a sched-idle cpu as soon
as we find one (in select_idle_sibling()). This looks a bit
inconsistent and it may be worth having the same policy everywhere.

On the other hand, choosing a sched-idle cpu over an idle one should be
beneficial from a performance point of view as well, as we don't need
to bring the cpu back from a deep idle state, which is quite a time
consuming process and delays the scheduling of the newly woken task.

This patch tries to simplify the code around sched-idle cpu selection
and make it consistent throughout.

FWIW, tests were done with the help of rt-app (8 SCHED_OTHER and 5
SCHED_IDLE tasks, not bound to any cpu) on an ARM platform (octa-core),
and no significant difference in scheduling latency of SCHED_OTHER
tasks was found.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/sched/fair.c | 34 ++++++++++++----------------------
 1 file changed, 12 insertions(+), 22 deletions(-)

--
2.21.0.rc0.269.g1a574e7a288b