From patchwork Tue May 3 14:56:43 2016
X-Patchwork-Submitter: Morten Rasmussen
X-Patchwork-Id: 67069
Date: Tue, 3 May 2016 15:56:43 +0100
From: Morten Rasmussen
To: Peter Zijlstra
Cc: Dietmar Eggemann, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/7] sched/fair: Correct unit of load_above_capacity
Message-ID: <20160503145641.GA759@e105550-lin.cambridge.arm.com>
In-Reply-To: <20160503105233.GN3430@twins.programming.kicks-ass.net>
References: <1461958364-675-1-git-send-email-dietmar.eggemann@arm.com>
 <1461958364-675-4-git-send-email-dietmar.eggemann@arm.com>
 <20160503105233.GN3430@twins.programming.kicks-ass.net>

On Tue, May 03, 2016 at 12:52:33PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 29, 2016 at 08:32:40PM +0100, Dietmar Eggemann wrote:
> > From: Morten Rasmussen
> > 
> > In calculate_imbalance() load_above_capacity currently has the unit
> > [load] while it is used as being [load/capacity]. Not only is it
> > wrong, it also makes it unlikely that load_above_capacity is ever
> > used, as the subsequent code picks the smaller of load_above_capacity
> > and the avg_load difference between the local and the busiest
> > sched_groups.
> > 
> > This patch ensures that load_above_capacity has the right unit
> > [load/capacity].
> > 
> > Signed-off-by: Morten Rasmussen
> > ---
> >  kernel/sched/fair.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bc19fee94f39..c066574cff04 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7016,9 +7016,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		    local->group_type   == group_overloaded) {
> >  		load_above_capacity = busiest->sum_nr_running *
> >  					SCHED_LOAD_SCALE;
> 
> Given smpnice and cgroups, this starting point seems somewhat random.
> Then again, without cgroups and assuming all tasks are nice-0 it is
> still true. But I think the cgroup muck is rather universal these days.

It makes sense for non-grouped always-running nice-0 tasks, and not so
much for all other scenarios. We ignore smpnice, PELT, and cgroups :-(

> The above SCHED_LOAD_SCALE is the 1 -> load metric, right?

Yes. For clarity I would probably have used NICE_0_LOAD instead. We are
establishing how much capacity sum_nr_running tasks would consume if
they are all non-grouped always-running nice-0 tasks.
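
As a quick worked example (hypothetical numbers, assuming
SCHED_LOAD_SCALE == SCHED_CAPACITY_SCALE == 1024): a two-CPU busiest
group has group_capacity == 2048, so three always-running nice-0 tasks
would consume 3 * 1024 == 3072, i.e. 1024 more than the group can
provide. With the patch that excess is then scaled to [load/capacity]:

	1024 * SCHED_LOAD_SCALE / group_capacity == 1024 * 1024 / 2048 == 512

which makes it directly comparable to the avg_load terms used in the
min() further down.
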
> 
> > -		if (load_above_capacity > busiest->group_capacity)
> > +		if (load_above_capacity > busiest->group_capacity) {
> >  			load_above_capacity -= busiest->group_capacity;

If they would consume more than we have, we return how much more;
otherwise ~0UL.

> > -		else
> > +			load_above_capacity *= SCHED_LOAD_SCALE;
> > +			load_above_capacity /= busiest->group_capacity;
> 
> And this one does load->capacity

It does load->[load/capacity]. The subsequent code does
[load/capacity]->load:

	env->imbalance = min(
-->		max_pull * busiest->group_capacity,
		(sds->avg_load - local->avg_load) * local->group_capacity
	) / SCHED_CAPACITY_SCALE;

> 
> > +		} else
> >  			load_above_capacity = ~0UL;
> >  	}
> 
> Is there anything better we can do? I know there's a fair bit of cruft
> in there; but these assumptions that SCHED_LOAD_SCALE is one task just
> bug the hell out of me.
> 
> Would it make sense to make load_balance()->detach_tasks() ensure
> busiest->sum_nr_running >= busiest->group_weight or so?

That would make a lot of sense, because that is essentially what the
code does for non-SMT systems. The story is a bit different for SMT
though, since:

	busiest->group_capacity < busiest->group_weight * SCHED_LOAD_SCALE

This could lead us to try pulling more tasks to vacate SMT hw-threads.
However, I'm not convinced it has any effect, as:

	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

is the _minimum_ of the actual load difference and the 'excess' load,
so load_above_capacity can't cause us to pull more load anyway. So I
don't see why it shouldn't be okay.
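
To put hypothetical numbers on the SMT case (assuming SCHED_LOAD_SCALE
== 1024 and an SMT-2 group whose two hw-threads together provide
group_capacity == 1178): with four always-running nice-0 tasks, the
unit-corrected excess is

	(4 * 1024 - 1178) * 1024 / 1178 ~= 2536

versus 1024 for a non-SMT two-CPU group with group_capacity == 2048. So
SMT does inflate the 'excess', but if the avg_load difference is, say,
300, then max_pull = min(300, 2536) = min(300, 1024) = 300 either way;
the inflated value never lets us pull more than the avg_load gap.
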
Something like the below should implement your proposal I think (only
boot-tested on arm64):

---8<---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e6..37fcbec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5777,6 +5777,7 @@ struct lb_env {
 	int			new_dst_cpu;
 	enum cpu_idle_type	idle;
 	long			imbalance;
+	long			imbalance_tasks;
 	/* The set of CPUs under consideration for load-balancing */
 	struct cpumask		*cpus;
 
@@ -6018,7 +6019,7 @@ static int detach_tasks(struct lb_env *env)
 
 	lockdep_assert_held(&env->src_rq->lock);
 
-	if (env->imbalance <= 0)
+	if (env->imbalance <= 0 || env->imbalance_tasks == 0)
 		return 0;
 
 	while (!list_empty(tasks)) {
@@ -6072,9 +6073,9 @@
 
 		/*
 		 * We only want to steal up to the prescribed amount of
-		 * weighted load.
+		 * weighted load or max number of tasks.
 		 */
-		if (env->imbalance <= 0)
+		if (env->imbalance <= 0 || env->imbalance_tasks <= detached)
			break;
 
 		continue;
@@ -6873,12 +6874,14 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
  */
 static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	unsigned long max_pull, load_above_capacity = ~0UL;
+	unsigned long max_pull;
 	struct sg_lb_stats *local, *busiest;
 
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	env->imbalance_tasks = ~0UL;
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -6904,23 +6907,17 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity = busiest->sum_nr_running *
-					SCHED_LOAD_SCALE;
-		if (load_above_capacity > busiest->group_capacity)
-			load_above_capacity -= busiest->group_capacity;
-		else
-			load_above_capacity = ~0UL;
+		if (busiest->sum_nr_running > busiest->group_weight)
+			env->imbalance_tasks =
+				busiest->sum_nr_running - busiest->group_weight;
 	}
 
 	/*
 	 * We're trying to get all the cpus to the average_load, so we don't
 	 * want to push ourselves above the average load, nor do we wish to
-	 * reduce the max loaded cpu below the average load. At the same time,
-	 * we also don't want to reduce the group load below the group capacity
-	 * (so that we can implement power-savings policies etc). Thus we look
-	 * for the minimum possible imbalance.
+	 * reduce the max loaded cpu below the average load.
 	 */
-	max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
+	max_pull = busiest->avg_load - sds->avg_load;
 
 	/* How much load to actually move to equalise the imbalance */
 	env->imbalance = min(
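
For anyone who wants to sanity-check the arithmetic outside the kernel
tree, here is a small stand-alone sketch (illustrative only, not kernel
code; SCHED_LOAD_SCALE and SCHED_CAPACITY_SCALE are assumed to be 1024,
and the group in main() is hypothetical). It contrasts the
unit-corrected load_above_capacity from the patch above with the
imbalance_tasks cap from the proposed alternative:

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL	/* load of one always-running nice-0 task */
#define SCHED_CAPACITY_SCALE	1024UL	/* capacity of one full-speed CPU */

/* Original patch: convert the excess from [load] to [load/capacity]. */
static unsigned long load_above_capacity(unsigned long sum_nr_running,
					 unsigned long group_capacity)
{
	unsigned long excess = sum_nr_running * SCHED_LOAD_SCALE;

	if (excess <= group_capacity)
		return ~0UL;			/* no excess: effectively unlimited */

	excess -= group_capacity;		/* [load] */
	return excess * SCHED_LOAD_SCALE / group_capacity; /* [load/capacity] */
}

/* Proposed alternative: cap the number of tasks detach_tasks() may move. */
static unsigned long imbalance_tasks(unsigned long sum_nr_running,
				     unsigned long group_weight)
{
	return sum_nr_running > group_weight ?
	       sum_nr_running - group_weight : ~0UL;
}

int main(void)
{
	/* Hypothetical busiest group: two CPUs, three always-running tasks. */
	unsigned long nr_running = 3, weight = 2;
	unsigned long capacity = weight * SCHED_CAPACITY_SCALE;

	printf("load_above_capacity = %lu [load/capacity]\n",
	       load_above_capacity(nr_running, capacity));
	printf("imbalance_tasks     = %lu task(s)\n",
	       imbalance_tasks(nr_running, weight));
	return 0;
}

For this hypothetical three-tasks-on-two-CPUs group it prints 512
[load/capacity] and 1 task respectively.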