From patchwork Thu May 7 15:31:25 2015
X-Patchwork-Submitter: Daniel Lezcano
X-Patchwork-Id: 48131
Message-ID: <554B854D.9060109@linaro.org>
Date: Thu, 07 May 2015 17:31:25 +0200
From: Daniel Lezcano <daniel.lezcano@linaro.org>
To: Peter Zijlstra
CC: rjw@rjwysocki.net, linux-kernel@vger.kernel.org,
 linux-pm@vger.kernel.org, nicolas.pitre@linaro.org, Ingo Molnar
Subject: Re: [PATCH 3/3] sched: fair: Fix wrong idle timestamp usage
References: <1429092024-20498-1-git-send-email-daniel.lezcano@linaro.org>
 <1429092024-20498-3-git-send-email-daniel.lezcano@linaro.org>
 <20150415121831.GU5029@twins.programming.kicks-ass.net>
 <552E8715.4060601@linaro.org>
 <20150415160200.GU23123@twins.programming.kicks-ass.net>
In-Reply-To: <20150415160200.GU23123@twins.programming.kicks-ass.net>
X-Mailing-List: linux-pm@vger.kernel.org

On 04/15/2015 06:02 PM, Peter Zijlstra wrote:
> On Wed, Apr 15, 2015 at 05:43:17PM +0200, Daniel Lezcano wrote:
>> On 04/15/2015 02:18 PM, Peter Zijlstra wrote:
>>> On Wed, Apr 15, 2015 at 12:00:24PM +0200, Daniel Lezcano wrote:
>>>> find_idlest_cpu assumes the rq->idle_stamp information reflects when
>>>> the cpu entered the idle state. This is wrong, as the cpu may exit
>>>> and re-enter the idle state several times without rq->idle_stamp
>>>> being updated.
>>>
>>> Sure, but you forgot to tell us why it matters.
>>
>> Yes, right. Thanks for pointing this out.
>>
>> Assume we are in a situation where there are several idle cpus in the
>> same idle state.
>>
>> With the current code, the function find_idlest_cpu will choose the
>> cpu with the shortest idle duration. This information is based on the
>> rq->idle_stamp variable and is correct until one of the idle cpus
>> exits the cpuidle_enter function and re-enters it. As soon as that
>> happens, the rq->idle_stamp value is no longer reliable information.
>>
>> Example:
>>
>> * CPU0 and CPU1 are running
>> * CPU2 and CPU3 are in the C3 state
>> * CPU2 entered idle at T2
>> * CPU3 entered idle at T3
>> * T2 < T3
>>
>> The function find_idlest_cpu will choose CPU3 because it has the
>> shorter idle duration.
>>
>> Then CPU3 is woken up by an interrupt, processes it and re-enters
>> idle C3.
>>
>> The information will still give the out-of-date ordering T2 < T3, and
>> find_idlest_cpu will choose CPU2 instead of CPU3.
>>
>> Even if that shouldn't have a big impact on the performance and
>> energy side, we are dealing with wrong information, preventing us
>> from improving the energy side later (eg. preventing the wakeup of a
>> cpu which has not yet reached its target residency).
>
> Right, I figured as much; but no tangible results or behavioural fail
> observed.
>
>>> Urgh, you made horrid code more horrible.
>>>
>>> And all without reason.
>>
>> Ok. What is horrible? The 'if then else' blocks or the algorithm
>> itself?
>
> Yeah the amount and depth of branches. I briefly tried to see if it
> could be fixed but came up empty. Maybe I should try harder :-)

Here's a try (patch below, based on top of this patchset).

There is a cpumask for the idle cpus, set / cleared when entering /
exiting the idle loop. The function find_idlest_cpu uses this mask to
separate idle cpus from running cpus. The benefit of this approach is
that we don't look up all of the sched_group's cpus, but just a subset.
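To make the mask arithmetic concrete, here is a minimal user-space
sketch (plain bitmasks standing in for the kernel cpumask API; the mask
values are invented for illustration) of how the candidate set is
partitioned before any per-cpu lookup happens:

#include <stdio.h>

int main(void)
{
	unsigned long group_cpus     = 0x0f; /* sched_group spans CPU0-3 */
	unsigned long allowed_cpus   = 0x0e; /* task may not run on CPU0 */
	unsigned long idle_cpus_mask = 0x0c; /* CPU2 and CPU3 are idle */

	/* cpumask_and(&tmp, sched_group_cpus(group), tsk_cpus_allowed(p)) */
	unsigned long tmp = group_cpus & allowed_cpus;

	/* cpumask_and(&idle_cpus, &idle_cpus_mask, &tmp) */
	unsigned long idle_cpus = idle_cpus_mask & tmp;

	/* cpumask_andnot(&running_cpus, &tmp, &idle_cpus_mask) */
	unsigned long running_cpus = tmp & ~idle_cpus_mask;

	printf("candidates: 0x%lx idle: 0x%lx running: 0x%lx\n",
	       tmp, idle_cpus, running_cpus);
	/* prints: candidates: 0xe idle: 0xc running: 0x2 */
	return 0;
}

Only the cpus in idle_cpus (or, when that mask is empty, the cpus left
in running_cpus) are then inspected, instead of the whole group.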
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b44f1ad..b6f671e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4686,67 +4686,89 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	return idlest;
 }
 
-/*
- * find_idlest_cpu - find the idlest cpu among the cpus in group.
- */
-static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+struct cpumask idle_cpus_mask;
+
+static int find_shallowest_idle_cpu(struct cpumask *cpus)
 {
-	unsigned long load, min_load = ULONG_MAX;
-	unsigned int min_exit_latency = UINT_MAX;
 	u64 latest_idle_timestamp = 0;
-	int least_loaded_cpu = this_cpu;
-	int shallowest_idle_cpu = -1;
-	int i;
+	unsigned int min_exit_latency = UINT_MAX;
+	int i, cpu = -1;
 
-	/* Traverse only the allowed CPUs */
-	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
-		if (idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
-			struct cpuidle_state *idle = idle_get_state(rq);
-
-			if (idle) {
-				if (idle->exit_latency < min_exit_latency) {
-					/*
-					 * We give priority to a CPU
-					 * whose idle state has the
-					 * smallest exit latency
-					 * irrespective of any idle
-					 * timestamp.
-					 */
-					min_exit_latency = idle->exit_latency;
-					latest_idle_timestamp = idle->idle_stamp;
-					shallowest_idle_cpu = i;
-				} else if (idle->exit_latency == min_exit_latency &&
-					   idle->idle_stamp > latest_idle_timestamp) {
-					/*
-					 * If the CPU is in the same
-					 * idle state, choose the more
-					 * recent one as it might have
-					 * a warmer cache
-					 */
-					latest_idle_timestamp = idle->idle_stamp;
-					shallowest_idle_cpu = i;
-				}
-			} else if (rq->idle_stamp > latest_idle_timestamp) {
-				/*
-				 * If no active idle state, then the
-				 * most recent idled CPU might have a
-				 * warmer cache
-				 */
+	for_each_cpu(i, cpus) {
+		struct rq *rq = cpu_rq(i);
+		struct cpuidle_state *state = idle_get_state(rq);
+
+		if (!state) {
+			if (rq->idle_stamp > latest_idle_timestamp) {
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
-			}
-		} else if (shallowest_idle_cpu == -1) {
-			load = weighted_cpuload(i);
-			if (load < min_load || (load == min_load && i == this_cpu)) {
-				min_load = load;
-				least_loaded_cpu = i;
+				cpu = i;
 			}
+
+			continue;
+		}
+
+		if (state->exit_latency < min_exit_latency) {
+			/*
+			 * We give priority to a CPU whose idle state
+			 * has the smallest exit latency irrespective
+			 * of any idle timestamp.
+			 */
+			min_exit_latency = state->exit_latency;
+			latest_idle_timestamp = state->idle_stamp;
+			cpu = i;
+
+			continue;
+		}
+
+		if (state->exit_latency == min_exit_latency &&
+		    state->idle_stamp > latest_idle_timestamp) {
+			/*
+			 * If the CPU is in the same idle state,
+			 * choose the more recent one as it might have
+			 * a warmer cache
+			 */
+			latest_idle_timestamp = state->idle_stamp;
+			cpu = i;
+		}
+	}
+
+	return cpu;
+}
+
+static int find_least_loaded_cpu(struct cpumask *cpus)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, cpu, this_cpu;
+
+	cpu = this_cpu = smp_processor_id();
+
+	for_each_cpu(i, cpus) {
+		load = weighted_cpuload(i);
+		if (load < min_load || (load == min_load && i == this_cpu)) {
+			min_load = load;
+			cpu = i;
 		}
 	}
 
-	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+	return cpu;
+}
+
+/*
+ * find_idlest_cpu - find the idlest cpu among the cpus in group.
+ */
+static int
+find_idlest_cpu(struct sched_group *group, struct task_struct *p)
+{
+	struct cpumask tmp, idle_cpus, running_cpus;
+
+	cpumask_and(&tmp, sched_group_cpus(group), tsk_cpus_allowed(p));
+
+	cpumask_and(&idle_cpus, &idle_cpus_mask, &tmp);
+
+	cpumask_andnot(&running_cpus, &tmp, &idle_cpus_mask);
+
+	return !cpumask_empty(&idle_cpus) ?
+		find_shallowest_idle_cpu(&idle_cpus) :
+		find_least_loaded_cpu(&running_cpus);
 }
 
 /*
@@ -4887,7 +4909,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
-		new_cpu = find_idlest_cpu(group, p, cpu);
+		new_cpu = find_idlest_cpu(group, p);
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 80014a1..b2aa008 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -217,6 +217,8 @@ use_default:
  */
 static void cpu_idle_loop(void)
 {
+	int cpu = smp_processor_id();
+
 	while (1) {
 		/*
 		 * If the arch has a polling bit, we maintain an invariant:
@@ -230,6 +232,8 @@ static void cpu_idle_loop(void)
 		__current_set_polling();
 		tick_nohz_idle_enter();
 
+		cpumask_set_cpu(cpu, &idle_cpus_mask);
+
 		while (!need_resched()) {
 			check_pgt_cache();
 			rmb();
@@ -257,6 +261,8 @@ static void cpu_idle_loop(void)
 			arch_cpu_idle_exit();
 		}
 
+		cpumask_clear_cpu(cpu, &idle_cpus_mask);
+
 		/*
 		 * Since we fell out of the loop above, we know
 		 * TIF_NEED_RESCHED must be set, propagate it into
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e1299..a14d833 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -544,6 +544,8 @@ struct root_domain {
 
 extern struct root_domain def_root_domain;
 
+extern struct cpumask idle_cpus_mask;
+
 #endif /* CONFIG_SMP */
 
 /*
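For completeness, the tie-breaking policy of find_shallowest_idle_cpu()
can be checked in isolation. A minimal user-space sketch, with invented
exit-latency and timestamp values, replaying the CPU2/CPU3 example from
above:

#include <stdio.h>
#include <limits.h>

/* Stand-in for the (exit_latency, idle_stamp) pair read via cpuidle */
struct idle_info {
	unsigned int exit_latency;
	unsigned long long idle_stamp;
};

int main(void)
{
	/* CPU2 and CPU3 both in C3 (same exit latency); CPU3 idled later */
	struct idle_info cpus[4] = {
		[2] = { .exit_latency = 150, .idle_stamp = 1000 }, /* T2 */
		[3] = { .exit_latency = 150, .idle_stamp = 2000 }, /* T3 > T2 */
	};
	unsigned int min_exit_latency = UINT_MAX;
	unsigned long long latest_idle_timestamp = 0;
	int i, cpu = -1;

	for (i = 2; i <= 3; i++) {
		if (cpus[i].exit_latency < min_exit_latency) {
			/* Shallowest state wins, whatever the timestamp */
			min_exit_latency = cpus[i].exit_latency;
			latest_idle_timestamp = cpus[i].idle_stamp;
			cpu = i;
		} else if (cpus[i].exit_latency == min_exit_latency &&
			   cpus[i].idle_stamp > latest_idle_timestamp) {
			/* Same state: prefer the more recent (warmer) one */
			latest_idle_timestamp = cpus[i].idle_stamp;
			cpu = i;
		}
	}

	printf("chosen cpu: %d\n", cpu); /* prints: chosen cpu: 3 */
	return 0;
}

Same idle state, more recent timestamp: CPU3 wins the tie-break, which
is only correct as long as the timestamps themselves are kept accurate.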