From patchwork Thu May 7 15:31:25 2015
X-Patchwork-Submitter: Daniel Lezcano
X-Patchwork-Id: 48131
Message-ID: <554B854D.9060109@linaro.org>
Date: Thu, 07 May 2015 17:31:25 +0200
From: Daniel Lezcano <daniel.lezcano@linaro.org>
To: Peter Zijlstra
CC: rjw@rjwysocki.net, linux-kernel@vger.kernel.org,
 linux-pm@vger.kernel.org, nicolas.pitre@linaro.org, Ingo Molnar
Subject: Re: [PATCH 3/3] sched: fair: Fix wrong idle timestamp usage
References: <1429092024-20498-1-git-send-email-daniel.lezcano@linaro.org>
 <1429092024-20498-3-git-send-email-daniel.lezcano@linaro.org>
 <20150415121831.GU5029@twins.programming.kicks-ass.net>
 <552E8715.4060601@linaro.org>
 <20150415160200.GU23123@twins.programming.kicks-ass.net>
In-Reply-To: <20150415160200.GU23123@twins.programming.kicks-ass.net>
X-Mailing-List: linux-pm@vger.kernel.org

On 04/15/2015 06:02 PM, Peter Zijlstra wrote:
> On Wed, Apr 15, 2015 at 05:43:17PM +0200, Daniel Lezcano wrote:
>> On 04/15/2015 02:18 PM, Peter Zijlstra wrote:
>>> On Wed, Apr 15, 2015 at 12:00:24PM +0200, Daniel Lezcano wrote:
>>>> find_idlest_cpu assumes the rq->idle_stamp information reflects when
>>>> the cpu entered the idle state. This is wrong, as the cpu may exit
>>>> and re-enter the idle state several times without rq->idle_stamp
>>>> being updated.
>>>
>>> Sure, but you forgot to tell us why it matters.
>>
>> Yes, right. Thanks for pointing this out.
>>
>> Assume we are in a situation where there are several idle cpus in the
>> same idle state.
>>
>> With the current code, the function find_idlest_cpu will choose the
>> cpu with the shortest idle duration. This information is based on the
>> rq->idle_stamp variable and is correct until one of the idle cpus
>> exits the cpuidle_enter function and re-enters it. As soon as that
>> happens, the rq->idle_stamp value is no longer reliable information.
>>
>> Example:
>>
>> * CPU0 and CPU1 are running
>> * CPU2 and CPU3 are in the C3 state
>> * CPU2 entered idle at T2
>> * CPU3 entered idle at T3
>> * T2 < T3
>>
>> The function find_idlest_cpu will choose CPU3 because it has the
>> shorter idle duration.
>>
>> Then CPU3 is woken up by an interrupt, processes it and re-enters
>> idle C3.
>>
>> The information will still give the out-of-date ordering T2 < T3, and
>> find_idlest_cpu will choose CPU2 instead of CPU3.
>>
>> Even if that shouldn't have a big impact on the performance and
>> energy side, we are dealing with wrong information, preventing us
>> from improving the energy side later (eg. preventing the wakeup of a
>> cpu which has not yet reached its target residency).
>
> Right, I figured as much; but no tangible results or behavioural fail
> observed.
>
>>> Urgh, you made horrid code more horrible.
>>>
>>> And all without reason.
>>
>> Ok. What is horrible? The 'if then else' blocks or the algorithm
>> itself?
>
> Yeah the amount and depth of branches. I briefly tried to see if it
> could be fixed but came up empty. Maybe I should try harder :-)

Here's a try (patch below, based on top of this patchset).

There is a cpumask for the idle cpus, set / cleared when entering /
exiting the idle loop. The function find_idlest_cpu uses this mask to
separate idle cpus from running cpus. The benefit of this approach is
that we don't look up all of the sched_group's cpus, but just a subset.
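To make the mask arithmetic concrete, here is a minimal user-space
sketch (plain bitmasks standing in for the kernel cpumask API; the mask
values are invented for illustration) of how the candidate set is
partitioned before any per-cpu lookup happens:

#include <stdio.h>

int main(void)
{
	unsigned long group_cpus     = 0x0f; /* sched_group spans CPU0-3 */
	unsigned long allowed_cpus   = 0x0e; /* task may not run on CPU0 */
	unsigned long idle_cpus_mask = 0x0c; /* CPU2 and CPU3 are idle */

	/* cpumask_and(&tmp, sched_group_cpus(group), tsk_cpus_allowed(p)) */
	unsigned long tmp = group_cpus & allowed_cpus;

	/* cpumask_and(&idle_cpus, &idle_cpus_mask, &tmp) */
	unsigned long idle_cpus = idle_cpus_mask & tmp;

	/* cpumask_andnot(&running_cpus, &tmp, &idle_cpus_mask) */
	unsigned long running_cpus = tmp & ~idle_cpus_mask;

	printf("candidates: 0x%lx idle: 0x%lx running: 0x%lx\n",
	       tmp, idle_cpus, running_cpus);
	/* prints: candidates: 0xe idle: 0xc running: 0x2 */
	return 0;
}

Only the cpus in idle_cpus (or, when that mask is empty, the cpus left
in running_cpus) are then inspected, instead of the whole group.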
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b44f1ad..b6f671e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4686,67 +4686,89 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	return idlest;
 }
 
-/*
- * find_idlest_cpu - find the idlest cpu among the cpus in group.
- */
-static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+struct cpumask idle_cpus_mask;
+
+static int find_shallowest_idle_cpu(struct cpumask *cpus)
 {
-	unsigned long load, min_load = ULONG_MAX;
-	unsigned int min_exit_latency = UINT_MAX;
 	u64 latest_idle_timestamp = 0;
-	int least_loaded_cpu = this_cpu;
-	int shallowest_idle_cpu = -1;
-	int i;
+	unsigned int min_exit_latency = UINT_MAX;
+	int i, cpu = -1;
 
-	/* Traverse only the allowed CPUs */
-	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
-		if (idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
-			struct cpuidle_state *idle = idle_get_state(rq);
-
-			if (idle) {
-				if (idle->exit_latency < min_exit_latency) {
-					/*
-					 * We give priority to a CPU
-					 * whose idle state has the
-					 * smallest exit latency
-					 * irrespective of any idle
-					 * timestamp.
-					 */
-					min_exit_latency = idle->exit_latency;
-					latest_idle_timestamp = idle->idle_stamp;
-					shallowest_idle_cpu = i;
-				} else if (idle->exit_latency == min_exit_latency &&
-					   idle->idle_stamp > latest_idle_timestamp) {
-					/*
-					 * If the CPU is in the same
-					 * idle state, choose the more
-					 * recent one as it might have
-					 * a warmer cache
-					 */
-					latest_idle_timestamp = idle->idle_stamp;
-					shallowest_idle_cpu = i;
-				}
-			} else if (rq->idle_stamp > latest_idle_timestamp) {
-				/*
-				 * If no active idle state, then the
-				 * most recent idled CPU might have a
-				 * warmer cache
-				 */
+	for_each_cpu(i, cpus) {
+		struct rq *rq = cpu_rq(i);
+		struct cpuidle_state *state = idle_get_state(rq);
+
+		if (!state) {
+			if (rq->idle_stamp > latest_idle_timestamp) {
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
-			}
-		} else if (shallowest_idle_cpu == -1) {
-			load = weighted_cpuload(i);
-			if (load < min_load || (load == min_load && i == this_cpu)) {
-				min_load = load;
-				least_loaded_cpu = i;
+				cpu = i;
 			}
+
+			continue;
+		}
+
+		if (state->exit_latency < min_exit_latency) {
+			/*
+			 * We give priority to a CPU whose idle state
+			 * has the smallest exit latency irrespective
+			 * of any idle timestamp.
+			 */
+			min_exit_latency = state->exit_latency;
+			latest_idle_timestamp = state->idle_stamp;
+			cpu = i;
+
+			continue;
+		}
+
+		if (state->exit_latency == min_exit_latency &&
+		    state->idle_stamp > latest_idle_timestamp) {
+			/*
+			 * If the CPU is in the same idle state,
+			 * choose the more recent one as it might have
+			 * a warmer cache
+			 */
+			latest_idle_timestamp = state->idle_stamp;
+			cpu = i;
+		}
+	}
+
+	return cpu;
+}
+
+static int find_least_loaded_cpu(struct cpumask *cpus)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int i, cpu, this_cpu;
+
+	cpu = this_cpu = smp_processor_id();
+
+	for_each_cpu(i, cpus) {
+		load = weighted_cpuload(i);
+		if (load < min_load || (load == min_load && i == this_cpu)) {
+			min_load = load;
+			cpu = i;
 		}
 	}
 
-	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+	return cpu;
+}
+
+/*
+ * find_idlest_cpu - find the idlest cpu among the cpus in group.
+ */
+static int
+find_idlest_cpu(struct sched_group *group, struct task_struct *p)
+{
+	struct cpumask tmp, idle_cpus, running_cpus;
+
+	cpumask_and(&tmp, sched_group_cpus(group), tsk_cpus_allowed(p));
+
+	cpumask_and(&idle_cpus, &idle_cpus_mask, &tmp);
+
+	cpumask_andnot(&running_cpus, &tmp, &idle_cpus_mask);
+
+	return !cpumask_empty(&idle_cpus) ?
+		find_shallowest_idle_cpu(&idle_cpus) :
+		find_least_loaded_cpu(&running_cpus);
 }
 
 /*
@@ -4887,7 +4909,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
-		new_cpu = find_idlest_cpu(group, p, cpu);
+		new_cpu = find_idlest_cpu(group, p);
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 80014a1..b2aa008 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -217,6 +217,8 @@ use_default:
  */
 static void cpu_idle_loop(void)
 {
+	int cpu = smp_processor_id();
+
 	while (1) {
 		/*
 		 * If the arch has a polling bit, we maintain an invariant:
@@ -230,6 +232,8 @@ static void cpu_idle_loop(void)
 		__current_set_polling();
 		tick_nohz_idle_enter();
 
+		cpumask_set_cpu(cpu, &idle_cpus_mask);
+
 		while (!need_resched()) {
 			check_pgt_cache();
 			rmb();
@@ -257,6 +261,8 @@ static void cpu_idle_loop(void)
 			arch_cpu_idle_exit();
 		}
 
+		cpumask_clear_cpu(cpu, &idle_cpus_mask);
+
 		/*
 		 * Since we fell out of the loop above, we know
 		 * TIF_NEED_RESCHED must be set, propagate it into
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e1299..a14d833 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -544,6 +544,8 @@ struct root_domain {
 
 extern struct root_domain def_root_domain;
 
+extern struct cpumask idle_cpus_mask;
+
 #endif /* CONFIG_SMP */
 
 /*
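For completeness, the tie-breaking policy of find_shallowest_idle_cpu()
can be checked in isolation. A minimal user-space sketch, with invented
exit-latency and timestamp values, replaying the CPU2/CPU3 example from
above:

#include <stdio.h>
#include <limits.h>

/* Stand-in for the (exit_latency, idle_stamp) pair read via cpuidle */
struct idle_info {
	unsigned int exit_latency;
	unsigned long long idle_stamp;
};

int main(void)
{
	/* CPU2 and CPU3 both in C3 (same exit latency); CPU3 idled later */
	struct idle_info cpus[4] = {
		[2] = { .exit_latency = 150, .idle_stamp = 1000 }, /* T2 */
		[3] = { .exit_latency = 150, .idle_stamp = 2000 }, /* T3 > T2 */
	};
	unsigned int min_exit_latency = UINT_MAX;
	unsigned long long latest_idle_timestamp = 0;
	int i, cpu = -1;

	for (i = 2; i <= 3; i++) {
		if (cpus[i].exit_latency < min_exit_latency) {
			/* Shallowest state wins, whatever the timestamp */
			min_exit_latency = cpus[i].exit_latency;
			latest_idle_timestamp = cpus[i].idle_stamp;
			cpu = i;
		} else if (cpus[i].exit_latency == min_exit_latency &&
			   cpus[i].idle_stamp > latest_idle_timestamp) {
			/* Same state: prefer the more recent (warmer) one */
			latest_idle_timestamp = cpus[i].idle_stamp;
			cpu = i;
		}
	}

	printf("chosen cpu: %d\n", cpu); /* prints: chosen cpu: 3 */
	return 0;
}

Same idle state, more recent timestamp: CPU3 wins the tie-break, which
is only correct as long as the timestamps themselves are kept accurate.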