From patchwork Thu Jun 16 16:30:13 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Vincent Guittot <vincent.guittot@linaro.org>
X-Patchwork-Id: 70200
Delivered-To: patch@linaro.org
Received: by 10.140.28.4 with SMTP id 4csp350336qgy;
 Thu, 16 Jun 2016 09:30:28 -0700 (PDT)
X-Received: by 10.66.66.42 with SMTP id c10mr6120147pat.119.1466094628762;
 Thu, 16 Jun 2016 09:30:28 -0700 (PDT)
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67])
 by mx.google.com with ESMTP id i5si6692460pfk.100.2016.06.16.09.30.28;
 Thu, 16 Jun 2016 09:30:28 -0700 (PDT)
Received-SPF: pass (google.com: best guess record for domain of
 linux-kernel-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender) client-ip=209.132.180.67; 
Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org;
 spf=pass (google.com: best guess record for domain of
 linux-kernel-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org; 
 dmarc=pass (p=NONE dis=NONE) header.from=linaro.org
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1754177AbcFPQa0 (ORCPT <rfc822;julien.grall@linaro.org>
 + 30 others); Thu, 16 Jun 2016 12:30:26 -0400
Received: from mail-wm0-f41.google.com ([74.125.82.41]:37814 "EHLO
 mail-wm0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1751490AbcFPQaY (ORCPT
 <rfc822;linux-kernel@vger.kernel.org>);
 Thu, 16 Jun 2016 12:30:24 -0400
Received: by mail-wm0-f41.google.com with SMTP id a66so65855929wme.0
 for <linux-kernel@vger.kernel.org>;
 Thu, 16 Jun 2016 09:30:24 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; 
 h=date:from:to:cc:subject:message-id:references:mime-version
 :content-disposition:content-transfer-encoding:in-reply-to
 :user-agent; bh=v7AlASYUavSs0mPIM0Jqs700hhK0cF0e/OcAEy4H3U4=;
 b=E7DXjaYcXzMEBV9Cbj4Tj7BfpfNfxMWdCc5mLRR9GB1PPZTOr1seztD8roawsF39Am
 /y9+bp2OLUd7wPi0Oqs5geSEDXoXWiCAn35VCbY2XJSTIcsoxnI4GKvSw3Xp4/ky6mPh
 AFD+0sUrYRzp+kMGR0Ishchv4XQiZG22GGaKg=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:references
 :mime-version:content-disposition:content-transfer-encoding
 :in-reply-to:user-agent;
 bh=v7AlASYUavSs0mPIM0Jqs700hhK0cF0e/OcAEy4H3U4=;
 b=byjhYllpJg9sriRgiqxX5RwBERbbF88yxMFyOoCFLf++O7uLD1iC5sAPxgeboIjGj7
 S5qgGFQc5NL4S4GrD8A/S5E31Yo64qKfJ6XPjj1pFjeBTDlL5K98j9ddCA94kn0AoWKF
 Kb3Q3VLcBiJyHWy/dcsIXHZWMCFM8LovkbAOEmpEZuZhAzgPHTqVBncUOhB3vVsPeV7n
 02GkFoM3+ScWNh4KGAhta1uZHlTktT8nmMA1VGIpWuC45+lk8xg8MGftOim4CNx3YK1q
 013TWOhCzHHhNKKUh5+MJW6r+0j5GHJWMRRn7ii61cYboiqbUdh+SCKrGMGhMWLTZI7D
 W7pw==
X-Gm-Message-State: ALyK8tJpbSE9nuCzut4LD3mf4OejeIOLo45+d+jy8/y9m8Wba0j/tUsvXwOMNHcE5EDPKxVc
X-Received: by 10.194.86.70 with SMTP id n6mr581848wjz.154.1466094618037;
 Thu, 16 Jun 2016 09:30:18 -0700 (PDT)
Received: from vingu-laptop ([2a01:e35:8bd4:7750:f5a4:bcf7:c400:4554])
 by smtp.gmail.com with ESMTPSA id
 g10sm22552682wjl.25.2016.06.16.09.30.14
 (version=TLS1_2 cipher=AES128-SHA bits=128/128);
 Thu, 16 Jun 2016 09:30:15 -0700 (PDT)
Date: Thu, 16 Jun 2016 18:30:13 +0200
From: Vincent Guittot <vincent.guittot@linaro.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Yuyang Du <yuyang.du@intel.com>, Ingo Molnar <mingo@kernel.org>,
 linux-kernel <linux-kernel@vger.kernel.org>,
 Mike Galbraith <umgwanakikbuti@gmail.com>,
 Benjamin Segall <bsegall@google.com>, Paul Turner <pjt@google.com>,
 Morten Rasmussen <morten.rasmussen@arm.com>,
 Dietmar Eggemann <dietmar.eggemann@arm.com>,
 Matt Fleming <matt@codeblueprint.co.uk>
Subject: Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice
 when switching to fair or changing task group
Message-ID: <20160616163013.GA32169@vingu-laptop>
References: <1465942870-28419-1-git-send-email-yuyang.du@intel.com>
 <1465942870-28419-2-git-send-email-yuyang.du@intel.com>
 <CAKfTPtC6SqsTkH4u6GT_64wwVr9t0mf8J0TchxSQGfbH6oAX9A@mail.gmail.com>
 <20160615152217.GN30921@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20160615152217.GN30921@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wednesday 15 Jun 2016 at 17:22:17 (+0200), Peter Zijlstra wrote:
> On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
> > I still have concerned with this change of the behavior that  attaches
> > the task only when it is enqueued. The load avg of the task will not
> > be decayed between the time we move it into its new group until its
> > enqueue. With this change, a task's load can stay high whereas it has
> > slept for the last couple of seconds. Then, its load and utilization
> > is no more accounted anywhere in the mean time just because we have
> > moved the task which will be enqueued on the same rq.
> > A task should always be attached to a cfs_rq and its load/utilization
> > should always be accounted on a cfs_rq and decayed for its sleep
> > period
> 
> OK; so I think I agree with that. Does the below (completely untested,
> hasn't even been near a compiler) look reasonable?
> 
> The general idea is to always attach to a cfs_rq -- through
> post_init_entity_util_avg(). This covers both the new task isn't
> attached yet and was never in the fair class to begin with issues.

Your patch ensures that a task will be attached to a cfs_rq, and it
fixes the issue raised by Yuyang that was caused by
se->avg.last_update_time = 0 at init. During testing, the following
message was raised:
"BUG: using smp_processor_id() in preemptible [00000000] code: systemd/1"
It comes from cfs_rq_util_change(), which is called in
attach_entity_load_avg().

With patch [1] for the init of the cfs_rq side, all use cases related to
last_update_time being set to 0 at init will be covered.

[1] https://lkml.org/lkml/2016/5/30/508

> 
> That only leaves a tiny hole in fork() where the task is hashed but
> hasn't yet passed through wake_up_new_task() in which someone can do
> cgroup move on it. That is closed with TASK_NEW and can_attach()
> refusing those tasks.
> 

But a new fair task is still detached from and attached to a task_group
through
cgroup_post_fork()-->ss->fork(child)-->cpu_cgroup_fork()-->sched_move_task()-->task_move_group_fair().
cpu_cgroup_can_attach() is not used in this path, and sched_move_task()
does the move unconditionally for a fair task.

With your patch, we still have the sequence:

sched_fork()
    set_task_cpu()
cgroup_post_fork()--> ... --> task_move_group_fair()
    detach_task_cfs_rq()
    set_task_rq()
    attach_task_cfs_rq()
wake_up_new_task()
    select_task_rq() can select a new cpu
    set_task_cpu()
        migrate_task_rq_fair() if new_cpu != task_cpu
            remove_load()
        __set_task_cpu
    post_init_entity_util_avg
        attach_task_cfs_rq()
    activate_task
        enqueue_task

In fact, cpu_cgroup_fork() needs only a small part of sched_move_task(),
so we can call that part directly instead of calling sched_move_task().
And the task doesn't really migrate, because it is not yet attached, so
we need the sequence:

sched_fork()
    __set_task_cpu()
cgroup_post_fork()--> ... --> task_move_group_fair()
    set_task_rq() to set task group and runqueue
wake_up_new_task()
    select_task_rq() can select a new cpu
    __set_task_cpu
    post_init_entity_util_avg
        attach_task_cfs_rq()
    activate_task
        enqueue_task
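
The two sequences above can be contrasted with a small userspace model. This is only a sketch: struct toy_task and every toy_* function are hypothetical names invented for illustration, not kernel code, and the model only counts attach/detach operations rather than computing any load. It assumes that detach_task_cfs_rq() and remove_load() can both be treated as "remove the task's contribution from its cfs_rq".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a task; none of these fields exist in the kernel. */
struct toy_task {
	bool attached;               /* load currently accounted on a cfs_rq? */
	int cpu;
	int attach_calls;            /* number of attach operations */
	int unattached_detach_calls; /* detach/remove on an unattached task */
};

static void toy_attach(struct toy_task *t)
{
	t->attach_calls++;
	t->attached = true;
}

/* Models both detach_task_cfs_rq() and remove_load() from the listings. */
static void toy_detach(struct toy_task *t)
{
	if (!t->attached)
		t->unattached_detach_calls++;
	t->attached = false;
}

/* First listing: the fork path goes through the full group move. */
void toy_sequence_before(struct toy_task *t, int fork_cpu, int new_cpu)
{
	t->cpu = fork_cpu;      /* sched_fork(): set_task_cpu() */
	toy_detach(t);          /* task_move_group_fair(): detach_task_cfs_rq() */
	toy_attach(t);          /* task_move_group_fair(): attach_task_cfs_rq() */
	if (new_cpu != t->cpu)  /* wake_up_new_task(): set_task_cpu() */
		toy_detach(t);  /* migrate_task_rq_fair(): remove_load() */
	t->cpu = new_cpu;
	toy_attach(t);          /* post_init_entity_util_avg() */
	assert(t->attached);
}

/* Second listing: only pointers are set until wake_up_new_task(). */
void toy_sequence_after(struct toy_task *t, int fork_cpu, int new_cpu)
{
	t->cpu = fork_cpu;      /* sched_fork(): __set_task_cpu() */
	/* cgroup fork: set_task_rq() only sets group/runqueue pointers */
	t->cpu = new_cpu;       /* wake_up_new_task(): __set_task_cpu() */
	toy_attach(t);          /* post_init_entity_util_avg() */
	assert(t->attached);
}
```

With a zero-initialized toy_task and different fork and wake-up CPUs, toy_sequence_before() records two attach calls and one detach of a task that was never attached, while toy_sequence_after() records a single attach and no such detach.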

The patch below, on top of your patch, ensures that we follow the right sequence:

---
 kernel/sched/core.c | 60 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 20 deletions(-)

-- 
1.9.1

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7895689a..a21e3dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2373,7 +2373,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	 * Silence PROVE_RCU.
 	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	set_task_cpu(p, cpu);
+	__set_task_cpu(p, cpu);
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -2515,7 +2515,7 @@ void wake_up_new_task(struct task_struct *p)
 	 *  - cpus_allowed can change in the fork path
 	 *  - any previously selected cpu might disappear through hotplug
 	 */
-	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	/* Post initialize new task's util average when its cfs_rq is set */
 	post_init_entity_util_avg(&p->se);
@@ -7715,6 +7715,35 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
+/* Set task's runqueue and group
+ *     In case of a move between groups, we update the src and dst groups
+ *     through sched_class->task_move_group. Otherwise, we just need to set
+ *     the runqueue and group pointers. The task will be attached to the
+ *     runqueue during its wake up.
+ */
+static void sched_set_group(struct task_struct *tsk, bool move)
+{
+	struct task_group *tg;
+
+	/*
+	 * All callers are synchronized by task_rq_lock(); we do not use RCU
+	 * which is pointless here. Thus, we pass "true" to task_css_check()
+	 * to prevent lockdep warnings.
+	 */
+	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (move && tsk->sched_class->task_move_group)
+		tsk->sched_class->task_move_group(tsk);
+	else
+#endif
+		set_task_rq(tsk, task_cpu(tsk));
+
+}
+
 /* change task's runqueue when it moves between groups.
  *	The caller of this function should have put the task in its new group
  *	by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
@@ -7722,7 +7751,6 @@ void sched_offline_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
-	struct task_group *tg;
 	int queued, running;
 	struct rq_flags rf;
 	struct rq *rq;
@@ -7737,22 +7765,7 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		put_prev_task(rq, tsk);
 
-	/*
-	 * All callers are synchronized by task_rq_lock(); we do not use RCU
-	 * which is pointless here. Thus, we pass "true" to task_css_check()
-	 * to prevent lockdep warnings.
-	 */
-	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
-			  struct task_group, css);
-	tg = autogroup_task_group(tsk, tg);
-	tsk->sched_task_group = tg;
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (tsk->sched_class->task_move_group)
-		tsk->sched_class->task_move_group(tsk);
-	else
-#endif
-		set_task_rq(tsk, task_cpu(tsk));
+	sched_set_group(tsk, true);
 
 	if (unlikely(running))
 		tsk->sched_class->set_curr_task(rq);
@@ -8182,7 +8195,14 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 
 static void cpu_cgroup_fork(struct task_struct *task)
 {
-	sched_move_task(task);
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(task, &rf);
+
+	sched_set_group(task, false);
+
+	task_rq_unlock(rq, task, &rf);
 }
 
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)