X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 70200
Date: Thu, 16 Jun 2016 18:30:13 +0200
From: Vincent Guittot
To: Peter Zijlstra
Cc: Yuyang Du, Ingo Molnar, linux-kernel, Mike Galbraith, Benjamin Segall,
    Paul Turner, Morten Rasmussen, Dietmar Eggemann, Matt Fleming
Subject: Re: [PATCH v6 1/4] sched/fair: Fix attaching task sched avgs twice
    when switching to fair or changing task group
Message-ID: <20160616163013.GA32169@vingu-laptop>
References: <1465942870-28419-1-git-send-email-yuyang.du@intel.com>
    <1465942870-28419-2-git-send-email-yuyang.du@intel.com>
    <20160615152217.GN30921@twins.programming.kicks-ass.net>
In-Reply-To: <20160615152217.GN30921@twins.programming.kicks-ass.net>

On Wednesday 15 Jun 2016 at 17:22:17 (+0200), Peter Zijlstra wrote:
> On Wed, Jun 15, 2016 at 09:46:53AM +0200, Vincent Guittot wrote:
> > I still have concerned with this change of the behavior that attaches
> > the task only when it is enqueued. The load avg of the task will not
> > be decayed between the time we move it into its new group until its
> > enqueue. With this change, a task's load can stay high whereas it has
> > slept for the last couple of seconds. Then, its load and utilization
> > is no more accounted anywhere in the mean time just because we have
> > moved the task which will be enqueued on the same rq.
> > A task should always be attached to a cfs_rq and its load/utilization
> > should always be accounted on a cfs_rq and decayed for its sleep
> > period
>
> OK; so I think I agree with that. Does the below (completely untested,
> hasn't even been near a compiler) look reasonable?
>
> The general idea is to always attach to a cfs_rq -- through
> post_init_entity_util_avg(). This covers both the new task isn't
> attached yet and was never in the fair class to begin with issues.

Your patch ensures that a task will be attached to a cfs_rq and fixes
the issue raised by Yuyang, which is caused by
se->avg.last_update_time = 0 at init.

During the test, the following message was raised:
"BUG: using smp_processor_id() in preemptible [00000000] code: systemd/1"
because of cfs_rq_util_change(), which is called in
attach_entity_load_avg().

With patch [1] for the init of the cfs_rq side, all use cases will be
covered regarding the issue linked to last_update_time being set to 0
at init.

[1] https://lkml.org/lkml/2016/5/30/508

>
> That only leaves a tiny hole in fork() where the task is hashed but
> hasn't yet passed through wake_up_new_task() in which someone can do
> cgroup move on it. That is closed with TASK_NEW and can_attach()
> refusing those tasks.
>

But a new fair task is still detached and attached from/to task_group
with

cgroup_post_fork()-->ss->fork(child)-->cpu_cgroup_fork()-->sched_move_task()-->task_move_group_fair().

cpu_cgroup_can_attach() is not used in this path, and sched_move_task()
does the move unconditionally for fair tasks.

With your patch, we still have the sequence:

sched_fork()
    set_task_cpu()
cgroup_post_fork()--> ... --> task_move_group_fair()
    detach_task_cfs_rq()
    set_task_rq()
    attach_task_cfs_rq()
wake_up_new_task()
    select_task_rq() can select a new cpu
    set_task_cpu()
        migrate_task_rq_fair if the new_cpu != task_cpu
             remove_load()
        __set_task_cpu
    post_init_entity_util_avg
        attach_task_cfs_rq()
    activate_task
        enqueue_task

In fact, cpu_cgroup_fork() needs only a small part of sched_move_task(),
so we can just call this small part directly instead of
sched_move_task(). And the task doesn't really migrate because it is
not yet attached, so we need the sequence:

sched_fork()
    __set_task_cpu()
cgroup_post_fork()--> ... --> task_move_group_fair()
    set_task_rq() to set task group and runqueue
wake_up_new_task()
    select_task_rq() can select a new cpu
    __set_task_cpu
    post_init_entity_util_avg
        attach_task_cfs_rq()
    activate_task
        enqueue_task

The patch below, on top of your patch, ensures that we follow the right
sequence:

---
 kernel/sched/core.c | 60 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 20 deletions(-)

-- 
1.9.1

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7895689a..a21e3dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2373,7 +2373,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	 * Silence PROVE_RCU.
 	 */
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
-	set_task_cpu(p, cpu);
+	__set_task_cpu(p, cpu);
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -2515,7 +2515,7 @@ void wake_up_new_task(struct task_struct *p)
 	 *  - cpus_allowed can change in the fork path
 	 *  - any previously selected cpu might disappear through hotplug
 	 */
-	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	/* Post initialize new task's util average when its cfs_rq is set */
 	post_init_entity_util_avg(&p->se);
@@ -7715,6 +7715,35 @@ void sched_offline_group(struct task_group *tg)
 	spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
+/* Set task's runqueue and group
+ *	In case of a move between group, we update src and dst group
+ *	thanks to sched_class->task_move_group. Otherwise, we just need to set
+ *	runqueue and group pointers. The task will be attached to the runqueue
+ *	during its wake up.
+ */
+static void sched_set_group(struct task_struct *tsk, bool move)
+{
+	struct task_group *tg;
+
+	/*
+	 * All callers are synchronized by task_rq_lock(); we do not use RCU
+	 * which is pointless here. Thus, we pass "true" to task_css_check()
+	 * to prevent lockdep warnings.
+	 */
+	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
+			  struct task_group, css);
+	tg = autogroup_task_group(tsk, tg);
+	tsk->sched_task_group = tg;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (move && tsk->sched_class->task_move_group)
+		tsk->sched_class->task_move_group(tsk);
+	else
+#endif
+		set_task_rq(tsk, task_cpu(tsk));
+
+}
+
 /* change task's runqueue when it moves between groups.
  *	The caller of this function should have put the task in its new group
  *	by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
@@ -7722,7 +7751,6 @@ void sched_offline_group(struct task_group *tg)
  */
 void sched_move_task(struct task_struct *tsk)
 {
-	struct task_group *tg;
 	int queued, running;
 	struct rq_flags rf;
 	struct rq *rq;
@@ -7737,22 +7765,7 @@ void sched_move_task(struct task_struct *tsk)
 	if (unlikely(running))
 		put_prev_task(rq, tsk);
 
-	/*
-	 * All callers are synchronized by task_rq_lock(); we do not use RCU
-	 * which is pointless here. Thus, we pass "true" to task_css_check()
-	 * to prevent lockdep warnings.
-	 */
-	tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
-			  struct task_group, css);
-	tg = autogroup_task_group(tsk, tg);
-	tsk->sched_task_group = tg;
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (tsk->sched_class->task_move_group)
-		tsk->sched_class->task_move_group(tsk);
-	else
-#endif
-		set_task_rq(tsk, task_cpu(tsk));
+	sched_set_group(tsk, true);
 
 	if (unlikely(running))
 		tsk->sched_class->set_curr_task(rq);
@@ -8182,7 +8195,14 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 
 static void cpu_cgroup_fork(struct task_struct *task)
 {
-	sched_move_task(task);
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(task, &rf);
+
+	sched_set_group(task, false);
+
+	task_rq_unlock(rq, task, &rf);
 }
 
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
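
For readers who want to compare the two fork-path sequences listed
before the patch, here is a rough user-space sketch. It is only a toy
model under stated assumptions, not kernel code: the toy_* structs and
functions are hypothetical stand-ins, load is a plain integer, and PELT
decay, locking and the deferred handling of removed load are not
modelled at all. It only counts attach/detach operations on a new
entity, for the case where select_task_rq() picks a cpu different from
the fork-time one.

```c
#include <assert.h>

/* Toy stand-ins, NOT the kernel's struct cfs_rq / struct sched_entity. */
struct toy_cfs_rq { long load_avg; };
struct toy_se { long load_avg; int attached; };

/* Bookkeeping for what a sequence did to the entity. */
struct toy_stats { int attaches; int detaches; int detaches_unattached; };

static void toy_attach(struct toy_cfs_rq *rq, struct toy_se *se,
                       struct toy_stats *st)
{
    assert(!se->attached);          /* a second attach would double count */
    rq->load_avg += se->load_avg;
    se->attached = 1;
    st->attaches++;
}

static void toy_detach(struct toy_cfs_rq *rq, struct toy_se *se,
                       struct toy_stats *st)
{
    if (!se->attached)
        st->detaches_unattached++;  /* removing load that was never added */
    rq->load_avg -= se->load_avg;
    if (rq->load_avg < 0)
        rq->load_avg = 0;           /* clamp at zero (toy assumption) */
    se->attached = 0;
    st->detaches++;
}

/* First sequence in the mail, assuming new_cpu != task_cpu. */
static struct toy_stats toy_fork_current(struct toy_cfs_rq *fork_rq,
                                         struct toy_cfs_rq *wake_rq,
                                         struct toy_se *se)
{
    struct toy_stats st = {0, 0, 0};

    /* cgroup_post_fork() --> ... --> task_move_group_fair() */
    toy_detach(fork_rq, se, &st);   /* detach_task_cfs_rq() */
    toy_attach(fork_rq, se, &st);   /* attach_task_cfs_rq() */
    /* wake_up_new_task() */
    toy_detach(fork_rq, se, &st);   /* remove_load() on migration */
    toy_attach(wake_rq, se, &st);   /* post_init_entity_util_avg() */
    return st;
}

/* Second sequence: only pointers are set at fork, one attach at wake up. */
static struct toy_stats toy_fork_proposed(struct toy_cfs_rq *fork_rq,
                                          struct toy_cfs_rq *wake_rq,
                                          struct toy_se *se)
{
    struct toy_stats st = {0, 0, 0};

    (void)fork_rq;                  /* set_task_rq(): no load accounting */
    toy_attach(wake_rq, se, &st);   /* post_init_entity_util_avg() */
    return st;
}
```

With a fork-time runqueue that already carries 2048 of other tasks'
load and a new entity of 1024 (both numbers are arbitrary), the first
function reports one detach of a never-attached entity and leaves the
fork-time runqueue at 1024 in this toy model, whereas the second one
leaves it at 2048 and attaches exactly once.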