From patchwork Thu Dec 1 16:38:53 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 86102
Delivered-To: patch@linaro.org
From: Vincent Guittot
To: peterz@infradead.org, mingo@kernel.org, linux-kernel@vger.kernel.org
Cc: pjt@google.com, Vincent Guittot, stable@vger.kernel.org
Subject: [PATCH] sched: fix group_entity's share update
Date: Thu, 1 Dec 2016 17:38:53 +0100
Message-Id: <1480610333-23329-1-git-send-email-vincent.guittot@linaro.org>
X-Mailer: git-send-email 2.7.4
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This produces wrong load_avg accounting, which can be significant
when small tasks are involved in the scheduling.

Let's take the example of a task TA that is dequeued from its task group
TG1. TA was the only task in TG1, which becomes idle.

We have the sequence:

- dequeue_entity TA->se
    - update_load_avg(TA->se)
    - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
    - account_entity_dequeue(TG1->cfs_rq, TA->se)
        TG1->cfs_rq->load.weight = 0
    - update_cfs_shares(TG1->cfs_rq)
        TG1->se->load.weight is updated with the new share of cfs_rq.
        TG1->se->load.weight = 0.
- dequeue_entity TG1->se
    - update_load_avg(TG1->se)
        but its weight is now zero, so the last time slot (up to a tick)
        will be accounted with its new weight (0 in our case) instead of
        its real weight.

The last time slot is accounted as an idle one whereas it was a running
one. If the running time of TA is short enough that no tick happens while
it runs, all the running time of TG1->se will be accounted as idle time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as a parameter instead of
the cfs_rq, and the weight of the group_entity is updated only once its
load_avg has been synced with the current time.

Cc:
Signed-off-by: Vincent Guittot
---

I have seen the problem on tip/sched/core, v4.8 and v4.7. Previous
versions might also have the problem but I have not been able to test
them yet.

 kernel/sched/fair.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

-- 
2.7.4

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18d9e75..19092fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 
-static void update_cfs_shares(struct cfs_rq *cfs_rq)
+static void update_cfs_shares(struct sched_entity *se)
 {
 	struct task_group *tg;
-	struct sched_entity *se;
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
 	long shares;
 
+	if (entity_is_task(se))
+		return;
+
 	tg = cfs_rq->tg;
-	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se || throttled_hierarchy(cfs_rq))
+
+	if (throttled_hierarchy(cfs_rq))
 		return;
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
@@ -2707,8 +2710,10 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
+
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
+static inline void update_cfs_shares(struct sched_entity *se)
 {
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		se->vruntime += cfs_rq->min_vruntime;
 
 	update_load_avg(se, UPDATE_TG);
+	update_cfs_shares(se);
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
-	update_cfs_shares(cfs_rq);
 
 	if (flags & ENQUEUE_WAKEUP)
 		place_entity(cfs_rq, se, 0);
@@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(se);
 
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
@@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * Ensure that runnable average is periodically updated.
 	 */
 	update_load_avg(curr, UPDATE_TG);
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(curr);
 
 #ifdef CONFIG_SCHED_HRTICK
 	/*
@@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 
 		update_load_avg(se, UPDATE_TG);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(se);
 	}
 
 	if (!se)
@@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 
 		update_load_avg(se, UPDATE_TG);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(se);
 	}
 
 	if (!se)
@@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		/* Possible calls to update_curr() need rq clock */
 		update_rq_clock(rq);
 		for_each_sched_entity(se)
-			update_cfs_shares(group_cfs_rq(se));
+			update_cfs_shares(se);
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
 	}
 