From patchwork Thu Aug 30 18:18:32 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Paul E. McKenney" X-Patchwork-Id: 11053 Return-Path: X-Original-To: patchwork@peony.canonical.com Delivered-To: patchwork@peony.canonical.com Received: from fiordland.canonical.com (fiordland.canonical.com [91.189.94.145]) by peony.canonical.com (Postfix) with ESMTP id 757DE23EFE for ; Thu, 30 Aug 2012 18:19:52 +0000 (UTC) Received: from mail-ie0-f180.google.com (mail-ie0-f180.google.com [209.85.223.180]) by fiordland.canonical.com (Postfix) with ESMTP id 25BB8A1818E for ; Thu, 30 Aug 2012 18:19:16 +0000 (UTC) Received: by mail-ie0-f180.google.com with SMTP id k11so996735iea.11 for ; Thu, 30 Aug 2012 11:19:51 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-forwarded-to:x-forwarded-for:delivered-to:received-spf:from:to:cc :subject:date:message-id:x-mailer:in-reply-to:references :x-content-scanned:x-cbid:x-gm-message-state; bh=HUP9pc23KJJBMoW5AB2YkbFQ5PgxZKqc1QtGHPAzhoo=; b=dfl+aRgc/OX77C+F5lSb6emjL9E+r18PKwNXogLiqEDTZLjVsSJquN+6xbZU3P8cD2 K07WQLy5xwmRYwE4qwD5Q1siRVFQq89243XehLyqzWZ9O1WcuXOaFxZ2u0sgAmsPviVJ Br4uEwmBxA3gREUTW5fFgr8js4Mqgpt6nmw6RTZXVFhZ7FMhLcmWO/oS7pjIiAi/VAdn 8yVEY1hTa37YzmiL/Zm4K44Uwkd67TRdKCO8BwZI11bqNODhaqqiAFoEPzGBAQM9iHN8 n6cZctlLal35luscJPXpCPD0o8TpfQm0IG6jBLExSC3WrbarwW9sxUx7kZkHYhMQVS4l S+Sw== Received: by 10.50.7.212 with SMTP id l20mr1715037iga.43.1346350791825; Thu, 30 Aug 2012 11:19:51 -0700 (PDT) X-Forwarded-To: linaro-patchwork@canonical.com X-Forwarded-For: patch@linaro.org linaro-patchwork@canonical.com Delivered-To: patches@linaro.org Received: by 10.50.184.232 with SMTP id ex8csp24745igc; Thu, 30 Aug 2012 11:19:51 -0700 (PDT) Received: by 10.60.2.161 with SMTP id 1mr5555910oev.48.1346350791392; Thu, 30 Aug 2012 11:19:51 -0700 (PDT) Received: from e36.co.us.ibm.com (e36.co.us.ibm.com. [32.97.110.154]) by mx.google.com with ESMTPS id lr6si3095527obb.70.2012.08.30.11.19.51 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 30 Aug 2012 11:19:51 -0700 (PDT) Received-SPF: pass (google.com: domain of paulmck@linux.vnet.ibm.com designates 32.97.110.154 as permitted sender) client-ip=32.97.110.154; Authentication-Results: mx.google.com; spf=pass (google.com: domain of paulmck@linux.vnet.ibm.com designates 32.97.110.154 as permitted sender) smtp.mail=paulmck@linux.vnet.ibm.com Received: from /spool/local by e36.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Thu, 30 Aug 2012 12:19:41 -0600 Received: from d03dlp01.boulder.ibm.com (9.17.202.177) by e36.co.us.ibm.com (192.168.1.136) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Thu, 30 Aug 2012 12:19:15 -0600 Received: from d03relay03.boulder.ibm.com (d03relay03.boulder.ibm.com [9.17.195.228]) by d03dlp01.boulder.ibm.com (Postfix) with ESMTP id 178FC1FF0039 for ; Thu, 30 Aug 2012 12:19:11 -0600 (MDT) Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay03.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id q7UIIwS4100222 for ; Thu, 30 Aug 2012 12:18:58 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id q7UIIhQo019005 for ; Thu, 30 Aug 2012 12:18:52 -0600 Received: from paulmck-ThinkPad-W500 (sig-9-77-132-62.mts.ibm.com [9.77.132.62]) by d03av01.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVin) with ESMTP id q7UIIgJG018694; Thu, 30 Aug 2012 12:18:43 -0600 Received: by paulmck-ThinkPad-W500 (Postfix, from userid 1000) id CDC55EA837; Thu, 30 Aug 2012 11:18:40 -0700 (PDT) From: "Paul E. McKenney" To: linux-kernel@vger.kernel.org Cc: mingo@elte.hu, laijs@cn.fujitsu.com, dipankar@in.ibm.com, akpm@linux-foundation.org, mathieu.desnoyers@polymtl.ca, josh@joshtriplett.org, niv@us.ibm.com, tglx@linutronix.de, peterz@infradead.org, rostedt@goodmis.org, Valdis.Kletnieks@vt.edu, dhowells@redhat.com, eric.dumazet@gmail.com, darren@dvhart.com, fweisbec@gmail.com, sbw@mit.edu, patches@linaro.org, "Paul E. McKenney" Subject: [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race Date: Thu, 30 Aug 2012 11:18:32 -0700 Message-Id: <1346350718-30937-17-git-send-email-paulmck@linux.vnet.ibm.com> X-Mailer: git-send-email 1.7.8 In-Reply-To: <1346350718-30937-1-git-send-email-paulmck@linux.vnet.ibm.com> References: <20120830181811.GA29154@linux.vnet.ibm.com> <1346350718-30937-1-git-send-email-paulmck@linux.vnet.ibm.com> X-Content-Scanned: Fidelis XPS MAILER x-cbid: 12083018-7606-0000-0000-00000333B98D X-Gm-Message-State: ALoCoQnmPtMiVThwXhfBkEtFHQBUewjfPabCEHIcRcG99uIsbmUiIoXyYZVwav5sYAZu/aF1ytnT From: "Paul E. McKenney" The current approach to grace-period initialization is vulnerable to extremely low-probabity races. These races stem fro the fact that the old grace period is marked completed on the same traversal through the rcu_node structure that is marking the start of the new grace period. These races can result in too-short grace periods, as shown in the following scenario: 1. CPU 0 completes a grace period, but needs an additional grace period, so starts initializing one, initializing all the non-leaf rcu_node strcutures and the first leaf rcu_node structure. Because CPU 0 is both completing the old grace period and starting a new one, it marks the completion of the old grace period and the start of the new grace period in a single traversal of the rcu_node structures. Therefore, CPUs corresponding to the first rcu_node structure can become aware that the prior grace period has completed, but CPUs corresponding to the other rcu_node structures will see this same prior grace period as still being in progress. 2. CPU 1 passes through a quiescent state, and therefore informs the RCU core. Because its leaf rcu_node structure has already been initialized, this CPU's quiescent state is applied to the new (and only partially initialized) grace period. 3. CPU 1 enters an RCU read-side critical section and acquires a reference to data item A. Note that this critical section started after the beginning of the new grace period, and therefore will not block this new grace period. 4. CPU 16 exits dyntick-idle mode. Because it was in dyntick-idle mode, other CPUs informed the RCU core of its extended quiescent state for the past several grace periods. This means that CPU 16 is not yet aware that these past grace periods have ended. Assume that CPU 16 corresponds to the second leaf rcu_node structure. 5. CPU 16 removes data item A from its enclosing data structure and passes it to call_rcu(), which queues a callback in the RCU_NEXT_TAIL segment of the callback queue. 6. CPU 16 enters the RCU core, possibly because it has taken a scheduling-clock interrupt, or alternatively because it has more than 10,000 callbacks queued. It notes that the second most recent grace period has completed (recall that it cannot yet become aware that the most recent grace period has completed), and therefore advances its callbacks. The callback for data item A is therefore in the RCU_NEXT_READY_TAIL segment of the callback queue. 7. CPU 0 completes initialization of the remaining leaf rcu_node structures for the new grace period, including the structure corresponding to CPU 16. 8. CPU 16 again enters the RCU core, again, possibly because it has taken a scheduling-clock interrupt, or alternatively because it now has more than 10,000 callbacks queued. It notes that the most recent grace period has ended, and therefore advances its callbacks. The callback for data item A is therefore in the RCU_WAIT_TAIL segment of the callback queue. 9. All CPUs other than CPU 1 pass through quiescent states. Because CPU 1 already passed through its quiescent state, the new grace period completes. Note that CPU 1 is still in its RCU read-side critical section, still referencing data item A. 10. Suppose that CPU 2 wais the last CPU to pass through a quiescent state for the new grace period, and suppose further that CPU 2 did not have any callbacks queued, therefore not needing an additional grace period. CPU 2 therefore traverses all of the rcu_node structures, marking the new grace period as completed, but does not initialize a new grace period. 11. CPU 16 yet again enters the RCU core, yet again possibly because it has taken a scheduling-clock interrupt, or alternatively because it now has more than 10,000 callbacks queued. It notes that the new grace period has ended, and therefore advances its callbacks. The callback for data item A is therefore in the RCU_DONE_TAIL segment of the callback queue. This means that this callback is now considered ready to be invoked. 12. CPU 16 invokes the callback, freeing data item A while CPU 1 is still referencing it. This scenario represents a day-zero bug for TREE_RCU. This commit therefore ensures that the old grace period is marked completed in all leaf rcu_node structures before a new grace period is marked started in any of them. Signed-off-by: Paul E. McKenney Reviewed-by: Josh Triplett --- kernel/rcutree.c | 36 +++++++++++++----------------------- 1 files changed, 13 insertions(+), 23 deletions(-) diff --git a/kernel/rcutree.c b/kernel/rcutree.c index d435009..4cfe488 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -1161,33 +1161,23 @@ static void rcu_gp_cleanup(struct rcu_state *rsp) * they can do to advance the grace period. It is therefore * safe for us to drop the lock in order to mark the grace * period as completed in all of the rcu_node structures. - * - * But if this CPU needs another grace period, it will take - * care of this while initializing the next grace period. - * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL - * because the callbacks have not yet been advanced: Those - * callbacks are waiting on the grace period that just now - * completed. */ - rdp = this_cpu_ptr(rsp->rda); - if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) { - raw_spin_unlock_irqrestore(&rnp->lock, flags); + raw_spin_unlock_irqrestore(&rnp->lock, flags); - /* - * Propagate new ->completed value to rcu_node - * structures so that other CPUs don't have to - * wait until the start of the next grace period - * to process their callbacks. - */ - rcu_for_each_node_breadth_first(rsp, rnp) { - raw_spin_lock_irqsave(&rnp->lock, flags); - rnp->completed = rsp->gpnum; - raw_spin_unlock_irqrestore(&rnp->lock, flags); - cond_resched(); - } - rnp = rcu_get_root(rsp); + /* + * Propagate new ->completed value to rcu_node structures so + * that other CPUs don't have to wait until the start of the next + * grace period to process their callbacks. This also avoids + * some nasty RCU grace-period initialization races. + */ + rcu_for_each_node_breadth_first(rsp, rnp) { raw_spin_lock_irqsave(&rnp->lock, flags); + rnp->completed = rsp->gpnum; + raw_spin_unlock_irqrestore(&rnp->lock, flags); + cond_resched(); } + rnp = rcu_get_root(rsp); + raw_spin_lock_irqsave(&rnp->lock, flags); rsp->completed = rsp->gpnum; /* Declare grace period done. */ trace_rcu_grace_period(rsp->name, rsp->completed, "end");