From patchwork Thu Feb 11 09:26:59 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lee Jones X-Patchwork-Id: 380896 Delivered-To: patch@linaro.org Received: by 2002:a02:b18a:0:0:0:0:0 with SMTP id t10csp1909687jah; Thu, 11 Feb 2021 01:31:57 -0800 (PST) X-Google-Smtp-Source: ABdhPJx3TVfVzJI0Smkxrx6kh1ZUBB1I3e6WkJAiPNslNvbMTPTbFND6MdFyzTzF/X3VPvV7RUBG X-Received: by 2002:a50:bf47:: with SMTP id g7mr7330844edk.323.1613035917137; Thu, 11 Feb 2021 01:31:57 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1613035917; cv=none; d=google.com; s=arc-20160816; b=OYGma3p1+R33ZzSZDicrxToUBlNs306fixj6zCfp9VbL9OrZdm1jViuxjJsJcV1sZG WoCPCDoF/4dbRgtI3Vm1iCmbwFry36av1OPz596yUjJKw5oP6oR5d6kiPnqXTqIB/F5O 0xs1420SvVMAsiAlGCukGt+HPW6ydjbhYu0wwxZoNmPINSVR/yhDae194bXafy8nKSUI UWEywfNy24i4gnoM2K6nPLFuBJfrgtjaUgA1ekgZXECfOk6JQBMRJQmr2f2m3wbqC4Ws ch9bfEgj2kDlk/QyVFrBlWyzxoFlbY1jX9teXb2IVUI6F8D9+CBnn18roXoUQqcBs92E l8BQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=CxJjMXkMj/WyIEge7vE6JXU1t1YR/vT3MHOJ75WJKbg=; b=zL+K8kdOtCf3xFTVazwEiGY9wyoFHOf5w/Ofjp70rfblvFNZPXp8QC4NsUOzeNnCB0 fGvcgbODTUM/onJmZm2DUF7utiRsVxBDVvef9ehDSg1HBjBFIvZvauTTXOuuwTkPnwzA NgmzSzpqqya/6EAL8XP3t5TXNF07HuRNsoGhvbbWTM5EVQkfHe84hpFSdZufy3YjXchm tJMC0sN9Ka1SCSyOakO1Q/Dd+GdH5+pUySbvxbAlCGYuK0lYfxQu72hFkqgEHFCW2vsd h29EujBfUCIt8b2gamTlBIeDlrv2DfHb3IKnriFiYw8bIuG99vJ9WBKMHphbxbUS4AyK vQfQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=MP5Ab8cg; spf=pass (google.com: domain of stable-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=stable-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id t20si3394112eju.239.2021.02.11.01.31.56; Thu, 11 Feb 2021 01:31:57 -0800 (PST) Received-SPF: pass (google.com: domain of stable-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=MP5Ab8cg; spf=pass (google.com: domain of stable-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=stable-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230063AbhBKJaZ (ORCPT + 13 others); Thu, 11 Feb 2021 04:30:25 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60222 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230080AbhBKJ2O (ORCPT ); Thu, 11 Feb 2021 04:28:14 -0500 Received: from mail-wm1-x330.google.com (mail-wm1-x330.google.com [IPv6:2a00:1450:4864:20::330]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D9492C06178B for ; Thu, 11 Feb 2021 01:27:28 -0800 (PST) Received: by mail-wm1-x330.google.com with SMTP id w4so4775582wmi.4 for ; Thu, 11 Feb 2021 01:27:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=CxJjMXkMj/WyIEge7vE6JXU1t1YR/vT3MHOJ75WJKbg=; b=MP5Ab8cgakCP4eYLpA3NexSZkhYH29ZLk5DAd1LojwAt8he8j7sWlmhbtLRrJR3tjR J7Zp3E65m52drtm9s/if8lfD2KMahbAkr4+Czhnz2zcuYdSqFQ524hHI4bm5FoA+5KeN 0Y9yutoN9qQX9Cmvc5+DsCai5MEV8AuNTb0P8zW/cfS35oN/47gGQwLe3cZ2m6JjCH0Y Jgsx1cqfnn8J4PKPWC2yBIo94TnhXBK/lvYQmc2W3oqItex/xV19aYM9QMAGhVyo//UR SmK7a/1W3wDSJe7vDIIW55olJ63lzBCVj8ep1UUwApxZng5Htn2fxRtCEkWNizLaW4jx aFtg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=CxJjMXkMj/WyIEge7vE6JXU1t1YR/vT3MHOJ75WJKbg=; b=Uy9usiQ7g+PT1nHnX4dar6RI8SYWBS8g1X8MJtWLLb1zQW5gaQf+qhd8kpU6yjyboW tAcUr+vEU8+9L9biHow2WnXwJInxsueu97YYFjumFF/icsSuOFbUQuINkNBD5UOdkL9o DA2ZZQ3ZczA31jh6OfAoy3djpC2qyhog2VKXOjZcqhcoZet8Zgbg9rASA9/N8+/yEyav K6BRYQPfMJQdPzCWbGv48EWuhf+m6EtaOrGmc4MVyeOEsv1X61HeRgoP5X8HgPZelg+0 UNx4OfED1O17rQT7a7stPTovj2swZzWCSMlgG2Sy21spxH9+dwSL96e7sg+o1veY++Qs cRaw== X-Gm-Message-State: AOAM531uqHVnnST28UtQ7x4W6WaBLJCCvfDhIyEyZi0Spu8+epH++UMT 1OUicHpGl6WpirA9jVCSzL2T6Ls+o5FKhw== X-Received: by 2002:a7b:c193:: with SMTP id y19mr4132390wmi.23.1613035647285; Thu, 11 Feb 2021 01:27:27 -0800 (PST) Received: from dell.default ([91.110.221.159]) by smtp.gmail.com with ESMTPSA id y5sm8335124wmi.10.2021.02.11.01.27.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 11 Feb 2021 01:27:26 -0800 (PST) From: Lee Jones To: stable@vger.kernel.org Cc: zhengyejian@foxmail.com, Peter Zijlstra , juri.lelli@arm.com, bigeasy@linutronix.de, xlpang@redhat.com, rostedt@goodmis.org, mathieu.desnoyers@efficios.com, jdesfossez@efficios.com, dvhart@infradead.org, bristot@redhat.com, Thomas Gleixner , Lee Jones Subject: [PATCH 2/3] futex: Change locking rules Date: Thu, 11 Feb 2021 09:26:59 +0000 Message-Id: <20210211092700.11772-3-lee.jones@linaro.org> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20210211092700.11772-1-lee.jones@linaro.org> References: <20210211092700.11772-1-lee.jones@linaro.org> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Peter Zijlstra Currently futex-pi relies on hb->lock to serialize everything. But hb->lock creates another set of problems, especially priority inversions on RT where hb->lock becomes a rt_mutex itself. The rt_mutex::wait_lock is the most obvious protection for keeping the futex user space value and the kernel internal pi_state in sync. Rework and document the locking so rt_mutex::wait_lock is held accross all operations which modify the user space value and the pi state. This allows to invoke rt_mutex_unlock() (including deboost) without holding hb->lock as a next step. Nothing yet relies on the new locking rules. Signed-off-by: Peter Zijlstra (Intel) Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org Signed-off-by: Thomas Gleixner [Lee: Back-ported in support of a previous futex back-port attempt] Signed-off-by: Lee Jones --- kernel/futex.c | 138 +++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 112 insertions(+), 26 deletions(-) -- 2.25.1 diff --git a/kernel/futex.c b/kernel/futex.c index a43cf67c2fe91..829e897c8883b 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1019,6 +1019,39 @@ static void exit_pi_state_list(struct task_struct *curr) * [10] There is no transient state which leaves owner and user space * TID out of sync. Except one error case where the kernel is denied * write access to the user address, see fixup_pi_state_owner(). + * + * + * Serialization and lifetime rules: + * + * hb->lock: + * + * hb -> futex_q, relation + * futex_q -> pi_state, relation + * + * (cannot be raw because hb can contain arbitrary amount + * of futex_q's) + * + * pi_mutex->wait_lock: + * + * {uval, pi_state} + * + * (and pi_mutex 'obviously') + * + * p->pi_lock: + * + * p->pi_state_list -> pi_state->list, relation + * + * pi_state->refcount: + * + * pi_state lifetime + * + * + * Lock order: + * + * hb->lock + * pi_mutex->wait_lock + * p->pi_lock + * */ /* @@ -1026,10 +1059,12 @@ static void exit_pi_state_list(struct task_struct *curr) * the pi_state against the user space value. If correct, attach to * it. */ -static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, +static int attach_to_pi_state(u32 __user *uaddr, u32 uval, + struct futex_pi_state *pi_state, struct futex_pi_state **ps) { pid_t pid = uval & FUTEX_TID_MASK; + int ret, uval2; /* * Userspace might have messed up non-PI and PI futexes [3] @@ -1037,8 +1072,33 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, if (unlikely(!pi_state)) return -EINVAL; + /* + * We get here with hb->lock held, and having found a + * futex_top_waiter(). This means that futex_lock_pi() of said futex_q + * has dropped the hb->lock in between queue_me() and unqueue_me_pi(), + * which in turn means that futex_lock_pi() still has a reference on + * our pi_state. + */ WARN_ON(!atomic_read(&pi_state->refcount)); + /* + * Now that we have a pi_state, we can acquire wait_lock + * and do the state validation. + */ + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); + + /* + * Since {uval, pi_state} is serialized by wait_lock, and our current + * uval was read without holding it, it can have changed. Verify it + * still is what we expect it to be, otherwise retry the entire + * operation. + */ + if (get_futex_value_locked(&uval2, uaddr)) + goto out_efault; + + if (uval != uval2) + goto out_eagain; + /* * Handle the owner died case: */ @@ -1054,11 +1114,11 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, * is not 0. Inconsistent state. [5] */ if (pid) - return -EINVAL; + goto out_einval; /* * Take a ref on the state and return success. [4] */ - goto out_state; + goto out_attach; } /* @@ -1070,14 +1130,14 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, * Take a ref on the state and return success. [6] */ if (!pid) - goto out_state; + goto out_attach; } else { /* * If the owner died bit is not set, then the pi_state * must have an owner. [7] */ if (!pi_state->owner) - return -EINVAL; + goto out_einval; } /* @@ -1086,11 +1146,29 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state, * user space TID. [9/10] */ if (pid != task_pid_vnr(pi_state->owner)) - return -EINVAL; -out_state: + goto out_einval; + +out_attach: atomic_inc(&pi_state->refcount); + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); *ps = pi_state; return 0; + +out_einval: + ret = -EINVAL; + goto out_error; + +out_eagain: + ret = -EAGAIN; + goto out_error; + +out_efault: + ret = -EFAULT; + goto out_error; + +out_error: + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); + return ret; } /** @@ -1183,6 +1261,9 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key, /* * No existing pi state. First waiter. [2] + * + * This creates pi_state, we have hb->lock held, this means nothing can + * observe this state, wait_lock is irrelevant. */ pi_state = alloc_pi_state(); @@ -1207,7 +1288,8 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key, return 0; } -static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, +static int lookup_pi_state(u32 __user *uaddr, u32 uval, + struct futex_hash_bucket *hb, union futex_key *key, struct futex_pi_state **ps, struct task_struct **exiting) { @@ -1218,7 +1300,7 @@ static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb, * attach to the pi_state when the validation succeeds. */ if (match) - return attach_to_pi_state(uval, match->pi_state, ps); + return attach_to_pi_state(uaddr, uval, match->pi_state, ps); /* * We are the first waiter - try to look up the owner based on @@ -1237,7 +1319,7 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval) if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))) return -EFAULT; - /*If user space value changed, let the caller retry */ + /* If user space value changed, let the caller retry */ return curval != uval ? -EAGAIN : 0; } @@ -1301,7 +1383,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb, */ match = futex_top_waiter(hb, key); if (match) - return attach_to_pi_state(uval, match->pi_state, ps); + return attach_to_pi_state(uaddr, uval, match->pi_state, ps); /* * No waiter and user TID is 0. We are here because the @@ -1441,6 +1523,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this, if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) { ret = -EFAULT; + } else if (curval != uval) { /* * If a unconditional UNLOCK_PI operation (user space did not @@ -1977,7 +2060,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, * If that call succeeds then we have pi_state and an * initial refcount on it. */ - ret = lookup_pi_state(ret, hb2, &key2, + ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state, &exiting); } @@ -2282,7 +2365,6 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, int err = 0; oldowner = pi_state->owner; - /* Owner died? */ if (!pi_state->owner) newtid |= FUTEX_OWNER_DIED; @@ -2305,11 +2387,10 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, * because we can fault here. Imagine swapped out pages or a fork * that marked all the anonymous memory readonly for cow. * - * Modifying pi_state _before_ the user space value would - * leave the pi_state in an inconsistent state when we fault - * here, because we need to drop the hash bucket lock to - * handle the fault. This might be observed in the PID check - * in lookup_pi_state. + * Modifying pi_state _before_ the user space value would leave the + * pi_state in an inconsistent state when we fault here, because we + * need to drop the locks to handle the fault. This might be observed + * in the PID check in lookup_pi_state. */ retry: if (!argowner) { @@ -2367,21 +2448,26 @@ static int __fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, return argowner == current; /* - * To handle the page fault we need to drop the hash bucket - * lock here. That gives the other task (either the highest priority - * waiter itself or the task which stole the rtmutex) the - * chance to try the fixup of the pi_state. So once we are - * back from handling the fault we need to check the pi_state - * after reacquiring the hash bucket lock and before trying to - * do another fixup. When the fixup has been done already we - * simply return. + * To handle the page fault we need to drop the locks here. That gives + * the other task (either the highest priority waiter itself or the + * task which stole the rtmutex) the chance to try the fixup of the + * pi_state. So once we are back from handling the fault we need to + * check the pi_state after reacquiring the locks and before trying to + * do another fixup. When the fixup has been done already we simply + * return. + * + * Note: we hold both hb->lock and pi_mutex->wait_lock. We can safely + * drop hb->lock since the caller owns the hb -> futex_q relation. + * Dropping the pi_mutex->wait_lock requires the state revalidate. */ handle_fault: + raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); spin_unlock(q->lock_ptr); err = fault_in_user_writeable(uaddr); spin_lock(q->lock_ptr); + raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); /* * Check if someone else fixed it for us: