From patchwork Thu Apr 21 15:02:49 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Peter Zijlstra X-Patchwork-Id: 565555 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6F31EC433EF for ; Thu, 21 Apr 2022 15:10:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1389913AbiDUPNj (ORCPT ); Thu, 21 Apr 2022 11:13:39 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46928 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1389885AbiDUPNf (ORCPT ); Thu, 21 Apr 2022 11:13:35 -0400 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2392B2F00D; Thu, 21 Apr 2022 08:10:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Content-Type:MIME-Version:References: Subject:Cc:To:From:Date:Message-ID:Sender:Reply-To:Content-Transfer-Encoding: Content-ID:Content-Description:In-Reply-To; bh=oF+PVY6vIU4SUP3fyaaG3nUd44JsjVm4tT8uFNkBQgY=; b=NVN13hy2D2IC+wwCb5CK1GDrqQ I4LEzJzh5Nen6kEHH5tL1QNf4SrCEBth4RSqQmzHvOFoeAFZBM/Em9iDKEr6KWNbao8tC49AizD2F 2G3M0L6a7zM4pIv9Q9H3H9zTOAlE5PfN6SmdKI3ZR7FYINdZSIyGEn2I9UUjTGonMHBhZNPeS3PE3 CHRxH5l/YC640b79WAICyCAia68d3Go6wl44Oj5tyJz0nEC0gufG5Twi9r5kI5ct/rs3ifaVywB8l ascBhlJmjwFIJUBZkNM4Dv09b9bVj7vlLlHhQx6ZrCfQJP1V7Di2yo1tOiLLkRoa9l2SiseahAday 5zinzkhA==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1nhYRk-007QKP-3l; Thu, 21 Apr 2022 15:10:16 +0000 Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id 98C8A30066D; Thu, 21 Apr 2022 17:10:13 +0200 (CEST) Received: by hirez.programming.kicks-ass.net (Postfix, from userid 0) id 65FDF2B34A737; Thu, 21 Apr 2022 17:10:13 +0200 (CEST) Message-ID: <20220421150654.757693825@infradead.org> User-Agent: quilt/0.66 Date: Thu, 21 Apr 2022 17:02:49 +0200 From: Peter Zijlstra To: rjw@rjwysocki.net, oleg@redhat.com, mingo@kernel.org, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, mgorman@suse.de, ebiederm@xmission.com, bigeasy@linutronix.de, Will Deacon Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, tj@kernel.org, linux-pm@vger.kernel.org Subject: [PATCH v2 1/5] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state References: <20220421150248.667412396@infradead.org> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org Currently ptrace_stop() / do_signal_stop() rely on the special states TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this state exists only in task->__state and nowhere else. There's two spots of bother with this: - PREEMPT_RT has task->saved_state which complicates matters, meaning task_is_{traced,stopped}() needs to check an additional variable. - An alternative freezer implementation that itself relies on a special TASK state would loose TASK_TRACED/TASK_STOPPED and will result in misbehaviour. As such, add additional state to task->jobctl to track this state outside of task->__state. NOTE: this doesn't actually fix anything yet, just adds extra state. Signed-off-by: Peter Zijlstra (Intel) --- include/linux/sched.h | 8 +++----- include/linux/sched/jobctl.h | 6 ++++++ include/linux/sched/signal.h | 5 ++++- kernel/ptrace.c | 26 +++++++++++++++----------- kernel/signal.c | 16 ++++++++++++---- 5 files changed, 40 insertions(+), 21 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -118,11 +118,9 @@ struct task_group; #define task_is_running(task) (READ_ONCE((task)->__state) == TASK_RUNNING) -#define task_is_traced(task) ((READ_ONCE(task->__state) & __TASK_TRACED) != 0) - -#define task_is_stopped(task) ((READ_ONCE(task->__state) & __TASK_STOPPED) != 0) - -#define task_is_stopped_or_traced(task) ((READ_ONCE(task->__state) & (__TASK_STOPPED | __TASK_TRACED)) != 0) +#define task_is_traced(task) ((READ_ONCE(task->jobctl) & JOBCTL_TRACED) != 0) +#define task_is_stopped(task) ((READ_ONCE(task->jobctl) & JOBCTL_STOPPED) != 0) +#define task_is_stopped_or_traced(task) ((READ_ONCE(task->jobctl) & (JOBCTL_STOPPED | JOBCTL_TRACED)) != 0) /* * Special states are those that do not use the normal wait-loop pattern. See --- a/include/linux/sched/jobctl.h +++ b/include/linux/sched/jobctl.h @@ -20,6 +20,9 @@ struct task_struct; #define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */ #define JOBCTL_TRAP_FREEZE_BIT 23 /* trap for cgroup freezer */ +#define JOBCTL_STOPPED_BIT 24 /* do_signal_stop() */ +#define JOBCTL_TRACED_BIT 25 /* ptrace_stop() */ + #define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT) #define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT) #define JOBCTL_STOP_CONSUME (1UL << JOBCTL_STOP_CONSUME_BIT) @@ -29,6 +32,9 @@ struct task_struct; #define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT) #define JOBCTL_TRAP_FREEZE (1UL << JOBCTL_TRAP_FREEZE_BIT) +#define JOBCTL_STOPPED (1UL << JOBCTL_STOPPED_BIT) +#define JOBCTL_TRACED (1UL << JOBCTL_TRACED_BIT) + #define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY) #define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK) --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -294,8 +294,10 @@ static inline int kernel_dequeue_signal( static inline void kernel_signal_stop(void) { spin_lock_irq(¤t->sighand->siglock); - if (current->jobctl & JOBCTL_STOP_DEQUEUED) + if (current->jobctl & JOBCTL_STOP_DEQUEUED) { + current->jobctl |= JOBCTL_STOPPED; set_special_state(TASK_STOPPED); + } spin_unlock_irq(¤t->sighand->siglock); schedule(); @@ -439,6 +441,7 @@ static inline void signal_wake_up(struct { signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0); } + static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume) { signal_wake_up_state(t, resume ? __TASK_TRACED : 0); --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -185,7 +185,12 @@ static bool looks_like_a_spurious_pid(st return true; } -/* Ensure that nothing can wake it up, even SIGKILL */ +/* + * Ensure that nothing can wake it up, even SIGKILL + * + * A task is switched to this state while a ptrace operation is in progress; + * such that the ptrace operation is uninterruptible. + */ static bool ptrace_freeze_traced(struct task_struct *task) { bool ret = false; @@ -218,9 +223,10 @@ static void ptrace_unfreeze_traced(struc */ spin_lock_irq(&task->sighand->siglock); if (READ_ONCE(task->__state) == __TASK_TRACED) { - if (__fatal_signal_pending(task)) + if (__fatal_signal_pending(task)) { + task->jobctl &= ~JOBCTL_TRACED; wake_up_state(task, __TASK_TRACED); - else + } else WRITE_ONCE(task->__state, TASK_TRACED); } spin_unlock_irq(&task->sighand->siglock); @@ -475,8 +481,10 @@ static int ptrace_attach(struct task_str * in and out of STOPPED are protected by siglock. */ if (task_is_stopped(task) && - task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING)) + task_set_jobctl_pending(task, JOBCTL_TRAP_STOP | JOBCTL_TRAPPING)) { + task->jobctl &= ~JOBCTL_STOPPED; signal_wake_up_state(task, __TASK_STOPPED); + } spin_unlock(&task->sighand->siglock); @@ -850,8 +858,6 @@ static long ptrace_get_rseq_configuratio static int ptrace_resume(struct task_struct *child, long request, unsigned long data) { - bool need_siglock; - if (!valid_signal(data)) return -EIO; @@ -892,13 +898,11 @@ static int ptrace_resume(struct task_str * status and clears the code too; this can't race with the tracee, it * takes siglock after resume. */ - need_siglock = data && !thread_group_empty(current); - if (need_siglock) - spin_lock_irq(&child->sighand->siglock); + spin_lock_irq(&child->sighand->siglock); child->exit_code = data; + child->jobctl &= ~JOBCTL_TRACED; wake_up_state(child, __TASK_TRACED); - if (need_siglock) - spin_unlock_irq(&child->sighand->siglock); + spin_unlock_irq(&child->sighand->siglock); return 0; } --- a/kernel/signal.c +++ b/kernel/signal.c @@ -762,7 +762,10 @@ static int dequeue_synchronous_signal(ke */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { + lockdep_assert_held(&t->sighand->siglock); + set_tsk_thread_flag(t, TIF_SIGPENDING); + /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable * case. We don't check t->state here because there is a race with it @@ -770,7 +773,9 @@ void signal_wake_up_state(struct task_st * By using wake_up_state, we ensure the process will wake up and * handle its death signal. */ - if (!wake_up_state(t, state | TASK_INTERRUPTIBLE)) + if (wake_up_state(t, state | TASK_INTERRUPTIBLE)) + t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED); + else kick_process(t); } @@ -884,7 +889,7 @@ static int check_kill_permission(int sig static void ptrace_trap_notify(struct task_struct *t) { WARN_ON_ONCE(!(t->ptrace & PT_SEIZED)); - assert_spin_locked(&t->sighand->siglock); + lockdep_assert_held(&t->sighand->siglock); task_set_jobctl_pending(t, JOBCTL_TRAP_NOTIFY); ptrace_signal_wake_up(t, t->jobctl & JOBCTL_LISTENING); @@ -930,9 +935,10 @@ static bool prepare_signal(int sig, stru for_each_thread(p, t) { flush_sigqueue_mask(&flush, &t->pending); task_clear_jobctl_pending(t, JOBCTL_STOP_PENDING); - if (likely(!(t->ptrace & PT_SEIZED))) + if (likely(!(t->ptrace & PT_SEIZED))) { + t->jobctl &= ~JOBCTL_STOPPED; wake_up_state(t, __TASK_STOPPED); - else + } else ptrace_trap_notify(t); } @@ -2219,6 +2225,7 @@ static int ptrace_stop(int exit_code, in * schedule() will not sleep if there is a pending signal that * can awaken the task. */ + current->jobctl |= JOBCTL_TRACED; set_special_state(TASK_TRACED); /* @@ -2460,6 +2467,7 @@ static bool do_signal_stop(int signr) if (task_participate_group_stop(current)) notify = CLD_STOPPED; + current->jobctl |= JOBCTL_STOPPED; set_special_state(TASK_STOPPED); spin_unlock_irq(¤t->sighand->siglock); From patchwork Thu Apr 21 15:02:50 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Peter Zijlstra X-Patchwork-Id: 565556 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CBA2CC433FE for ; Thu, 21 Apr 2022 15:10:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1389909AbiDUPNi (ORCPT ); Thu, 21 Apr 2022 11:13:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46930 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1389881AbiDUPNf (ORCPT ); Thu, 21 Apr 2022 11:13:35 -0400 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 239AB2F3B1; Thu, 21 Apr 2022 08:10:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Content-Type:MIME-Version:References: Subject:Cc:To:From:Date:Message-ID:Sender:Reply-To:Content-Transfer-Encoding: Content-ID:Content-Description:In-Reply-To; bh=uC9Q8gMKi0sDdYgcl7ndC55xzbrgbUPkwPLGz7U5puw=; b=q7ayW9MyHkJ95jurnDkIA6E4Qb J7ivj8vvxrehJbVXoyu9Hrcr+NGh5/pqu1fi+U0Eu6wAxRByLietVaVpTYGcPF9RVuiRFUlvMc2CK z3Yxp1gFzxjhMkrh4fYmYcjmj7laVidQLCkGhbtxGWvdIKNLUQCACap4Nh+Y9b/HNi0CJ7yhJKDNr MOesr/hw4P3VVryqHt9Bj6DCb8jKFf+ZRNHwN5xFgPr0BGuqBiy0L7wpB5nIdYx+s1MkrbrJM2Ert jLCRdAvo6hZfY0G9Ki/cCuunbC3fW90vFx3wvuDRDCWTY7h4n7MXMDFgEqS/i7QwV6K0g61nbLqaL 7GNn7Peg==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1nhYRk-007QKO-4c; Thu, 21 Apr 2022 15:10:16 +0000 Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (Client did not present a certificate) by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id 971BA30040C; Thu, 21 Apr 2022 17:10:13 +0200 (CEST) Received: by hirez.programming.kicks-ass.net (Postfix, from userid 0) id 6C9C52B553DEF; Thu, 21 Apr 2022 17:10:13 +0200 (CEST) Message-ID: <20220421150654.817117821@infradead.org> User-Agent: quilt/0.66 Date: Thu, 21 Apr 2022 17:02:50 +0200 From: Peter Zijlstra To: rjw@rjwysocki.net, oleg@redhat.com, mingo@kernel.org, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, mgorman@suse.de, ebiederm@xmission.com, bigeasy@linutronix.de, Will Deacon Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, tj@kernel.org, linux-pm@vger.kernel.org Subject: [PATCH v2 2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT References: <20220421150248.667412396@infradead.org> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org Rework ptrace_check_attach() / ptrace_unfreeze_traced() to not rely on task->__state as much. Due to how PREEMPT_RT is changing the rules vs task->__state with the introduction of task->saved_state while TASK_RTLOCK_WAIT (the whole blocking spinlock thing), the way ptrace freeze tries to do things no longer works. Specifically there are two problems: - due to ->saved_state, the ->__state modification removing TASK_WAKEKILL no longer works reliably. - due to ->saved_state, wait_task_inactive() also no longer works reliably. The first problem is solved by a suggestion from Eric that instead of changing __state, TASK_WAKEKILL be delayed. The second problem is solved by a suggestion from Oleg; add JOBCTL_TRACED_QUIESCE to cover the chunk of code between set_current_state(TASK_TRACED) and schedule(), such that ptrace_check_attach() can first wait for JOBCTL_TRACED_QUIESCE to get cleared, and then use wait_task_inactive(). Suggested-by: Oleg Nesterov Suggested-by: "Eric W. Biederman" Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Peter Zijlstra (Intel) --- include/linux/sched/jobctl.h | 8 ++- kernel/ptrace.c | 90 ++++++++++++++++++++++--------------------- kernel/sched/core.c | 5 -- kernel/signal.c | 36 ++++++++++++++--- 4 files changed, 86 insertions(+), 53 deletions(-) --- a/include/linux/sched/jobctl.h +++ b/include/linux/sched/jobctl.h @@ -19,9 +19,11 @@ struct task_struct; #define JOBCTL_TRAPPING_BIT 21 /* switching to TRACED */ #define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */ #define JOBCTL_TRAP_FREEZE_BIT 23 /* trap for cgroup freezer */ +#define JOBCTL_DELAY_WAKEKILL_BIT 24 /* delay killable wakeups */ -#define JOBCTL_STOPPED_BIT 24 /* do_signal_stop() */ -#define JOBCTL_TRACED_BIT 25 /* ptrace_stop() */ +#define JOBCTL_STOPPED_BIT 25 /* do_signal_stop() */ +#define JOBCTL_TRACED_BIT 26 /* ptrace_stop() */ +#define JOBCTL_TRACED_QUIESCE_BIT 27 #define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT) #define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT) @@ -31,9 +33,11 @@ struct task_struct; #define JOBCTL_TRAPPING (1UL << JOBCTL_TRAPPING_BIT) #define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT) #define JOBCTL_TRAP_FREEZE (1UL << JOBCTL_TRAP_FREEZE_BIT) +#define JOBCTL_DELAY_WAKEKILL (1UL << JOBCTL_DELAY_WAKEKILL_BIT) #define JOBCTL_STOPPED (1UL << JOBCTL_STOPPED_BIT) #define JOBCTL_TRACED (1UL << JOBCTL_TRACED_BIT) +#define JOBCTL_TRACED_QUIESCE (1UL << JOBCTL_TRACED_QUIESCE_BIT) #define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY) #define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK) --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -193,41 +193,44 @@ static bool looks_like_a_spurious_pid(st */ static bool ptrace_freeze_traced(struct task_struct *task) { + unsigned long flags; bool ret = false; /* Lockless, nobody but us can set this flag */ if (task->jobctl & JOBCTL_LISTENING) return ret; - spin_lock_irq(&task->sighand->siglock); - if (task_is_traced(task) && !looks_like_a_spurious_pid(task) && + if (!lock_task_sighand(task, &flags)) + return ret; + + if (task_is_traced(task) && + !looks_like_a_spurious_pid(task) && !__fatal_signal_pending(task)) { - WRITE_ONCE(task->__state, __TASK_TRACED); + WARN_ON_ONCE(READ_ONCE(task->__state) != TASK_TRACED); + WARN_ON_ONCE(task->jobctl & JOBCTL_DELAY_WAKEKILL); + task->jobctl |= JOBCTL_DELAY_WAKEKILL; ret = true; } - spin_unlock_irq(&task->sighand->siglock); + unlock_task_sighand(task, &flags); return ret; } static void ptrace_unfreeze_traced(struct task_struct *task) { - if (READ_ONCE(task->__state) != __TASK_TRACED) + if (!task_is_traced(task)) return; WARN_ON(!task->ptrace || task->parent != current); - /* - * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely. - * Recheck state under the lock to close this race. - */ spin_lock_irq(&task->sighand->siglock); - if (READ_ONCE(task->__state) == __TASK_TRACED) { + if (task_is_traced(task)) { +// WARN_ON_ONCE(!(task->jobctl & JOBCTL_DELAY_WAKEKILL)); + task->jobctl &= ~JOBCTL_DELAY_WAKEKILL; if (__fatal_signal_pending(task)) { task->jobctl &= ~JOBCTL_TRACED; - wake_up_state(task, __TASK_TRACED); - } else - WRITE_ONCE(task->__state, TASK_TRACED); + wake_up_state(task, TASK_WAKEKILL); + } } spin_unlock_irq(&task->sighand->siglock); } @@ -251,40 +254,45 @@ static void ptrace_unfreeze_traced(struc */ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) { - int ret = -ESRCH; + int traced; /* * We take the read lock around doing both checks to close a - * possible race where someone else was tracing our child and - * detached between these two checks. After this locked check, - * we are sure that this is our traced child and that can only - * be changed by us so it's not changing right after this. + * possible race where someone else attaches or detaches our + * natural child. */ read_lock(&tasklist_lock); - if (child->ptrace && child->parent == current) { - WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED); - /* - * child->sighand can't be NULL, release_task() - * does ptrace_unlink() before __exit_signal(). - */ - if (ignore_state || ptrace_freeze_traced(child)) - ret = 0; - } + traced = child->ptrace && child->parent == current; read_unlock(&tasklist_lock); + if (!traced) + return -ESRCH; - if (!ret && !ignore_state) { - if (!wait_task_inactive(child, __TASK_TRACED)) { - /* - * This can only happen if may_ptrace_stop() fails and - * ptrace_stop() changes ->state back to TASK_RUNNING, - * so we should not worry about leaking __TASK_TRACED. - */ - WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED); - ret = -ESRCH; - } + if (ignore_state) + return 0; + + if (!task_is_traced(child)) + return -ESRCH; + + WARN_ON_ONCE(READ_ONCE(child->jobctl) & JOBCTL_DELAY_WAKEKILL); + + /* Wait for JOBCTL_TRACED_QUIESCE to go away, see ptrace_stop(). */ + for (;;) { + if (fatal_signal_pending(current)) + return -EINTR; + + set_current_state(TASK_KILLABLE); + if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_QUIESCE)) + break; + + schedule(); } + __set_current_state(TASK_RUNNING); - return ret; + if (!wait_task_inactive(child, TASK_TRACED) || + !ptrace_freeze_traced(child)) + return -ESRCH; + + return 0; } static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode) @@ -1329,8 +1337,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l goto out_put_task_struct; ret = arch_ptrace(child, request, addr, data); - if (ret || request != PTRACE_DETACH) - ptrace_unfreeze_traced(child); + ptrace_unfreeze_traced(child); out_put_task_struct: put_task_struct(child); @@ -1472,8 +1479,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_lo request == PTRACE_INTERRUPT); if (!ret) { ret = compat_arch_ptrace(child, request, addr, data); - if (ret || request != PTRACE_DETACH) - ptrace_unfreeze_traced(child); + ptrace_unfreeze_traced(child); } out_put_task_struct: --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6310,10 +6310,7 @@ static void __sched notrace __schedule(u /* * We must load prev->state once (task_struct::state is volatile), such - * that: - * - * - we form a control dependency vs deactivate_task() below. - * - ptrace_{,un}freeze_traced() can change ->state underneath us. + * that we form a control dependency vs deactivate_task() below. */ prev_state = READ_ONCE(prev->__state); if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) { --- a/kernel/signal.c +++ b/kernel/signal.c @@ -764,6 +764,10 @@ void signal_wake_up_state(struct task_st { lockdep_assert_held(&t->sighand->siglock); + /* Suppress wakekill? */ + if (t->jobctl & JOBCTL_DELAY_WAKEKILL) + state &= ~TASK_WAKEKILL; + set_tsk_thread_flag(t, TIF_SIGPENDING); /* @@ -774,7 +778,7 @@ void signal_wake_up_state(struct task_st * handle its death signal. */ if (wake_up_state(t, state | TASK_INTERRUPTIBLE)) - t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED); + t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE); else kick_process(t); } @@ -2187,6 +2191,15 @@ static void do_notify_parent_cldstop(str spin_unlock_irqrestore(&sighand->siglock, flags); } +static void clear_traced_quiesce(void) +{ + spin_lock_irq(¤t->sighand->siglock); + WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE)); + current->jobctl &= ~JOBCTL_TRACED_QUIESCE; + wake_up_state(current->parent, TASK_KILLABLE); + spin_unlock_irq(¤t->sighand->siglock); +} + /* * This must be called with current->sighand->siglock held. * @@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in * schedule() will not sleep if there is a pending signal that * can awaken the task. */ - current->jobctl |= JOBCTL_TRACED; + current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE; set_special_state(TASK_TRACED); /* @@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in /* * Don't want to allow preemption here, because * sys_ptrace() needs this task to be inactive. - * - * XXX: implement read_unlock_no_resched(). */ preempt_disable(); read_unlock(&tasklist_lock); - cgroup_enter_frozen(); + cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!! + + /* + * JOBCTL_TRACE_QUIESCE bridges the gap between + * set_current_state(TASK_TRACED) above and schedule() below. + * There must not be any blocking (specifically anything that + * touched ->saved_state on PREEMPT_RT) between here and + * schedule(). + * + * ptrace_check_attach() relies on this with its + * wait_task_inactive() usage. + */ + clear_traced_quiesce(); + preempt_enable_no_resched(); freezable_schedule(); + cgroup_leave_frozen(true); } else { /* @@ -2335,6 +2360,7 @@ static int ptrace_stop(int exit_code, in /* LISTENING can be set only during STOP traps, clear it */ current->jobctl &= ~JOBCTL_LISTENING; + current->jobctl &= ~JOBCTL_DELAY_WAKEKILL; /* * Queued signals ignored us while we were stopped for tracing. From patchwork Thu Apr 21 15:02:51 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Peter Zijlstra X-Patchwork-Id: 564702 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AB0ACC4332F for ; Thu, 21 Apr 2022 15:10:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1389910AbiDUPNi (ORCPT ); Thu, 21 Apr 2022 11:13:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46938 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1389883AbiDUPNf (ORCPT ); Thu, 21 Apr 2022 11:13:35 -0400 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 243D93E0F4; Thu, 21 Apr 2022 08:10:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Content-Type:MIME-Version:References: Subject:Cc:To:From:Date:Message-ID:Sender:Reply-To:Content-Transfer-Encoding: Content-ID:Content-Description:In-Reply-To; bh=bZGwtg721/vos0v+4uWN1TQ12j+Iplb0bq7iDFHye/U=; b=CNJ9qoSVzEA3wjShMAZfbYMINP JDXosMX0wuWfG3faLbx7Qj89bwjkqLHHBT+Ut7AxeTBVisC3KXEKilhwhDMFIASY5CgmcevAVqbSL uPHwSjaMcu5Y2h7oIZS86wusSqXlZDZ4k32/Xgr4cA/u0W5CCdp+dcGujNFaVmvKhwUR6T+Qs/gLy KTRoWKxHyNf2AXMtTRQs756eXD2ZgjJqBHOXxh1Br8QKKUt4vBjv4t5IWBLGjcq9Xkn5Ld2ZBQic0 5t6NRBxIZdZ7J97ASTNCIcRvUfGwb4I9EoRR7GU5RRC25eOakDmeg240kBBvQEnGFBa776O2BXL1m G34eWlPA==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1nhYRk-007QKS-4a; Thu, 21 Apr 2022 15:10:16 +0000 Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (Client did not present a certificate) by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id 9C159300750; Thu, 21 Apr 2022 17:10:13 +0200 (CEST) Received: by hirez.programming.kicks-ass.net (Postfix, from userid 0) id 740C82B553DF1; Thu, 21 Apr 2022 17:10:13 +0200 (CEST) Message-ID: <20220421150654.879033968@infradead.org> User-Agent: quilt/0.66 Date: Thu, 21 Apr 2022 17:02:51 +0200 From: Peter Zijlstra To: rjw@rjwysocki.net, oleg@redhat.com, mingo@kernel.org, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, mgorman@suse.de, ebiederm@xmission.com, bigeasy@linutronix.de, Will Deacon Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, tj@kernel.org, linux-pm@vger.kernel.org Subject: [PATCH v2 3/5] freezer: Have {,un}lock_system_sleep() save/restore flags References: <20220421150248.667412396@infradead.org> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org Rafael explained that the reason for having both PF_NOFREEZE and PF_FREEZER_SKIP is that {,un}lock_system_sleep() is callable from kthread context that has previously called set_freezable(). In preparation of merging the flags, have {,un}lock_system_slee() save and restore current->flags. Signed-off-by: Peter Zijlstra (Intel) --- drivers/acpi/x86/s2idle.c | 12 ++++++++---- drivers/scsi/scsi_transport_spi.c | 7 ++++--- include/linux/suspend.h | 8 ++++---- kernel/power/hibernate.c | 35 ++++++++++++++++++++++------------- kernel/power/main.c | 16 ++++++++++------ kernel/power/suspend.c | 12 ++++++++---- kernel/power/user.c | 24 ++++++++++++++---------- 7 files changed, 70 insertions(+), 44 deletions(-) --- a/drivers/acpi/x86/s2idle.c +++ b/drivers/acpi/x86/s2idle.c @@ -538,12 +538,14 @@ void acpi_s2idle_setup(void) int acpi_register_lps0_dev(struct acpi_s2idle_dev_ops *arg) { + unsigned int sleep_flags; + if (!lps0_device_handle || sleep_no_lps0) return -ENODEV; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); list_add(&arg->list_node, &lps0_s2idle_devops_head); - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); return 0; } @@ -551,12 +553,14 @@ EXPORT_SYMBOL_GPL(acpi_register_lps0_dev void acpi_unregister_lps0_dev(struct acpi_s2idle_dev_ops *arg) { + unsigned int sleep_flags; + if (!lps0_device_handle || sleep_no_lps0) return; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); list_del(&arg->list_node); - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); } EXPORT_SYMBOL_GPL(acpi_unregister_lps0_dev); --- a/drivers/scsi/scsi_transport_spi.c +++ b/drivers/scsi/scsi_transport_spi.c @@ -998,8 +998,9 @@ void spi_dv_device(struct scsi_device *sdev) { struct scsi_target *starget = sdev->sdev_target; - u8 *buffer; const int len = SPI_MAX_ECHO_BUFFER_SIZE*2; + unsigned int sleep_flags; + u8 *buffer; /* * Because this function and the power management code both call @@ -1007,7 +1008,7 @@ spi_dv_device(struct scsi_device *sdev) * while suspend or resume is in progress. Hence the * lock/unlock_system_sleep() calls. */ - lock_system_sleep(); + sleep_flags = lock_system_sleep(); if (scsi_autopm_get_device(sdev)) goto unlock_system_sleep; @@ -1058,7 +1059,7 @@ spi_dv_device(struct scsi_device *sdev) scsi_autopm_put_device(sdev); unlock_system_sleep: - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); } EXPORT_SYMBOL(spi_dv_device); --- a/include/linux/suspend.h +++ b/include/linux/suspend.h @@ -510,8 +510,8 @@ extern bool pm_save_wakeup_count(unsigne extern void pm_wakep_autosleep_enabled(bool set); extern void pm_print_active_wakeup_sources(void); -extern void lock_system_sleep(void); -extern void unlock_system_sleep(void); +extern unsigned int lock_system_sleep(void); +extern void unlock_system_sleep(unsigned int); #else /* !CONFIG_PM_SLEEP */ @@ -534,8 +534,8 @@ static inline void pm_system_wakeup(void static inline void pm_wakeup_clear(bool reset) {} static inline void pm_system_irq_wakeup(unsigned int irq_number) {} -static inline void lock_system_sleep(void) {} -static inline void unlock_system_sleep(void) {} +static inline unsigned int lock_system_sleep(void) { return 0; } +static inline void unlock_system_sleep(unsigned int flags) {} #endif /* !CONFIG_PM_SLEEP */ --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -92,20 +92,24 @@ bool hibernation_available(void) */ void hibernation_set_ops(const struct platform_hibernation_ops *ops) { + unsigned int sleep_flags; + if (ops && !(ops->begin && ops->end && ops->pre_snapshot && ops->prepare && ops->finish && ops->enter && ops->pre_restore && ops->restore_cleanup && ops->leave)) { WARN_ON(1); return; } - lock_system_sleep(); + + sleep_flags = lock_system_sleep(); + hibernation_ops = ops; if (ops) hibernation_mode = HIBERNATION_PLATFORM; else if (hibernation_mode == HIBERNATION_PLATFORM) hibernation_mode = HIBERNATION_SHUTDOWN; - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); } EXPORT_SYMBOL_GPL(hibernation_set_ops); @@ -713,6 +717,7 @@ static int load_image_and_restore(void) int hibernate(void) { bool snapshot_test = false; + unsigned int sleep_flags; int error; if (!hibernation_available()) { @@ -720,7 +725,7 @@ int hibernate(void) return -EPERM; } - lock_system_sleep(); + sleep_flags = lock_system_sleep(); /* The snapshot device should not be opened while we're running */ if (!hibernate_acquire()) { error = -EBUSY; @@ -794,7 +799,7 @@ int hibernate(void) pm_restore_console(); hibernate_release(); Unlock: - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); pr_info("hibernation exit\n"); return error; @@ -809,9 +814,10 @@ int hibernate(void) */ int hibernate_quiet_exec(int (*func)(void *data), void *data) { + unsigned int sleep_flags; int error; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); if (!hibernate_acquire()) { error = -EBUSY; @@ -891,7 +897,7 @@ int hibernate_quiet_exec(int (*func)(voi hibernate_release(); unlock: - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); return error; } @@ -1100,11 +1106,12 @@ static ssize_t disk_show(struct kobject static ssize_t disk_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t n) { + int mode = HIBERNATION_INVALID; + unsigned int sleep_flags; int error = 0; - int i; int len; char *p; - int mode = HIBERNATION_INVALID; + int i; if (!hibernation_available()) return -EPERM; @@ -1112,7 +1119,7 @@ static ssize_t disk_store(struct kobject p = memchr(buf, '\n', n); len = p ? p - buf : n; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) { if (len == strlen(hibernation_modes[i]) && !strncmp(buf, hibernation_modes[i], len)) { @@ -1142,7 +1149,7 @@ static ssize_t disk_store(struct kobject if (!error) pm_pr_dbg("Hibernation mode set to '%s'\n", hibernation_modes[mode]); - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); return error ? error : n; } @@ -1158,9 +1165,10 @@ static ssize_t resume_show(struct kobjec static ssize_t resume_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t n) { - dev_t res; + unsigned int sleep_flags; int len = n; char *name; + dev_t res; if (len && buf[len-1] == '\n') len--; @@ -1173,9 +1181,10 @@ static ssize_t resume_store(struct kobje if (!res) return -EINVAL; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); swsusp_resume_device = res; - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); + pm_pr_dbg("Configured hibernation resume from disk to %u\n", swsusp_resume_device); noresume = 0; --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -21,14 +21,16 @@ #ifdef CONFIG_PM_SLEEP -void lock_system_sleep(void) +unsigned int lock_system_sleep(void) { + unsigned int flags = current->flags; current->flags |= PF_FREEZER_SKIP; mutex_lock(&system_transition_mutex); + return flags; } EXPORT_SYMBOL_GPL(lock_system_sleep); -void unlock_system_sleep(void) +void unlock_system_sleep(unsigned int flags) { /* * Don't use freezer_count() because we don't want the call to @@ -46,7 +48,8 @@ void unlock_system_sleep(void) * Which means, if we use try_to_freeze() here, it would make them * enter the refrigerator, thus causing hibernation to lockup. */ - current->flags &= ~PF_FREEZER_SKIP; + if (!(flags & PF_FREEZER_SKIP)) + current->flags &= ~PF_FREEZER_SKIP; mutex_unlock(&system_transition_mutex); } EXPORT_SYMBOL_GPL(unlock_system_sleep); @@ -260,16 +263,17 @@ static ssize_t pm_test_show(struct kobje static ssize_t pm_test_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t n) { + unsigned int sleep_flags; const char * const *s; + int error = -EINVAL; int level; char *p; int len; - int error = -EINVAL; p = memchr(buf, '\n', n); len = p ? p - buf : n; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); level = TEST_FIRST; for (s = &pm_tests[level]; level <= TEST_MAX; s++, level++) @@ -279,7 +283,7 @@ static ssize_t pm_test_store(struct kobj break; } - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); return error ? error : n; } --- a/kernel/power/suspend.c +++ b/kernel/power/suspend.c @@ -75,9 +75,11 @@ EXPORT_SYMBOL_GPL(pm_suspend_default_s2i void s2idle_set_ops(const struct platform_s2idle_ops *ops) { - lock_system_sleep(); + unsigned int sleep_flags; + + sleep_flags = lock_system_sleep(); s2idle_ops = ops; - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); } static void s2idle_begin(void) @@ -200,7 +202,9 @@ __setup("mem_sleep_default=", mem_sleep_ */ void suspend_set_ops(const struct platform_suspend_ops *ops) { - lock_system_sleep(); + unsigned int sleep_flags; + + sleep_flags = lock_system_sleep(); suspend_ops = ops; @@ -216,7 +220,7 @@ void suspend_set_ops(const struct platfo mem_sleep_current = PM_SUSPEND_MEM; } - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); } EXPORT_SYMBOL_GPL(suspend_set_ops); --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -46,12 +46,13 @@ int is_hibernate_resume_dev(dev_t dev) static int snapshot_open(struct inode *inode, struct file *filp) { struct snapshot_data *data; + unsigned int sleep_flags; int error; if (!hibernation_available()) return -EPERM; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); if (!hibernate_acquire()) { error = -EBUSY; @@ -97,7 +98,7 @@ static int snapshot_open(struct inode *i data->dev = 0; Unlock: - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); return error; } @@ -105,8 +106,9 @@ static int snapshot_open(struct inode *i static int snapshot_release(struct inode *inode, struct file *filp) { struct snapshot_data *data; + unsigned int sleep_flags; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); swsusp_free(); data = filp->private_data; @@ -123,7 +125,7 @@ static int snapshot_release(struct inode PM_POST_HIBERNATION : PM_POST_RESTORE); hibernate_release(); - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); return 0; } @@ -131,11 +133,12 @@ static int snapshot_release(struct inode static ssize_t snapshot_read(struct file *filp, char __user *buf, size_t count, loff_t *offp) { + loff_t pg_offp = *offp & ~PAGE_MASK; struct snapshot_data *data; + unsigned int sleep_flags; ssize_t res; - loff_t pg_offp = *offp & ~PAGE_MASK; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); data = filp->private_data; if (!data->ready) { @@ -156,7 +159,7 @@ static ssize_t snapshot_read(struct file *offp += res; Unlock: - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); return res; } @@ -164,11 +167,12 @@ static ssize_t snapshot_read(struct file static ssize_t snapshot_write(struct file *filp, const char __user *buf, size_t count, loff_t *offp) { + loff_t pg_offp = *offp & ~PAGE_MASK; struct snapshot_data *data; + unsigned int sleep_flags; ssize_t res; - loff_t pg_offp = *offp & ~PAGE_MASK; - lock_system_sleep(); + sleep_flags = lock_system_sleep(); data = filp->private_data; @@ -190,7 +194,7 @@ static ssize_t snapshot_write(struct fil if (res > 0) *offp += res; unlock: - unlock_system_sleep(); + unlock_system_sleep(sleep_flags); return res; } From patchwork Thu Apr 21 15:02:52 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Peter Zijlstra X-Patchwork-Id: 564703 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8B514C433EF for ; Thu, 21 Apr 2022 15:10:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1389898AbiDUPNg (ORCPT ); Thu, 21 Apr 2022 11:13:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46916 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1389877AbiDUPNe (ORCPT ); Thu, 21 Apr 2022 11:13:34 -0400 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2387F2ED7A; Thu, 21 Apr 2022 08:10:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Content-Type:MIME-Version:References: Subject:Cc:To:From:Date:Message-ID:Sender:Reply-To:Content-Transfer-Encoding: Content-ID:Content-Description:In-Reply-To; bh=knAjhsG/6jthQQWVo2QqXwIZ8VPSdd1nuBR6NhivhIM=; b=TOgR8P+26tGgWgiGF/Q8xDLhu4 JXfqmp3WHvsI+j1twYZiPHFhDHAYZoGULR55vSQpneJtLu9vv1lsaFXvIn5EAO08QW9U4o6ON0YEX uhC3S3p0sTi9A7DI0NaZGUDBKN3HKSsulbk1al/snEF5dgBJTw6cJrobBZbQ06A29/7BfQlb+7WVC HeaZAURee2zm/BF/Nn8fa0VPNuSAsoLuXUiS6L9dH9M1L4Upj9YoP2rDceKls+CRuxxl86yW5pm93 hOHaRzQgBHjzj5QQb8ItCzNLMF55Z51WH5NYvZ3//pa8Prm0DMqUouytBpjfwn6e9ZKNBGT/Ji5W1 5a2WN6jw==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1nhYRk-007QKQ-3l; Thu, 21 Apr 2022 15:10:16 +0000 Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (Client did not present a certificate) by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id 9E17E300923; Thu, 21 Apr 2022 17:10:13 +0200 (CEST) Received: by hirez.programming.kicks-ass.net (Postfix, from userid 0) id 7AC582B553DF3; Thu, 21 Apr 2022 17:10:13 +0200 (CEST) Message-ID: <20220421150654.940646810@infradead.org> User-Agent: quilt/0.66 Date: Thu, 21 Apr 2022 17:02:52 +0200 From: Peter Zijlstra To: rjw@rjwysocki.net, oleg@redhat.com, mingo@kernel.org, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, mgorman@suse.de, ebiederm@xmission.com, bigeasy@linutronix.de, Will Deacon Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, tj@kernel.org, linux-pm@vger.kernel.org Subject: [PATCH v2 4/5] freezer,umh: Clean up freezer/initrd interaction References: <20220421150248.667412396@infradead.org> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org handle_initrd() marks itself as PF_FREEZER_SKIP in order to ensure that the UMH, which is going to freeze the system, doesn't indefinitely wait for it's caller. Rework things by adding UMH_FREEZABLE to indicate the completion is freezable. Signed-off-by: Peter Zijlstra (Intel) --- include/linux/umh.h | 9 +++++---- init/do_mounts_initrd.c | 10 +--------- kernel/umh.c | 8 ++++++++ 3 files changed, 14 insertions(+), 13 deletions(-) --- a/include/linux/umh.h +++ b/include/linux/umh.h @@ -11,10 +11,11 @@ struct cred; struct file; -#define UMH_NO_WAIT 0 /* don't wait at all */ -#define UMH_WAIT_EXEC 1 /* wait for the exec, but not the process */ -#define UMH_WAIT_PROC 2 /* wait for the process to complete */ -#define UMH_KILLABLE 4 /* wait for EXEC/PROC killable */ +#define UMH_NO_WAIT 0x00 /* don't wait at all */ +#define UMH_WAIT_EXEC 0x01 /* wait for the exec, but not the process */ +#define UMH_WAIT_PROC 0x02 /* wait for the process to complete */ +#define UMH_KILLABLE 0x04 /* wait for EXEC/PROC killable */ +#define UMH_FREEZABLE 0x08 /* wait for EXEC/PROC freezable */ struct subprocess_info { struct work_struct work; --- a/init/do_mounts_initrd.c +++ b/init/do_mounts_initrd.c @@ -79,19 +79,11 @@ static void __init handle_initrd(void) init_mkdir("/old", 0700); init_chdir("/old"); - /* - * In case that a resume from disk is carried out by linuxrc or one of - * its children, we need to tell the freezer not to wait for us. - */ - current->flags |= PF_FREEZER_SKIP; - info = call_usermodehelper_setup("/linuxrc", argv, envp_init, GFP_KERNEL, init_linuxrc, NULL, NULL); if (!info) return; - call_usermodehelper_exec(info, UMH_WAIT_PROC); - - current->flags &= ~PF_FREEZER_SKIP; + call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE); /* move initrd to rootfs' /old */ init_mount("..", ".", NULL, MS_MOVE, NULL); --- a/kernel/umh.c +++ b/kernel/umh.c @@ -28,6 +28,7 @@ #include #include #include +#include #include @@ -436,6 +437,9 @@ int call_usermodehelper_exec(struct subp if (wait == UMH_NO_WAIT) /* task has freed sub_info */ goto unlock; + if (wait & UMH_FREEZABLE) + freezer_do_not_count(); + if (wait & UMH_KILLABLE) { retval = wait_for_completion_killable(&done); if (!retval) @@ -448,6 +452,10 @@ int call_usermodehelper_exec(struct subp } wait_for_completion(&done); + + if (wait & UMH_FREEZABLE) + freezer_count(); + wait_done: retval = sub_info->retval; out: From patchwork Thu Apr 21 15:02:53 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Peter Zijlstra X-Patchwork-Id: 564701 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0D4FFC4321E for ; Thu, 21 Apr 2022 15:10:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1389911AbiDUPNj (ORCPT ); Thu, 21 Apr 2022 11:13:39 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46932 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1389904AbiDUPNh (ORCPT ); Thu, 21 Apr 2022 11:13:37 -0400 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C364513F97; Thu, 21 Apr 2022 08:10:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Content-Type:MIME-Version:References: Subject:Cc:To:From:Date:Message-ID:Sender:Reply-To:Content-Transfer-Encoding: Content-ID:Content-Description:In-Reply-To; bh=bnYQhapLonX4vhuJ0e6VnrIQMI3O3/mj75Ys59L+Wig=; b=XIEHJIw7ykvD5rH1y7zU5Ly0pt 90zSaIYZsfZcLA8zGJtIsrOWaxogaPmNUtGFRo1cCjW/Oly6Do9kVCsKLx1ecZkzVozerEJk/k2/R dvYrHaAnX/XP1mMOP9IdJGMYBHjUvZ6Hol0vtM2I4+n9mnpd1Z0ENgGn92cVyt0K/j2PMp/YtG+Pd krs40hSr4+Gv5d+r5vutxQfdURmEH1wHP3/RaB2sHjjY/hwjBCN/2lHz4K5uKRIC1fGipPj5oqA+o 7/1IvWALdAvhrGXx8nkDhmTBTU7ZyPm1z9ZVIhvYeLuNCyr9sxUjF/g3/rjmUPGlXsehdDBLGWxCa tYZypAbQ==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=noisy.programming.kicks-ass.net) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1nhYRl-007QL1-IP; Thu, 21 Apr 2022 15:10:18 +0000 Received: from hirez.programming.kicks-ass.net (hirez.programming.kicks-ass.net [192.168.1.225]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (Client did not present a certificate) by noisy.programming.kicks-ass.net (Postfix) with ESMTPS id 409CB300E57; Thu, 21 Apr 2022 17:10:14 +0200 (CEST) Received: by hirez.programming.kicks-ass.net (Postfix, from userid 0) id 8194C2B553DF5; Thu, 21 Apr 2022 17:10:13 +0200 (CEST) Message-ID: <20220421150655.001952823@infradead.org> User-Agent: quilt/0.66 Date: Thu, 21 Apr 2022 17:02:53 +0200 From: Peter Zijlstra To: rjw@rjwysocki.net, oleg@redhat.com, mingo@kernel.org, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, mgorman@suse.de, ebiederm@xmission.com, bigeasy@linutronix.de, Will Deacon Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, tj@kernel.org, linux-pm@vger.kernel.org Subject: [PATCH v2 5/5] freezer,sched: Rewrite core freezer logic References: <20220421150248.667412396@infradead.org> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. Signed-off-by: Peter Zijlstra (Intel) --- drivers/android/binder.c | 4 drivers/media/pci/pt3/pt3.c | 4 fs/cifs/inode.c | 4 fs/cifs/transport.c | 5 fs/coredump.c | 5 fs/nfs/file.c | 3 fs/nfs/inode.c | 12 -- fs/nfs/nfs3proc.c | 3 fs/nfs/nfs4proc.c | 14 +- fs/nfs/nfs4state.c | 3 fs/nfs/pnfs.c | 4 fs/xfs/xfs_trans_ail.c | 8 - include/linux/completion.h | 1 include/linux/freezer.h | 244 +---------------------------------------- include/linux/sched.h | 41 +++--- include/linux/sunrpc/sched.h | 7 - include/linux/wait.h | 40 +++++- kernel/cgroup/legacy_freezer.c | 23 +-- kernel/exit.c | 4 kernel/fork.c | 5 kernel/freezer.c | 137 ++++++++++++++++------- kernel/futex/waitwake.c | 8 - kernel/hung_task.c | 4 kernel/power/main.c | 6 - kernel/power/process.c | 10 - kernel/ptrace.c | 2 kernel/sched/completion.c | 9 + kernel/sched/core.c | 19 ++- kernel/signal.c | 14 +- kernel/time/hrtimer.c | 4 kernel/umh.c | 20 +-- mm/khugepaged.c | 4 net/sunrpc/sched.c | 12 -- net/unix/af_unix.c | 8 - 34 files changed, 281 insertions(+), 410 deletions(-) --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -4034,10 +4034,9 @@ static int binder_wait_for_work(struct b struct binder_proc *proc = thread->proc; int ret = 0; - freezer_do_not_count(); binder_inner_proc_lock(proc); for (;;) { - prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE); + prepare_to_wait(&thread->wait, &wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE); if (binder_has_work_ilocked(thread, do_proc_work)) break; if (do_proc_work) @@ -4054,7 +4053,6 @@ static int binder_wait_for_work(struct b } finish_wait(&thread->wait, &wait); binder_inner_proc_unlock(proc); - freezer_count(); return ret; } --- a/drivers/media/pci/pt3/pt3.c +++ b/drivers/media/pci/pt3/pt3.c @@ -445,8 +445,8 @@ static int pt3_fetch_thread(void *data) pt3_proc_dma(adap); delay = ktime_set(0, PT3_FETCH_DELAY * NSEC_PER_MSEC); - set_current_state(TASK_UNINTERRUPTIBLE); - freezable_schedule_hrtimeout_range(&delay, + set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE); + schedule_hrtimeout_range(&delay, PT3_FETCH_DELAY_DELTA * NSEC_PER_MSEC, HRTIMER_MODE_REL); } --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -2286,7 +2286,7 @@ cifs_invalidate_mapping(struct inode *in static int cifs_wait_bit_killable(struct wait_bit_key *key, int mode) { - freezable_schedule_unsafe(); + schedule(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0; @@ -2304,7 +2304,7 @@ cifs_revalidate_mapping(struct inode *in return 0; rc = wait_on_bit_lock_action(flags, CIFS_INO_LOCK, cifs_wait_bit_killable, - TASK_KILLABLE); + TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); if (rc) return rc; --- a/fs/cifs/transport.c +++ b/fs/cifs/transport.c @@ -760,8 +760,9 @@ wait_for_response(struct TCP_Server_Info { int error; - error = wait_event_freezekillable_unsafe(server->response_q, - midQ->mid_state != MID_REQUEST_SUBMITTED); + error = wait_event_state(server->response_q, + midQ->mid_state != MID_REQUEST_SUBMITTED, + (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE)); if (error < 0) return -ERESTARTSYS; --- a/fs/coredump.c +++ b/fs/coredump.c @@ -402,9 +402,8 @@ static int coredump_wait(int exit_code, if (core_waiters > 0) { struct core_thread *ptr; - freezer_do_not_count(); - wait_for_completion(&core_state->startup); - freezer_count(); + wait_for_completion_state(&core_state->startup, + TASK_UNINTERRUPTIBLE|TASK_FREEZABLE); /* * Wait for all the threads to become inactive, so that * all the thread context (extended register state, like --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -565,7 +565,8 @@ static vm_fault_t nfs_vm_page_mkwrite(st } wait_on_bit_action(&NFS_I(inode)->flags, NFS_INO_INVALIDATING, - nfs_wait_bit_killable, TASK_KILLABLE); + nfs_wait_bit_killable, + TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); lock_page(page); mapping = page_file_mapping(page); --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -72,18 +72,13 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fat return nfs_fileid_to_ino_t(fattr->fileid); } -static int nfs_wait_killable(int mode) +int nfs_wait_bit_killable(struct wait_bit_key *key, int mode) { - freezable_schedule_unsafe(); + schedule(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0; } - -int nfs_wait_bit_killable(struct wait_bit_key *key, int mode) -{ - return nfs_wait_killable(mode); -} EXPORT_SYMBOL_GPL(nfs_wait_bit_killable); /** @@ -1331,7 +1326,8 @@ int nfs_clear_invalid_mapping(struct add */ for (;;) { ret = wait_on_bit_action(bitlock, NFS_INO_INVALIDATING, - nfs_wait_bit_killable, TASK_KILLABLE); + nfs_wait_bit_killable, + TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); if (ret) goto out; spin_lock(&inode->i_lock); --- a/fs/nfs/nfs3proc.c +++ b/fs/nfs/nfs3proc.c @@ -36,7 +36,8 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt, res = rpc_call_sync(clnt, msg, flags); if (res != -EJUKEBOX) break; - freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME); + __set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); + schedule_timeout(NFS_JUKEBOX_RETRY_TIME); res = -ERESTARTSYS; } while (!fatal_signal_pending(current)); return res; --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -408,8 +408,8 @@ static int nfs4_delay_killable(long *tim { might_sleep(); - freezable_schedule_timeout_killable_unsafe( - nfs4_update_delay(timeout)); + __set_current_state(TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); + schedule_timeout(nfs4_update_delay(timeout)); if (!__fatal_signal_pending(current)) return 0; return -EINTR; @@ -419,7 +419,8 @@ static int nfs4_delay_interruptible(long { might_sleep(); - freezable_schedule_timeout_interruptible_unsafe(nfs4_update_delay(timeout)); + __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE_UNSAFE); + schedule_timeout(nfs4_update_delay(timeout)); if (!signal_pending(current)) return 0; return __fatal_signal_pending(current) ? -EINTR :-ERESTARTSYS; @@ -7363,7 +7364,8 @@ nfs4_retry_setlk_simple(struct nfs4_stat status = nfs4_proc_setlk(state, cmd, request); if ((status != -EAGAIN) || IS_SETLK(cmd)) break; - freezable_schedule_timeout_interruptible(timeout); + __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); + schedule_timeout(timeout); timeout *= 2; timeout = min_t(unsigned long, NFS4_LOCK_MAXTIMEOUT, timeout); status = -ERESTARTSYS; @@ -7431,10 +7433,8 @@ nfs4_retry_setlk(struct nfs4_state *stat break; status = -ERESTARTSYS; - freezer_do_not_count(); - wait_woken(&waiter.wait, TASK_INTERRUPTIBLE, + wait_woken(&waiter.wait, TASK_INTERRUPTIBLE|TASK_FREEZABLE, NFS4_LOCK_MAXTIMEOUT); - freezer_count(); } while (!signalled()); remove_wait_queue(q, &waiter.wait); --- a/fs/nfs/nfs4state.c +++ b/fs/nfs/nfs4state.c @@ -1314,7 +1314,8 @@ int nfs4_wait_clnt_recover(struct nfs_cl refcount_inc(&clp->cl_count); res = wait_on_bit_action(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING, - nfs_wait_bit_killable, TASK_KILLABLE); + nfs_wait_bit_killable, + TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); if (res) goto out; if (clp->cl_cons_state < 0) --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -1907,7 +1907,7 @@ static int pnfs_prepare_to_retry_layoutg pnfs_layoutcommit_inode(lo->plh_inode, false); return wait_on_bit_action(&lo->plh_flags, NFS_LAYOUT_RETURN, nfs_wait_bit_killable, - TASK_KILLABLE); + TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); } static void nfs_layoutget_begin(struct pnfs_layout_hdr *lo) @@ -3182,7 +3182,7 @@ pnfs_layoutcommit_inode(struct inode *in status = wait_on_bit_lock_action(&nfsi->flags, NFS_INO_LAYOUTCOMMITTING, nfs_wait_bit_killable, - TASK_KILLABLE); + TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); if (status) goto out; } --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -602,9 +602,9 @@ xfsaild( while (1) { if (tout && tout <= 20) - set_current_state(TASK_KILLABLE); + set_current_state(TASK_KILLABLE|TASK_FREEZABLE); else - set_current_state(TASK_INTERRUPTIBLE); + set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); /* * Check kthread_should_stop() after we set the task state to @@ -653,14 +653,14 @@ xfsaild( ailp->ail_target == ailp->ail_target_prev && list_empty(&ailp->ail_buf_list)) { spin_unlock(&ailp->ail_lock); - freezable_schedule(); + schedule(); tout = 0; continue; } spin_unlock(&ailp->ail_lock); if (tout) - freezable_schedule_timeout(msecs_to_jiffies(tout)); + schedule_timeout(msecs_to_jiffies(tout)); __set_current_state(TASK_RUNNING); --- a/include/linux/completion.h +++ b/include/linux/completion.h @@ -103,6 +103,7 @@ extern void wait_for_completion(struct c extern void wait_for_completion_io(struct completion *); extern int wait_for_completion_interruptible(struct completion *x); extern int wait_for_completion_killable(struct completion *x); +extern int wait_for_completion_state(struct completion *x, unsigned int state); extern unsigned long wait_for_completion_timeout(struct completion *x, unsigned long timeout); extern unsigned long wait_for_completion_io_timeout(struct completion *x, --- a/include/linux/freezer.h +++ b/include/linux/freezer.h @@ -8,9 +8,11 @@ #include #include #include +#include #ifdef CONFIG_FREEZER -extern atomic_t system_freezing_cnt; /* nr of freezing conds in effect */ +DECLARE_STATIC_KEY_FALSE(freezer_active); + extern bool pm_freezing; /* PM freezing in effect */ extern bool pm_nosig_freezing; /* PM nosig freezing in effect */ @@ -22,10 +24,7 @@ extern unsigned int freeze_timeout_msecs /* * Check if a process has been frozen */ -static inline bool frozen(struct task_struct *p) -{ - return p->flags & PF_FROZEN; -} +extern bool frozen(struct task_struct *p); extern bool freezing_slow_path(struct task_struct *p); @@ -34,9 +33,10 @@ extern bool freezing_slow_path(struct ta */ static inline bool freezing(struct task_struct *p) { - if (likely(!atomic_read(&system_freezing_cnt))) - return false; - return freezing_slow_path(p); + if (static_branch_unlikely(&freezer_active)) + return freezing_slow_path(p); + + return false; } /* Takes and releases task alloc lock using task_lock() */ @@ -48,23 +48,14 @@ extern int freeze_kernel_threads(void); extern void thaw_processes(void); extern void thaw_kernel_threads(void); -/* - * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION - * If try_to_freeze causes a lockdep warning it means the caller may deadlock - */ -static inline bool try_to_freeze_unsafe(void) +static inline bool try_to_freeze(void) { might_sleep(); if (likely(!freezing(current))) return false; - return __refrigerator(false); -} - -static inline bool try_to_freeze(void) -{ if (!(current->flags & PF_NOFREEZE)) debug_check_no_locks_held(); - return try_to_freeze_unsafe(); + return __refrigerator(false); } extern bool freeze_task(struct task_struct *p); @@ -79,195 +70,6 @@ static inline bool cgroup_freezing(struc } #endif /* !CONFIG_CGROUP_FREEZER */ -/* - * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it - * calls wait_for_completion(&vfork) and reset right after it returns from this - * function. Next, the parent should call try_to_freeze() to freeze itself - * appropriately in case the child has exited before the freezing of tasks is - * complete. However, we don't want kernel threads to be frozen in unexpected - * places, so we allow them to block freeze_processes() instead or to set - * PF_NOFREEZE if needed. Fortunately, in the ____call_usermodehelper() case the - * parent won't really block freeze_processes(), since ____call_usermodehelper() - * (the child) does a little before exec/exit and it can't be frozen before - * waking up the parent. - */ - - -/** - * freezer_do_not_count - tell freezer to ignore %current - * - * Tell freezers to ignore the current task when determining whether the - * target frozen state is reached. IOW, the current task will be - * considered frozen enough by freezers. - * - * The caller shouldn't do anything which isn't allowed for a frozen task - * until freezer_cont() is called. Usually, freezer[_do_not]_count() pair - * wrap a scheduling operation and nothing much else. - */ -static inline void freezer_do_not_count(void) -{ - current->flags |= PF_FREEZER_SKIP; -} - -/** - * freezer_count - tell freezer to stop ignoring %current - * - * Undo freezer_do_not_count(). It tells freezers that %current should be - * considered again and tries to freeze if freezing condition is already in - * effect. - */ -static inline void freezer_count(void) -{ - current->flags &= ~PF_FREEZER_SKIP; - /* - * If freezing is in progress, the following paired with smp_mb() - * in freezer_should_skip() ensures that either we see %true - * freezing() or freezer_should_skip() sees !PF_FREEZER_SKIP. - */ - smp_mb(); - try_to_freeze(); -} - -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ -static inline void freezer_count_unsafe(void) -{ - current->flags &= ~PF_FREEZER_SKIP; - smp_mb(); - try_to_freeze_unsafe(); -} - -/** - * freezer_should_skip - whether to skip a task when determining frozen - * state is reached - * @p: task in quesion - * - * This function is used by freezers after establishing %true freezing() to - * test whether a task should be skipped when determining the target frozen - * state is reached. IOW, if this function returns %true, @p is considered - * frozen enough. - */ -static inline bool freezer_should_skip(struct task_struct *p) -{ - /* - * The following smp_mb() paired with the one in freezer_count() - * ensures that either freezer_count() sees %true freezing() or we - * see cleared %PF_FREEZER_SKIP and return %false. This makes it - * impossible for a task to slip frozen state testing after - * clearing %PF_FREEZER_SKIP. - */ - smp_mb(); - return p->flags & PF_FREEZER_SKIP; -} - -/* - * These functions are intended to be used whenever you want allow a sleeping - * task to be frozen. Note that neither return any clear indication of - * whether a freeze event happened while in this function. - */ - -/* Like schedule(), but should not block the freezer. */ -static inline void freezable_schedule(void) -{ - freezer_do_not_count(); - schedule(); - freezer_count(); -} - -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ -static inline void freezable_schedule_unsafe(void) -{ - freezer_do_not_count(); - schedule(); - freezer_count_unsafe(); -} - -/* - * Like schedule_timeout(), but should not block the freezer. Do not - * call this with locks held. - */ -static inline long freezable_schedule_timeout(long timeout) -{ - long __retval; - freezer_do_not_count(); - __retval = schedule_timeout(timeout); - freezer_count(); - return __retval; -} - -/* - * Like schedule_timeout_interruptible(), but should not block the freezer. Do not - * call this with locks held. - */ -static inline long freezable_schedule_timeout_interruptible(long timeout) -{ - long __retval; - freezer_do_not_count(); - __retval = schedule_timeout_interruptible(timeout); - freezer_count(); - return __retval; -} - -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ -static inline long freezable_schedule_timeout_interruptible_unsafe(long timeout) -{ - long __retval; - - freezer_do_not_count(); - __retval = schedule_timeout_interruptible(timeout); - freezer_count_unsafe(); - return __retval; -} - -/* Like schedule_timeout_killable(), but should not block the freezer. */ -static inline long freezable_schedule_timeout_killable(long timeout) -{ - long __retval; - freezer_do_not_count(); - __retval = schedule_timeout_killable(timeout); - freezer_count(); - return __retval; -} - -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ -static inline long freezable_schedule_timeout_killable_unsafe(long timeout) -{ - long __retval; - freezer_do_not_count(); - __retval = schedule_timeout_killable(timeout); - freezer_count_unsafe(); - return __retval; -} - -/* - * Like schedule_hrtimeout_range(), but should not block the freezer. Do not - * call this with locks held. - */ -static inline int freezable_schedule_hrtimeout_range(ktime_t *expires, - u64 delta, const enum hrtimer_mode mode) -{ - int __retval; - freezer_do_not_count(); - __retval = schedule_hrtimeout_range(expires, delta, mode); - freezer_count(); - return __retval; -} - -/* - * Freezer-friendly wrappers around wait_event_interruptible(), - * wait_event_killable() and wait_event_interruptible_timeout(), originally - * defined in - */ - -/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */ -#define wait_event_freezekillable_unsafe(wq, condition) \ -({ \ - int __retval; \ - freezer_do_not_count(); \ - __retval = wait_event_killable(wq, (condition)); \ - freezer_count_unsafe(); \ - __retval; \ -}) - #else /* !CONFIG_FREEZER */ static inline bool frozen(struct task_struct *p) { return false; } static inline bool freezing(struct task_struct *p) { return false; } @@ -281,35 +83,9 @@ static inline void thaw_kernel_threads(v static inline bool try_to_freeze(void) { return false; } -static inline void freezer_do_not_count(void) {} static inline void freezer_count(void) {} -static inline int freezer_should_skip(struct task_struct *p) { return 0; } static inline void set_freezable(void) {} -#define freezable_schedule() schedule() - -#define freezable_schedule_unsafe() schedule() - -#define freezable_schedule_timeout(timeout) schedule_timeout(timeout) - -#define freezable_schedule_timeout_interruptible(timeout) \ - schedule_timeout_interruptible(timeout) - -#define freezable_schedule_timeout_interruptible_unsafe(timeout) \ - schedule_timeout_interruptible(timeout) - -#define freezable_schedule_timeout_killable(timeout) \ - schedule_timeout_killable(timeout) - -#define freezable_schedule_timeout_killable_unsafe(timeout) \ - schedule_timeout_killable(timeout) - -#define freezable_schedule_hrtimeout_range(expires, delta, mode) \ - schedule_hrtimeout_range(expires, delta, mode) - -#define wait_event_freezekillable_unsafe(wq, condition) \ - wait_event_killable(wq, condition) - #endif /* !CONFIG_FREEZER */ #endif /* FREEZER_H_INCLUDED */ --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -80,25 +80,32 @@ struct task_group; */ /* Used in tsk->state: */ -#define TASK_RUNNING 0x0000 -#define TASK_INTERRUPTIBLE 0x0001 -#define TASK_UNINTERRUPTIBLE 0x0002 -#define __TASK_STOPPED 0x0004 -#define __TASK_TRACED 0x0008 +#define TASK_RUNNING 0x000000 +#define TASK_INTERRUPTIBLE 0x000001 +#define TASK_UNINTERRUPTIBLE 0x000002 +#define __TASK_STOPPED 0x000004 +#define __TASK_TRACED 0x000008 /* Used in tsk->exit_state: */ -#define EXIT_DEAD 0x0010 -#define EXIT_ZOMBIE 0x0020 +#define EXIT_DEAD 0x000010 +#define EXIT_ZOMBIE 0x000020 #define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD) /* Used in tsk->state again: */ -#define TASK_PARKED 0x0040 -#define TASK_DEAD 0x0080 -#define TASK_WAKEKILL 0x0100 -#define TASK_WAKING 0x0200 -#define TASK_NOLOAD 0x0400 -#define TASK_NEW 0x0800 -/* RT specific auxilliary flag to mark RT lock waiters */ -#define TASK_RTLOCK_WAIT 0x1000 -#define TASK_STATE_MAX 0x2000 +#define TASK_PARKED 0x000040 +#define TASK_DEAD 0x000080 +#define TASK_WAKEKILL 0x000100 +#define TASK_WAKING 0x000200 +#define TASK_NOLOAD 0x000400 +#define TASK_NEW 0x000800 +#define TASK_FREEZABLE 0x001000 +#define __TASK_FREEZABLE_UNSAFE (0x002000 * IS_ENABLED(CONFIG_LOCKDEP)) +#define TASK_FROZEN 0x004000 +#define TASK_RTLOCK_WAIT 0x008000 +#define TASK_STATE_MAX 0x010000 + +/* + * DO NOT ADD ANY NEW USERS ! + */ +#define TASK_FREEZABLE_UNSAFE (TASK_FREEZABLE | __TASK_FREEZABLE_UNSAFE) /* Convenience macros for the sake of set_current_state: */ #define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE) @@ -1698,7 +1705,6 @@ extern struct pid *cad_pid; #define PF_NPROC_EXCEEDED 0x00001000 /* set_user() noticed that RLIMIT_NPROC was exceeded */ #define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */ #define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */ -#define PF_FROZEN 0x00010000 /* Frozen for system suspend */ #define PF_KSWAPD 0x00020000 /* I am kswapd */ #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ @@ -1709,7 +1715,6 @@ extern struct pid *cad_pid; #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ #define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */ -#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */ /* --- a/include/linux/sunrpc/sched.h +++ b/include/linux/sunrpc/sched.h @@ -252,7 +252,7 @@ int rpc_malloc(struct rpc_task *); void rpc_free(struct rpc_task *); int rpciod_up(void); void rpciod_down(void); -int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *); +int rpc_wait_for_completion_task(struct rpc_task *task); #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) struct net; void rpc_show_tasks(struct net *); @@ -264,11 +264,6 @@ extern struct workqueue_struct *xprtiod_ void rpc_prepare_task(struct rpc_task *task); gfp_t rpc_task_gfp_mask(void); -static inline int rpc_wait_for_completion_task(struct rpc_task *task) -{ - return __rpc_wait_for_completion_task(task, NULL); -} - #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) || IS_ENABLED(CONFIG_TRACEPOINTS) static inline const char * rpc_qname(const struct rpc_wait_queue *q) { --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -361,8 +361,8 @@ do { \ } while (0) #define __wait_event_freezable(wq_head, condition) \ - ___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0, \ - freezable_schedule()) + ___wait_event(wq_head, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), \ + 0, 0, schedule()) /** * wait_event_freezable - sleep (or freeze) until a condition gets true @@ -420,8 +420,8 @@ do { \ #define __wait_event_freezable_timeout(wq_head, condition, timeout) \ ___wait_event(wq_head, ___wait_cond_timeout(condition), \ - TASK_INTERRUPTIBLE, 0, timeout, \ - __ret = freezable_schedule_timeout(__ret)) + (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 0, timeout, \ + __ret = schedule_timeout(__ret)) /* * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid @@ -641,8 +641,8 @@ do { \ #define __wait_event_freezable_exclusive(wq, condition) \ - ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0, \ - freezable_schedule()) + ___wait_event(wq, condition, (TASK_INTERRUPTIBLE|TASK_FREEZABLE), 1, 0,\ + schedule()) #define wait_event_freezable_exclusive(wq, condition) \ ({ \ @@ -931,6 +931,34 @@ extern int do_wait_intr_irq(wait_queue_h __ret; \ }) +#define __wait_event_state(wq, condition, state) \ + ___wait_event(wq, condition, state, 0, 0, schedule()) + +/** + * wait_event_state - sleep until a condition gets true + * @wq_head: the waitqueue to wait on + * @condition: a C expression for the event to wait for + * @state: state to sleep in + * + * The process is put to sleep (@state) until the @condition evaluates to true + * or a signal is received. The @condition is checked each time the waitqueue + * @wq_head is woken up. + * + * wake_up() has to be called after changing any variable that could + * change the result of the wait condition. + * + * The function will return -ERESTARTSYS if it was interrupted by a + * signal and 0 if @condition evaluated to true. + */ +#define wait_event_state(wq_head, condition, state) \ +({ \ + int __ret = 0; \ + might_sleep(); \ + if (!(condition)) \ + __ret = __wait_event_state(wq_head, condition, state); \ + __ret; \ +}) + #define __wait_event_killable_timeout(wq_head, condition, timeout) \ ___wait_event(wq_head, ___wait_cond_timeout(condition), \ TASK_KILLABLE, 0, timeout, \ --- a/kernel/cgroup/legacy_freezer.c +++ b/kernel/cgroup/legacy_freezer.c @@ -113,7 +113,7 @@ static int freezer_css_online(struct cgr if (parent && (parent->state & CGROUP_FREEZING)) { freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN; - atomic_inc(&system_freezing_cnt); + static_branch_inc(&freezer_active); } mutex_unlock(&freezer_mutex); @@ -134,7 +134,7 @@ static void freezer_css_offline(struct c mutex_lock(&freezer_mutex); if (freezer->state & CGROUP_FREEZING) - atomic_dec(&system_freezing_cnt); + static_branch_dec(&freezer_active); freezer->state = 0; @@ -179,6 +179,7 @@ static void freezer_attach(struct cgroup __thaw_task(task); } else { freeze_task(task); + /* clear FROZEN and propagate upwards */ while (freezer && (freezer->state & CGROUP_FROZEN)) { freezer->state &= ~CGROUP_FROZEN; @@ -271,16 +272,8 @@ static void update_if_frozen(struct cgro css_task_iter_start(css, 0, &it); while ((task = css_task_iter_next(&it))) { - if (freezing(task)) { - /* - * freezer_should_skip() indicates that the task - * should be skipped when determining freezing - * completion. Consider it frozen in addition to - * the usual frozen condition. - */ - if (!frozen(task) && !freezer_should_skip(task)) - goto out_iter_end; - } + if (freezing(task) && !frozen(task)) + goto out_iter_end; } freezer->state |= CGROUP_FROZEN; @@ -357,7 +350,7 @@ static void freezer_apply_state(struct f if (freeze) { if (!(freezer->state & CGROUP_FREEZING)) - atomic_inc(&system_freezing_cnt); + static_branch_inc(&freezer_active); freezer->state |= state; freeze_cgroup(freezer); } else { @@ -366,9 +359,9 @@ static void freezer_apply_state(struct f freezer->state &= ~state; if (!(freezer->state & CGROUP_FREEZING)) { - if (was_freezing) - atomic_dec(&system_freezing_cnt); freezer->state &= ~CGROUP_FROZEN; + if (was_freezing) + static_branch_dec(&freezer_active); unfreeze_cgroup(freezer); } } --- a/kernel/exit.c +++ b/kernel/exit.c @@ -374,10 +374,10 @@ static void coredump_task_exit(struct ta complete(&core_state->startup); for (;;) { - set_current_state(TASK_UNINTERRUPTIBLE); + set_current_state(TASK_UNINTERRUPTIBLE|TASK_FREEZABLE); if (!self.task) /* see coredump_finish() */ break; - freezable_schedule(); + schedule(); } __set_current_state(TASK_RUNNING); } --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1417,13 +1417,12 @@ static void complete_vfork_done(struct t static int wait_for_vfork_done(struct task_struct *child, struct completion *vfork) { + unsigned int state = TASK_UNINTERRUPTIBLE|TASK_KILLABLE|TASK_FREEZABLE; int killed; - freezer_do_not_count(); cgroup_enter_frozen(); - killed = wait_for_completion_killable(vfork); + killed = wait_for_completion_state(vfork, state); cgroup_leave_frozen(false); - freezer_count(); if (killed) { task_lock(child); --- a/kernel/freezer.c +++ b/kernel/freezer.c @@ -13,10 +13,11 @@ #include /* total number of freezing conditions in effect */ -atomic_t system_freezing_cnt = ATOMIC_INIT(0); -EXPORT_SYMBOL(system_freezing_cnt); +DEFINE_STATIC_KEY_FALSE(freezer_active); +EXPORT_SYMBOL(freezer_active); -/* indicate whether PM freezing is in effect, protected by +/* + * indicate whether PM freezing is in effect, protected by * system_transition_mutex */ bool pm_freezing; @@ -29,7 +30,7 @@ static DEFINE_SPINLOCK(freezer_lock); * freezing_slow_path - slow path for testing whether a task needs to be frozen * @p: task to be tested * - * This function is called by freezing() if system_freezing_cnt isn't zero + * This function is called by freezing() if freezer_active isn't zero * and tests whether @p needs to enter and stay in frozen state. Can be * called under any context. The freezers are responsible for ensuring the * target tasks see the updated state. @@ -52,41 +53,40 @@ bool freezing_slow_path(struct task_stru } EXPORT_SYMBOL(freezing_slow_path); +bool frozen(struct task_struct *p) +{ + return READ_ONCE(p->__state) & TASK_FROZEN; +} + /* Refrigerator is place where frozen processes are stored :-). */ bool __refrigerator(bool check_kthr_stop) { - /* Hmm, should we be allowed to suspend when there are realtime - processes around? */ + unsigned int state = get_current_state(); bool was_frozen = false; - unsigned int save = get_current_state(); pr_debug("%s entered refrigerator\n", current->comm); + WARN_ON_ONCE(state && !(state & TASK_NORMAL)); + for (;;) { - set_current_state(TASK_UNINTERRUPTIBLE); + bool freeze; + + set_current_state(TASK_FROZEN); spin_lock_irq(&freezer_lock); - current->flags |= PF_FROZEN; - if (!freezing(current) || - (check_kthr_stop && kthread_should_stop())) - current->flags &= ~PF_FROZEN; + freeze = freezing(current) && !(check_kthr_stop && kthread_should_stop()); spin_unlock_irq(&freezer_lock); - if (!(current->flags & PF_FROZEN)) + if (!freeze) break; + was_frozen = true; schedule(); } + __set_current_state(TASK_RUNNING); pr_debug("%s left refrigerator\n", current->comm); - /* - * Restore saved task state before returning. The mb'd version - * needs to be used; otherwise, it might silently break - * synchronization which depends on ordered task state change. - */ - set_current_state(save); - return was_frozen; } EXPORT_SYMBOL(__refrigerator); @@ -101,6 +101,44 @@ static void fake_signal_wake_up(struct t } } +static int __set_task_frozen(struct task_struct *p, void *arg) +{ + unsigned int state = READ_ONCE(p->__state); + + if (p->on_rq) + return 0; + + if (p != current && task_curr(p)) + return 0; + + if (!(state & (TASK_FREEZABLE | __TASK_STOPPED | __TASK_TRACED))) + return 0; + + /* + * Only TASK_NORMAL can be augmented with TASK_FREEZABLE, since they + * can suffer spurious wakeups. + */ + if (state & TASK_FREEZABLE) + WARN_ON_ONCE(!(state & TASK_NORMAL)); + +#ifdef CONFIG_LOCKDEP + /* + * It's dangerous to freeze with locks held; there be dragons there. + */ + if (!(state & __TASK_FREEZABLE_UNSAFE)) + WARN_ON_ONCE(debug_locks && p->lockdep_depth); +#endif + + WRITE_ONCE(p->__state, TASK_FROZEN); + return TASK_FROZEN; +} + +static bool __freeze_task(struct task_struct *p) +{ + /* TASK_FREEZABLE|TASK_STOPPED|TASK_TRACED -> TASK_FROZEN */ + return task_call_func(p, __set_task_frozen, NULL); +} + /** * freeze_task - send a freeze request to given task * @p: task to send the request to @@ -116,20 +154,8 @@ bool freeze_task(struct task_struct *p) { unsigned long flags; - /* - * This check can race with freezer_do_not_count, but worst case that - * will result in an extra wakeup being sent to the task. It does not - * race with freezer_count(), the barriers in freezer_count() and - * freezer_should_skip() ensure that either freezer_count() sees - * freezing == true in try_to_freeze() and freezes, or - * freezer_should_skip() sees !PF_FREEZE_SKIP and freezes the task - * normally. - */ - if (freezer_should_skip(p)) - return false; - spin_lock_irqsave(&freezer_lock, flags); - if (!freezing(p) || frozen(p)) { + if (!freezing(p) || frozen(p) || __freeze_task(p)) { spin_unlock_irqrestore(&freezer_lock, flags); return false; } @@ -137,19 +163,56 @@ bool freeze_task(struct task_struct *p) if (!(p->flags & PF_KTHREAD)) fake_signal_wake_up(p); else - wake_up_state(p, TASK_INTERRUPTIBLE); + wake_up_state(p, TASK_NORMAL); spin_unlock_irqrestore(&freezer_lock, flags); return true; } +/* + * The special task states (TASK_STOPPED, TASK_TRACED) keep their canonical + * state in p->jobctl. If either of them got a wakeup that was missed because + * TASK_FROZEN, then their canonical state reflects that and the below will + * refuse to restore the special state and instead issue the wakeup. + */ +static int __set_task_special(struct task_struct *p, void *arg) +{ + unsigned int state = 0; + + if (p->jobctl & JOBCTL_TRACED) + state = TASK_TRACED; + + else if (p->jobctl & JOBCTL_STOPPED) + state = TASK_STOPPED; + + if (__fatal_signal_pending(p) && + !(p->jobctl & JOBCTL_DELAY_WAKEKILL)) + state = 0; + + if (state) + WRITE_ONCE(p->__state, state); + + return state; +} + void __thaw_task(struct task_struct *p) { - unsigned long flags; + unsigned long flags, flags2; spin_lock_irqsave(&freezer_lock, flags); - if (frozen(p)) - wake_up_process(p); + if (WARN_ON_ONCE(freezing(p))) + goto unlock; + + if (lock_task_sighand(p, &flags2)) { + /* TASK_FROZEN -> TASK_{STOPPED,TRACED} */ + bool ret = task_call_func(p, __set_task_special, NULL); + unlock_task_sighand(p, &flags2); + if (ret) + goto unlock; + } + + wake_up_state(p, TASK_FROZEN); +unlock: spin_unlock_irqrestore(&freezer_lock, flags); } --- a/kernel/futex/waitwake.c +++ b/kernel/futex/waitwake.c @@ -334,7 +334,7 @@ void futex_wait_queue(struct futex_hash_ * futex_queue() calls spin_unlock() upon completion, both serializing * access to the hash list and forcing another memory barrier. */ - set_current_state(TASK_INTERRUPTIBLE); + set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); futex_queue(q, hb); /* Arm the timer */ @@ -352,7 +352,7 @@ void futex_wait_queue(struct futex_hash_ * is no timeout, or if it has yet to expire. */ if (!timeout || timeout->task) - freezable_schedule(); + schedule(); } __set_current_state(TASK_RUNNING); } @@ -430,7 +430,7 @@ static int futex_wait_multiple_setup(str return ret; } - set_current_state(TASK_INTERRUPTIBLE); + set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); for (i = 0; i < count; i++) { u32 __user *uaddr = (u32 __user *)(unsigned long)vs[i].w.uaddr; @@ -504,7 +504,7 @@ static void futex_sleep_multiple(struct return; } - freezable_schedule(); + schedule(); } /** --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -95,8 +95,8 @@ static void check_hung_task(struct task_ * Ensure the task is not frozen. * Also, skip vfork and any other user process that freezer should skip. */ - if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP))) - return; + if (unlikely(READ_ONCE(t->__state) & (TASK_FREEZABLE | TASK_FROZEN))) + return; /* * When a freshly created task is scheduled once, changes its state to --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -24,7 +24,7 @@ unsigned int lock_system_sleep(void) { unsigned int flags = current->flags; - current->flags |= PF_FREEZER_SKIP; + current->flags |= PF_NOFREEZE; mutex_lock(&system_transition_mutex); return flags; } @@ -48,8 +48,8 @@ void unlock_system_sleep(unsigned int fl * Which means, if we use try_to_freeze() here, it would make them * enter the refrigerator, thus causing hibernation to lockup. */ - if (!(flags & PF_FREEZER_SKIP)) - current->flags &= ~PF_FREEZER_SKIP; + if (!(flags & PF_NOFREEZE)) + current->flags &= ~PF_NOFREEZE; mutex_unlock(&system_transition_mutex); } EXPORT_SYMBOL_GPL(unlock_system_sleep); --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -53,8 +53,7 @@ static int try_to_freeze_tasks(bool user if (p == current || !freeze_task(p)) continue; - if (!freezer_should_skip(p)) - todo++; + todo++; } read_unlock(&tasklist_lock); @@ -99,8 +98,7 @@ static int try_to_freeze_tasks(bool user if (!wakeup || pm_debug_messages_on) { read_lock(&tasklist_lock); for_each_process_thread(g, p) { - if (p != current && !freezer_should_skip(p) - && freezing(p) && !frozen(p)) + if (p != current && freezing(p) && !frozen(p)) sched_show_task(p); } read_unlock(&tasklist_lock); @@ -132,7 +130,7 @@ int freeze_processes(void) current->flags |= PF_SUSPEND_TASK; if (!pm_freezing) - atomic_inc(&system_freezing_cnt); + static_branch_inc(&freezer_active); pm_wakeup_clear(0); pr_info("Freezing user space processes ... "); @@ -193,7 +191,7 @@ void thaw_processes(void) trace_suspend_resume(TPS("thaw_processes"), 0, true); if (pm_freezing) - atomic_dec(&system_freezing_cnt); + static_branch_dec(&freezer_active); pm_freezing = false; pm_nosig_freezing = false; --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -288,7 +288,7 @@ static int ptrace_check_attach(struct ta } __set_current_state(TASK_RUNNING); - if (!wait_task_inactive(child, TASK_TRACED) || + if (!wait_task_inactive(child, TASK_TRACED|TASK_FREEZABLE) || !ptrace_freeze_traced(child)) return -ESRCH; --- a/kernel/sched/completion.c +++ b/kernel/sched/completion.c @@ -247,6 +247,15 @@ int __sched wait_for_completion_killable } EXPORT_SYMBOL(wait_for_completion_killable); +int __sched wait_for_completion_state(struct completion *x, unsigned int state) +{ + long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, state); + if (t == -ERESTARTSYS) + return t; + return 0; +} +EXPORT_SYMBOL(wait_for_completion_state); + /** * wait_for_completion_killable_timeout: - waits for completion of a task (w/(to,killable)) * @x: holds the state of this particular completion --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3260,6 +3260,19 @@ int migrate_swap(struct task_struct *cur } #endif /* CONFIG_NUMA_BALANCING */ +static inline bool __wti_match(struct task_struct *p, unsigned int match_state) +{ + unsigned int state = READ_ONCE(p->__state); + + if ((match_state & TASK_FREEZABLE) && state == TASK_FROZEN) + return true; + + if (state == (match_state & ~TASK_FREEZABLE)) + return true; + + return false; +} + /* * wait_task_inactive - wait for a thread to unschedule. * @@ -3304,7 +3317,7 @@ unsigned long wait_task_inactive(struct * is actually now running somewhere else! */ while (task_running(rq, p)) { - if (match_state && unlikely(READ_ONCE(p->__state) != match_state)) + if (match_state && !__wti_match(p, match_state)) return 0; cpu_relax(); } @@ -3319,7 +3332,7 @@ unsigned long wait_task_inactive(struct running = task_running(rq, p); queued = task_on_rq_queued(p); ncsw = 0; - if (!match_state || READ_ONCE(p->__state) == match_state) + if (!match_state || __wti_match(p, match_state)) ncsw = p->nvcsw | LONG_MIN; /* sets MSB */ task_rq_unlock(rq, p, &rf); @@ -6320,7 +6333,7 @@ static void __sched notrace __schedule(u prev->sched_contributes_to_load = (prev_state & TASK_UNINTERRUPTIBLE) && !(prev_state & TASK_NOLOAD) && - !(prev->flags & PF_FROZEN); + !(prev_state & TASK_FROZEN); if (prev->sched_contributes_to_load) rq->nr_uninterruptible++; --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2321,7 +2321,7 @@ static int ptrace_stop(int exit_code, in clear_traced_quiesce(); preempt_enable_no_resched(); - freezable_schedule(); + schedule(); cgroup_leave_frozen(true); } else { @@ -2514,7 +2514,7 @@ static bool do_signal_stop(int signr) /* Now we don't run again until woken by SIGCONT or SIGKILL */ cgroup_enter_frozen(); - freezable_schedule(); + schedule(); return true; } else { /* @@ -2589,11 +2589,11 @@ static void do_freezer_trap(void) * immediately (if there is a non-fatal signal pending), and * put the task into sleep. */ - __set_current_state(TASK_INTERRUPTIBLE); + __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); clear_thread_flag(TIF_SIGPENDING); spin_unlock_irq(¤t->sighand->siglock); cgroup_enter_frozen(); - freezable_schedule(); + schedule(); } static int ptrace_signal(int signr, kernel_siginfo_t *info, enum pid_type type) @@ -3639,9 +3639,9 @@ static int do_sigtimedwait(const sigset_ recalc_sigpending(); spin_unlock_irq(&tsk->sighand->siglock); - __set_current_state(TASK_INTERRUPTIBLE); - ret = freezable_schedule_hrtimeout_range(to, tsk->timer_slack_ns, - HRTIMER_MODE_REL); + __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); + ret = schedule_hrtimeout_range(to, tsk->timer_slack_ns, + HRTIMER_MODE_REL); spin_lock_irq(&tsk->sighand->siglock); __set_task_blocked(tsk, &tsk->real_blocked); sigemptyset(&tsk->real_blocked); --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -2037,11 +2037,11 @@ static int __sched do_nanosleep(struct h struct restart_block *restart; do { - set_current_state(TASK_INTERRUPTIBLE); + set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); hrtimer_sleeper_start_expires(t, mode); if (likely(t->task)) - freezable_schedule(); + schedule(); hrtimer_cancel(&t->timer); mode = HRTIMER_MODE_ABS; --- a/kernel/umh.c +++ b/kernel/umh.c @@ -404,6 +404,7 @@ EXPORT_SYMBOL(call_usermodehelper_setup) */ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait) { + unsigned int state = TASK_UNINTERRUPTIBLE; DECLARE_COMPLETION_ONSTACK(done); int retval = 0; @@ -437,25 +438,22 @@ int call_usermodehelper_exec(struct subp if (wait == UMH_NO_WAIT) /* task has freed sub_info */ goto unlock; + if (wait & UMH_KILLABLE) + state |= TASK_KILLABLE; + if (wait & UMH_FREEZABLE) - freezer_do_not_count(); + state |= TASK_FREEZABLE; - if (wait & UMH_KILLABLE) { - retval = wait_for_completion_killable(&done); - if (!retval) - goto wait_done; + retval = wait_for_completion_state(&done, state); + if (!retval) + goto wait_done; + if (wait & UMH_KILLABLE) { /* umh_complete() will see NULL and free sub_info */ if (xchg(&sub_info->complete, NULL)) goto unlock; - /* fallthrough, umh_complete() was already called */ } - wait_for_completion(&done); - - if (wait & UMH_FREEZABLE) - freezer_count(); - wait_done: retval = sub_info->retval; out: --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -780,8 +780,8 @@ static void khugepaged_alloc_sleep(void) DEFINE_WAIT(wait); add_wait_queue(&khugepaged_wait, &wait); - freezable_schedule_timeout_interruptible( - msecs_to_jiffies(khugepaged_alloc_sleep_millisecs)); + __set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); + schedule_timeout(msecs_to_jiffies(khugepaged_alloc_sleep_millisecs)); remove_wait_queue(&khugepaged_wait, &wait); } --- a/net/sunrpc/sched.c +++ b/net/sunrpc/sched.c @@ -268,7 +268,7 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode) { - freezable_schedule_unsafe(); + schedule(); if (signal_pending_state(mode, current)) return -ERESTARTSYS; return 0; @@ -332,14 +332,12 @@ static int rpc_complete_task(struct rpc_ * to enforce taking of the wq->lock and hence avoid races with * rpc_complete_task(). */ -int __rpc_wait_for_completion_task(struct rpc_task *task, wait_bit_action_f *action) +int rpc_wait_for_completion_task(struct rpc_task *task) { - if (action == NULL) - action = rpc_wait_bit_killable; return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE, - action, TASK_KILLABLE); + rpc_wait_bit_killable, TASK_KILLABLE|TASK_FREEZABLE_UNSAFE); } -EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task); +EXPORT_SYMBOL_GPL(rpc_wait_for_completion_task); /* * Make an RPC task runnable. @@ -963,7 +961,7 @@ static void __rpc_execute(struct rpc_tas trace_rpc_task_sync_sleep(task, task->tk_action); status = out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_QUEUED, rpc_wait_bit_killable, - TASK_KILLABLE); + TASK_KILLABLE|TASK_FREEZABLE); if (status < 0) { /* * When a sync task receives a signal, it exits with --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -2530,13 +2530,14 @@ static long unix_stream_data_wait(struct struct sk_buff *last, unsigned int last_len, bool freezable) { + unsigned int state = TASK_INTERRUPTIBLE | freezable * TASK_FREEZABLE; struct sk_buff *tail; DEFINE_WAIT(wait); unix_state_lock(sk); for (;;) { - prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE); + prepare_to_wait(sk_sleep(sk), &wait, state); tail = skb_peek_tail(&sk->sk_receive_queue); if (tail != last || @@ -2549,10 +2550,7 @@ static long unix_stream_data_wait(struct sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); unix_state_unlock(sk); - if (freezable) - timeo = freezable_schedule_timeout(timeo); - else - timeo = schedule_timeout(timeo); + timeo = schedule_timeout(timeo); unix_state_lock(sk); if (sock_flag(sk, SOCK_DEAD))