From patchwork Wed Apr 20 20:54:15 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Eric W. Biederman" X-Patchwork-Id: 565562 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0C37DC433EF for ; Wed, 20 Apr 2022 20:54:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1382289AbiDTU5h (ORCPT ); Wed, 20 Apr 2022 16:57:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37942 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238578AbiDTU5g (ORCPT ); Wed, 20 Apr 2022 16:57:36 -0400 Received: from out01.mta.xmission.com (out01.mta.xmission.com [166.70.13.231]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 725F72AC65; Wed, 20 Apr 2022 13:54:47 -0700 (PDT) Received: from in01.mta.xmission.com ([166.70.13.51]:38704) by out01.mta.xmission.com with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1nhHLZ-00GAGP-L9; Wed, 20 Apr 2022 14:54:45 -0600 Received: from ip68-227-174-4.om.om.cox.net ([68.227.174.4]:35090 helo=email.froward.int.ebiederm.org.xmission.com) by in01.mta.xmission.com with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1nhHLW-005fp0-8f; Wed, 20 Apr 2022 14:54:45 -0600 From: "Eric W. Biederman" To: Oleg Nesterov Cc: Peter Zijlstra , rjw@rjwysocki.net, mingo@kernel.org, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, mgorman@suse.de, bigeasy@linutronix.de, Will Deacon , linux-kernel@vger.kernel.org, tj@kernel.org, linux-pm@vger.kernel.org References: <20220413192053.GY2731@worktop.programming.kicks-ass.net> <20220413195612.GC2762@worktop.programming.kicks-ass.net> <20220414115410.GA32752@redhat.com> <20220414183433.GC32752@redhat.com> <20220415101644.GA10421@redhat.com> <20220415105755.GA15217@redhat.com> <20220418170104.GA16199@redhat.com> <20220420131731.GF2731@worktop.programming.kicks-ass.net> <20220420180323.GA14947@redhat.com> Date: Wed, 20 Apr 2022 15:54:15 -0500 In-Reply-To: <20220420180323.GA14947@redhat.com> (Oleg Nesterov's message of "Wed, 20 Apr 2022 20:03:23 +0200") Message-ID: <875yn3zdag.fsf_-_@email.froward.int.ebiederm.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) MIME-Version: 1.0 X-XM-SPF: eid=1nhHLW-005fp0-8f; ; ; mid=<875yn3zdag.fsf_-_@email.froward.int.ebiederm.org>; ; ; hst=in01.mta.xmission.com; ; ; ip=68.227.174.4; ; ; frm=ebiederm@xmission.com; ; ; spf=softfail X-XM-AID: U2FsdGVkX1+Ovwzz2K7RgEt23h9OPCR7KLC7E09SQ+A= X-SA-Exim-Connect-IP: 68.227.174.4 X-SA-Exim-Mail-From: ebiederm@xmission.com Subject: [RFC][PATCH] ptrace: Don't change __state X-SA-Exim-Version: 4.2.1 (built Sat, 08 Feb 2020 21:53:50 +0000) X-SA-Exim-Scanned: Yes (on in01.mta.xmission.com) Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org I was thinking about this and I have an approach from a different direction. In particular it removes the need for ptrace_freeze_attach and ptrace_unfreeze_attach to change __state. Instead a jobctl bit is used to suppress waking up a process with TASK_WAKEKILL. I think this would be a good technique to completely decouple PREEMPT_RT from the work that ptrace_freeze_attach does. Comments? Eric --- The games ptrace_freeze_traced and ptrace_unfreeze_traced have been playing with __state to remove TASK_WAKEKILL from __state while a ptrace command is executing. Instead add JOBCTL_DELAY_WAKEILL and set it in ptrace_freeze_task and clear it in ptrace_unfreeze_task and the final exit of ptrace_stop. Then in signal_wake_up_state drop TASK_WAKEKILL if it is sent while JOBCTL_DELAY_WAKEKILL is set. As signal_wake_up is the only function that sets TASK_WAKEKILL when waking a process this achives the same effect with less messing with state. Signed-off-by: "Eric W. Biederman" As a proof of concept this seems to work most of the time, unfortunately I have a bug somewhere (a memory stomp?) and I am hitting a BUG_ON that asserts __send_signal is not holding the siglock. [ 97.000168] ------------[ cut here ]------------ [ 97.001782] kernel BUG at kernel/signal.c:1086! [ 97.002539] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 97.002846] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.18.0-rc1userns+ #449 [ 97.003135] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 97.003480] RIP: 0010:__send_signal+0x256/0x580 [ 97.003812] Code: f0 ff ff 49 89 c7 48 85 c0 0f 85 a8 fe ff ff 41 bf 04 00 00 00 44 8d 45 ff e9 f8 fe ff ff 45 31 ff 40 [ 97.003812] RSP: 0018:ffffc900000c8e88 EFLAGS: 00000046 [ 97.003812] RAX: 0000000000000000 RBX: ffff88800875df00 RCX: 0000000000000001 [ 97.003812] RDX: ffff88800875df00 RSI: 0000000000000001 RDI: 000000000000000e [ 97.003812] RBP: 000000000000000e R08: 0000000000000001 R09: 0000000000000000 [ 97.003812] R10: 0000000000000000 R11: ffffc900000c8ff8 R12: 0000000000000000 [ 97.003812] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff8115f3f3 [ 97.003812] FS: 0000000000000000(0000) GS:ffff8880bcb00000(0000) knlGS:0000000000000000 [ 97.003812] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 97.003812] CR2: 00007ffff739c830 CR3: 00000000094d6000 CR4: 00000000000006e0 [ 97.003812] Call Trace: [ 97.003812] [ 97.003812] group_send_sig_info+0xe1/0x190 [ 97.003812] kill_pid_info+0x78/0x150 [ 97.003812] it_real_fn+0x35/0xe0 [ 97.003812] ? __x64_sys_setitimer+0x150/0x150 [ 97.003812] __hrtimer_run_queues+0x1ca/0x3f0 [ 97.003812] hrtimer_interrupt+0x106/0x220 [ 97.003812] __sysvec_apic_timer_interrupt+0x96/0x260 [ 97.003812] sysvec_apic_timer_interrupt+0x9d/0xd0 [ 97.003812] [ 97.003812] [ 97.003812] asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 97.003812] RIP: 0010:default_idle+0x10/0x20 [ 97.003812] Code: 8b 04 25 00 b0 01 00 f0 80 60 02 df c3 0f ae f0 0f ae 38 0f ae f0 eb b9 66 90 0f 1f 44 00 00 eb 07 03 [ 97.003812] RSP: 0018:ffffc90000093ef8 EFLAGS: 00000246 [ 97.003812] RAX: ffffffff823fafb0 RBX: ffff8880056f97c0 RCX: 0000000000000000 [ 97.003812] RDX: ffffffff81193e1d RSI: ffffffff82da5301 RDI: ffffffff823fb111 [ 97.003812] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 [ 97.003812] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [ 97.003812] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 97.003812] ? mwait_idle+0x70/0x70 [ 97.003812] ? nohz_balance_enter_idle+0x6d/0x220 [ 97.003812] ? default_idle_call+0x21/0x90 [ 97.003812] ? mwait_idle+0x70/0x70 [ 97.003812] default_idle_call+0x59/0x90 [ 97.003812] do_idle+0x1f3/0x260 [ 97.003812] ? finish_task_switch.isra.0+0xef/0x350 [ 97.003812] cpu_startup_entry+0x19/0x20 [ 97.003812] secondary_startup_64_no_verify+0xc3/0xcb [ 97.003812] [ 97.003812] Modules linked in: [ 97.003812] ---[ end trace 0000000000000000 ]--- [ 97.003812] RIP: 0010:__send_signal+0x256/0x580 [ 97.003812] Code: f0 ff ff 49 89 c7 48 85 c0 0f 85 a8 fe ff ff 41 bf 04 00 00 00 44 8d 45 ff e9 f8 fe ff ff 45 31 ff 40 [ 97.003812] RSP: 0018:ffffc900000c8e88 EFLAGS: 00000046 [ 97.003812] RAX: 0000000000000000 RBX: ffff88800875df00 RCX: 0000000000000001 [ 97.003812] RDX: ffff88800875df00 RSI: 0000000000000001 RDI: 000000000000000e [ 97.003812] RBP: 000000000000000e R08: 0000000000000001 R09: 0000000000000000 [ 97.003812] R10: 0000000000000000 R11: ffffc900000c8ff8 R12: 0000000000000000 [ 97.003812] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff8115f3f3 [ 97.003812] FS: 0000000000000000(0000) GS:ffff8880bcb00000(0000) knlGS:0000000000000000 [ 97.003812] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 97.003812] CR2: 00007ffff739c830 CR3: 00000000094d6000 CR4: 00000000000006e0 [ 97.003812] Kernel panic - not syncing: Fatal exception in interrupt [ 97.003812] Kernel Offset: disabled [ 97.003812] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- --- include/linux/sched/jobctl.h | 2 ++ kernel/ptrace.c | 32 ++++++++++++++------------------ kernel/signal.c | 5 +++++ 3 files changed, 21 insertions(+), 18 deletions(-) diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h index fa067de9f1a9..4e154ad8205f 100644 --- a/include/linux/sched/jobctl.h +++ b/include/linux/sched/jobctl.h @@ -19,6 +19,7 @@ struct task_struct; #define JOBCTL_TRAPPING_BIT 21 /* switching to TRACED */ #define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */ #define JOBCTL_TRAP_FREEZE_BIT 23 /* trap for cgroup freezer */ +#define JOBCTL_DELAY_WAKEKILL_BIT 24 /* delay killable wakeups */ #define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT) #define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT) @@ -28,6 +29,7 @@ struct task_struct; #define JOBCTL_TRAPPING (1UL << JOBCTL_TRAPPING_BIT) #define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT) #define JOBCTL_TRAP_FREEZE (1UL << JOBCTL_TRAP_FREEZE_BIT) +#define JOBCTL_DELAY_WAKEKILL (1UL << JOBCTL_DELAY_WAKEKILL_BIT) #define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY) #define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK) diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 4f78d0d5940c..a55320a0e8e3 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -197,7 +197,8 @@ static bool ptrace_freeze_traced(struct task_struct *task) spin_lock_irq(&task->sighand->siglock); if (task_is_traced(task) && !looks_like_a_spurious_pid(task) && !__fatal_signal_pending(task)) { - WRITE_ONCE(task->__state, __TASK_TRACED); + //WARN_ON((task->jobctl & JOBCTL_DELAY_WAKEKILL)); + task->jobctl |= JOBCTL_DELAY_WAKEKILL; ret = true; } spin_unlock_irq(&task->sighand->siglock); @@ -207,7 +208,7 @@ static bool ptrace_freeze_traced(struct task_struct *task) static void ptrace_unfreeze_traced(struct task_struct *task) { - if (READ_ONCE(task->__state) != __TASK_TRACED) + if (!task_is_traced(task)) return; WARN_ON(!task->ptrace || task->parent != current); @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task) * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely. * Recheck state under the lock to close this race. */ - spin_lock_irq(&task->sighand->siglock); - if (READ_ONCE(task->__state) == __TASK_TRACED) { - if (__fatal_signal_pending(task)) - wake_up_state(task, __TASK_TRACED); - else - WRITE_ONCE(task->__state, TASK_TRACED); - } + spin_unlock_irq(&task->sighand->siglock); + WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL)); + task->jobctl &= ~JOBCTL_DELAY_WAKEKILL; + if (fatal_signal_pending(task)) + wake_up_state(task, TASK_WAKEKILL); spin_unlock_irq(&task->sighand->siglock); } @@ -256,7 +255,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) */ read_lock(&tasklist_lock); if (child->ptrace && child->parent == current) { - WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED); + WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL); /* * child->sighand can't be NULL, release_task() * does ptrace_unlink() before __exit_signal(). @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) read_unlock(&tasklist_lock); if (!ret && !ignore_state) { - if (!wait_task_inactive(child, __TASK_TRACED)) { + if (!wait_task_inactive(child, TASK_TRACED)) { /* * This can only happen if may_ptrace_stop() fails and * ptrace_stop() changes ->state back to TASK_RUNNING, - * so we should not worry about leaking __TASK_TRACED. + * so we should not worry about leaking JOBCTL_DELAY_WAKEKILL. */ - WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED); + WARN_ON(!(child->jobctl & JOBCTL_DELAY_WAKEKILL)); ret = -ESRCH; } } @@ -892,7 +891,6 @@ static int ptrace_resume(struct task_struct *child, long request, * status and clears the code too; this can't race with the tracee, it * takes siglock after resume. */ - need_siglock = data && !thread_group_empty(current); if (need_siglock) spin_lock_irq(&child->sighand->siglock); child->exit_code = data; @@ -1325,8 +1323,7 @@ SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr, goto out_put_task_struct; ret = arch_ptrace(child, request, addr, data); - if (ret || request != PTRACE_DETACH) - ptrace_unfreeze_traced(child); + ptrace_unfreeze_traced(child); out_put_task_struct: put_task_struct(child); @@ -1468,8 +1465,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_long_t, request, compat_long_t, pid, request == PTRACE_INTERRUPT); if (!ret) { ret = compat_arch_ptrace(child, request, addr, data); - if (ret || request != PTRACE_DETACH) - ptrace_unfreeze_traced(child); + ptrace_unfreeze_traced(child); } out_put_task_struct: diff --git a/kernel/signal.c b/kernel/signal.c index 30cd1ca43bcd..a1277580c92e 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -762,6 +762,10 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info) */ void signal_wake_up_state(struct task_struct *t, unsigned int state) { + /* Suppress wakekill? */ + if (t->jobctl & JOBCTL_DELAY_WAKEKILL) + state &= ~TASK_WAKEKILL; + set_tsk_thread_flag(t, TIF_SIGPENDING); /* * TASK_WAKEKILL also means wake it up in the stopped/traced/killable @@ -2328,6 +2332,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code, /* LISTENING can be set only during STOP traps, clear it */ current->jobctl &= ~JOBCTL_LISTENING; + current->jobctl &= ~JOBCTL_DELAY_WAKEKILL; /* * Queued signals ignored us while we were stopped for tracing.