
[2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT

Message ID 20220412114853.842942162@infradead.org
State New
Series ptrace-vs-PREEMPT_RT and freezer rewrite

Commit Message

Peter Zijlstra April 12, 2022, 11:44 a.m. UTC
Currently ptrace_check_attach() / ptrace_freeze_traced() rely on
specific scheduler behaviour to freeze the task. Specifically, they
rely on there being no blocking between:

  set_special_state(TASK_TRACED);
  ...
  schedule();

that is, they rely on being able to change p->__state between those
two points (to clear/set TASK_WAKEKILL) and on wait_task_inactive()
to ensure the task has hit that schedule().
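
For reference, the pre-rewrite code being described looks roughly like
this (lightly trimmed from kernel/ptrace.c before this series): it pokes
p->__state under siglock and trusts that the tracee is sitting in that
schedule():

	static bool ptrace_freeze_traced(struct task_struct *task)
	{
		bool ret = false;

		/* Lockless, nobody but us can set this flag */
		if (task->jobctl & JOBCTL_LISTENING)
			return ret;

		spin_lock_irq(&task->sighand->siglock);
		if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
		    !__fatal_signal_pending(task)) {
			/* drop TASK_WAKEKILL out of TASK_TRACED */
			WRITE_ONCE(task->__state, __TASK_TRACED);
			ret = true;
		}
		spin_unlock_irq(&task->sighand->siglock);

		return ret;
	}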

However, PREEMPT_RT is breaking this by introducing sleeping
spinlocks. The consequence of sleeping spinlocks is that p->__state
can change (also see p->saved_state) and that a task can be inactive
(off the runqueue) and *NOT* be at the schedule() as expected.
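
For readers without an RT tree handy: blocking on a sleeping spinlock
saves and overwrites the task state roughly like this (paraphrasing
current_save_and_set_rtlock_wait_state() and its restore counterpart):

	/* what an RT spinlock_t does when it has to block (simplified) */
	raw_spin_lock(&current->pi_lock);
	current->saved_state = current->__state;	/* e.g. TASK_TRACED */
	WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT);
	raw_spin_unlock(&current->pi_lock);

	/* ... block on the lock; the wakeup restores ->saved_state ... */

leaving the task off the runqueue with __state == TASK_RTLOCK_WAIT rather
than __TASK_TRACED.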

That is, both the p->__state and wait_task_inactive() usage are
broken.

In order to avoid having to deal with p->saved_state, flip the order
of the wait_task_inactive() and the p->__state poking. That is, first
wait for the victim to be in schedule() and only then poke p->__state,
which is in a known state by then.

The only problem with this is that it's possible to race with a
concurrent ptrace_detach()+ptrace_attach() and land back in the same
situation as before. To deal with this, compare the task's nvcsw
count to make sure there was no scheduling between the initial
quiescence and the final task state poking.
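
Condensed, the reordered flow looks like the sketch below; __ptrace_freeze()
and task_call_func() appear in the patch hunks quoted in the replies, the
glue around them here is illustrative only:

	int ret;
	unsigned long nvcsw;

	/* 1) get the victim to a stable point first: off the CPU, off the
	 *    runqueue, sitting in schedule() */
	wait_task_inactive(child, 0);
	nvcsw = child->nvcsw;		/* snapshot the context switch count */

	/* 2) only then poke __state, re-validating the snapshot */
	spin_lock_irq(&child->sighand->siglock);
	ret = task_call_func(child, __ptrace_freeze, &nvcsw);
	spin_unlock_irq(&child->sighand->siglock);

	/* __ptrace_freeze() fails with -ESRCH when child->nvcsw moved,
	 * i.e. the child scheduled again since the snapshot */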

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/ptrace.c     |  175 +++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/core.c |    5 -
 2 files changed, 141 insertions(+), 39 deletions(-)

Comments

Oleg Nesterov April 13, 2022, 1:24 p.m. UTC | #1
Hi Peter,

I like 1-2 but I need to read them (and other patches) again, a
couple of nits right now.

On 04/12, Peter Zijlstra wrote:
>
> +static int __ptrace_freeze_cond(struct task_struct *p)
> +{
> +	if (!task_is_traced(p))
> +		return -ESRCH;

	if (!task_is_traced(p) || p->parent != current)
		return -ESRCH;

we should not spin/sleep if it is traced by another task

> +static int __ptrace_freeze(struct task_struct *p, void *arg)
> +{
> +	int ret;
> +
> +	ret = __ptrace_freeze_cond(p);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Task scheduled between __ptrace_pre_freeze() and here, not our task
> +	 * anymore.
> +	 */
> +	if (*(unsigned long *)arg != p->nvcsw)
> +		return -ESRCH;
> +
> +	if (looks_like_a_spurious_pid(p))
> +		return -ESRCH;

Oh, I do not think __ptrace_freeze() should check for spurious pid...
looks_like_a_spurious_pid() should be called once in ptrace_check_attach()
before task_call_func(__ptrace_freeze).

Oleg.
Peter Zijlstra April 13, 2022, 4:58 p.m. UTC | #2
On Wed, Apr 13, 2022 at 03:24:52PM +0200, Oleg Nesterov wrote:
> Hi Peter,
> 
> I like 1-2 but I need to read them (and other patches) again, a
> couple of nits right now.
> 
> On 04/12, Peter Zijlstra wrote:
> >
> > +static int __ptrace_freeze_cond(struct task_struct *p)
> > +{
> > +	if (!task_is_traced(p))
> > +		return -ESRCH;
> 
> 	if (!task_is_traced(p) || p->parent != current)
> 		return -ESRCH;
> 
> we should not spin/sleep if it is traced by another task

Yes, fair enough. And I suppose doing this test without holding siglock
is safe enough.

> > +static int __ptrace_freeze(struct task_struct *p, void *arg)
> > +{
> > +	int ret;
> > +
> > +	ret = __ptrace_freeze_cond(p);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/*
> > +	 * Task scheduled between __ptrace_pre_freeze() and here, not our task
> > +	 * anymore.
> > +	 */
> > +	if (*(unsigned long *)arg != p->nvcsw)
> > +		return -ESRCH;
> > +
> > +	if (looks_like_a_spurious_pid(p))
> > +		return -ESRCH;
> 
> Oh, I do not think __ptrace_freeze() should check for spurious pid...
> looks_like_a_spurious_pid() should be called once in ptrace_check_attach()
> before task_call_func(__ptrace_freeze).

I can certainly do that, but since that needs to be done with siglock held,
and the __ptrace_freeze call is a one-time affair, I didn't really see
the point in making the code more complicated.

Something like so then?

--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -222,7 +222,7 @@ static void ptrace_unfreeze_traced(struc
  */
 static int __ptrace_freeze_cond(struct task_struct *p)
 {
-	if (!task_is_traced(p))
+	if (!task_is_traced(p) || p->parent != current)
 		return -ESRCH;
 
 	if (task_curr(p))
@@ -283,9 +283,6 @@ static int __ptrace_freeze(struct task_s
 	if (*(unsigned long *)arg != p->nvcsw)
 		return -ESRCH;
 
-	if (looks_like_a_spurious_pid(p))
-		return -ESRCH;
-
 	if (__fatal_signal_pending(p))
 		return -ESRCH;
 
@@ -378,6 +375,9 @@ static int ptrace_check_attach(struct ta
 		 * does ptrace_unlink() before __exit_signal().
 		 */
 		spin_lock_irq(&child->sighand->siglock);
+		if (looks_like_a_spurious_pid(child))
+			goto unlock_sig;
+
 		ret = task_call_func(child, __ptrace_freeze, &nvcsw);
 		if (ret) {
 			/*
@@ -386,6 +386,7 @@ static int ptrace_check_attach(struct ta
 			 */
 			ret = -ESRCH;
 		}
+unlock_sig:
 		spin_unlock_irq(&child->sighand->siglock);
 	}
 unlock:
Oleg Nesterov April 13, 2022, 6:57 p.m. UTC | #3
On 04/13, Oleg Nesterov wrote:
>
> I like 1-2 but I need to read them (and other patches) again, a
> couple of nits right now.

Sorry, didn't have time to do this today, and now I am already sleeping.

But... on a second thought, it seems there is a better solution. If nothing
else it is simpler and doesn't duplicate the wait_task_inactive() logic.

How about the patch below instead? On top of 1/5.

Yes, yes, incomplete. In particular see the "!!!!!!!!!" comments. Just to
explain the idea.

Oleg.
Oleg Nesterov April 13, 2022, 6:59 p.m. UTC | #4
On 04/13, Oleg Nesterov wrote:
>
> On 04/13, Oleg Nesterov wrote:
> >
> > I like 1-2 but I need to read them (and other patches) again, a
> > couple of nits right now.
>
> Sorry, didn't have time to do this today, and now I am already sleeping.
>
> But... on a second thought, it seems there is a better solution. If nothing
> else it is simpler and doesn't duplicate the wait_task_inactive() logic.
>
> How about the patch below instead? On top of 1/5.
>
> Yes, yes, incomplete. In particular see the "!!!!!!!!!" comments. Just to
> explain the idea.

Cough. Forgot to attach the patch, sorry for the noise.

Oleg.
---

diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index ec8b312f7506..1b5a57048e13 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -22,7 +22,8 @@ struct task_struct;
 
 #define JOBCTL_STOPPED_BIT	24
 #define JOBCTL_TRACED_BIT	25
-#define JOBCTL_TRACED_FROZEN_BIT 26
+#define JOBCTL_TRACED_XXX_BIT	25
+#define JOBCTL_TRACED_FROZEN_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -35,6 +36,7 @@ struct task_struct;
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_XXX	(1UL << JOBCTL_TRACED_XXX_BIT)
 #define JOBCTL_TRACED_FROZEN	(1UL << JOBCTL_TRACED_FROZEN_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 626f96d275c7..86b5226e6ba2 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -255,6 +255,19 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
 	int ret = -ESRCH;
 
+	if (!(child->ptrace && child->parent == current))
+		return ret;
+
+	if (ignore_state)
+		return 0;
+
+	if (wait_on_bit(&child->jobctl, JOBCTL_TRACED_XXX_BIT, TASK_KILLABLE))
+		return -EINTR;
+	// now that the tracee cleared JOBCTL_TRACED_XXX_BIT
+	// wait_task_inactive() should succeed or fail "really soon".
+	if (!wait_task_inactive(child, __TASK_TRACED))
+		return ret;
+
 	/*
 	 * We take the read lock around doing both checks to close a
 	 * possible race where someone else was tracing our child and
@@ -269,23 +282,11 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
 		 */
-		if (ignore_state || ptrace_freeze_traced(child))
+		if (ptrace_freeze_traced(child))
 			ret = 0;
 	}
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
-		}
-	}
-
 	return ret;
 }
 
diff --git a/kernel/signal.c b/kernel/signal.c
index 0aea3f0a8002..5ca6235e5231 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2220,7 +2220,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= (JOBCTL_TRACED | JOBCTL_TRACED_XXX);
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2291,6 +2291,10 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		preempt_disable();
 		read_unlock(&tasklist_lock);
 		cgroup_enter_frozen();
+		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+		// wrong, needs siglock
+		current->jobctl &= ~JOBCTL_TRACED_XXX;
+		wake_up_bit(&current->jobctl, JOBCTL_TRACED_XXX_BIT);
 		preempt_enable_no_resched();
 		freezable_schedule();
 		cgroup_leave_frozen(true);
@@ -2308,6 +2312,8 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
 
+		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+		// need to clear JOBCTL_TRACED_XXX
 		/* tasklist protects us from ptrace_freeze_traced() */
 		__set_current_state(TASK_RUNNING);
 		read_code = false;
Peter Zijlstra April 13, 2022, 7:20 p.m. UTC | #5
On Wed, Apr 13, 2022 at 08:59:10PM +0200, Oleg Nesterov wrote:


> +		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> +		// wrong, needs siglock
> +		current->jobctl &= ~JOBCTL_TRACED_XXX;
> +		wake_up_bit(&current->jobctl, JOBCTL_TRACED_XXX_BIT);
		  __wake_up_common_lock()
		    spin_lock_irqsave()
		      current_save_and_set_rtlock_wait_state();

			> +	if (wait_on_bit(&child->jobctl, JOBCTL_TRACED_XXX_BIT, TASK_KILLABLE))
			> +		return -EINTR;
			> +	// now that the tracee cleared JOBCTL_TRACED_XXX_BIT
			> +	// wait_task_inactive() should succeed or fail "really soon".
			> +	if (!wait_task_inactive(child, __TASK_TRACED))
			> +		return ret;


	*whoopsie* ?

>  		preempt_enable_no_resched();
>  		freezable_schedule();
>  		cgroup_leave_frozen(true);
Peter Zijlstra April 13, 2022, 7:56 p.m. UTC | #6
On Wed, Apr 13, 2022 at 09:20:53PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 13, 2022 at 08:59:10PM +0200, Oleg Nesterov wrote:
> 
> 
> > +		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > +		// wrong, needs siglock
> > +		current->jobctl &= ~JOBCTL_TRACED_XXX;
> > +		wake_up_bit(&current->jobctl, JOBCTL_TRACED_XXX_BIT);
> 		  __wake_up_common_lock()
> 		    spin_lock_irqsave()
> 		      current_save_and_set_rtlock_wait_state();
> 
> 			> +	if (wait_on_bit(&child->jobctl, JOBCTL_TRACED_XXX_BIT, TASK_KILLABLE))
> 			> +		return -EINTR;
> 			> +	// now that the tracee cleared JOBCTL_TRACED_XXX_BIT
> 			> +	// wait_task_inactive() should succeed or fail "really soon".
> 			> +	if (!wait_task_inactive(child, __TASK_TRACED))
> 			> +		return ret;
> 
> 
> 	*whoopsie* ?
> 
> >  		preempt_enable_no_resched();
> >  		freezable_schedule();
> >  		cgroup_leave_frozen(true);

Something that might work; since there's only the one ptracer, is to
instead do something like:

	current->jobctl &= ~JOBCTL_TRACED_XXX; // siglock
	if (current->ptrace)
		wake_up_process(current->parent);
	preempt_enable_no_resched();
	schedule();


vs

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!(p->jobctl & JOBCTL_TRACED_XXX))
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

	if (!wait_task_inactive(...))


ptrace_detach() needs some additional magic as well I think, but this
might just work.
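
(The two fragments above are the classic sleep/wakeup pairing; spelled out
with the ordering that makes it safe against a lost wakeup. A sketch, with
JOBCTL_TRACED_XXX as the placeholder flag name used in this thread:)

	/*
	 * tracee (under siglock):     tracer:
	 *
	 *  jobctl &= ~TRACED_XXX;      for (;;) {
	 *  wake_up_process(parent);      set_current_state(TASK_UNINTERRUPTIBLE);
	 *                                if (!(child->jobctl & TRACED_XXX))
	 *                                        break;
	 *                                schedule();
	 *                              }
	 *                              __set_current_state(TASK_RUNNING);
	 *
	 * The tracer sets its state *before* testing the flag: if the
	 * clear+wakeup sneaks in between the test and schedule(), the
	 * wake_up_process() puts the tracer back to TASK_RUNNING and the
	 * schedule() returns immediately instead of sleeping forever.
	 */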
Oleg Nesterov April 14, 2022, 11:54 a.m. UTC | #7
On 04/13, Peter Zijlstra wrote:
>
> On Wed, Apr 13, 2022 at 09:20:53PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 13, 2022 at 08:59:10PM +0200, Oleg Nesterov wrote:
> >
> >
> > > +		// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > +		// wrong, needs siglock
> > > +		current->jobctl &= ~JOBCTL_TRACED_XXX;
> > > +		wake_up_bit(&current->jobctl, JOBCTL_TRACED_XXX_BIT);
> > 		  __wake_up_common_lock()
> > 		    spin_lock_irqsave()
> > 		      current_save_and_set_rtlock_wait_state();

OOPS, thanks.

> Something that might work; since there's only the one ptracer, is to
> instead do something like:
>
> 	current->jobctl &= ~JOBCTL_TRACED_XXX; // siglock
> 	if (current->ptrace)
> 		wake_up_process(current->parent);
> 	preempt_enable_no_resched();
> 	schedule();
>
>
> vs
>
> 	for (;;) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		if (!(p->jobctl & JOBCTL_TRACED_XXX))
> 			break;
> 		schedule();

Yes, thanks... this is racy, see below, but probably fixeable.

> ptrace_detach() needs some additional magic as well I think, but this
> might just work.

I don't think so; JOBCTL_TRACED_XXX must always be cleared in ptrace_stop(),
and ptrace_detach() implies ptrace_check_attach().


Let's forget about the problem above for the moment. There is another problem
with my patch,

	if (!(child->ptrace && child->parent == current))
		return ret;

this check is racy without tasklist_lock, we can race with another task
attaching to our natural child (so that child->parent == current);
ptrace_attach() sets task->ptrace = flags first and changes child->parent
only after that.

In this case:

	if (ignore_state)
		return 0;

this is just wrong,

	if (wait_on_bit(&child->jobctl, JOBCTL_TRACED_XXX_BIT, TASK_KILLABLE))
		return -EINTR;

this is fine,

	if (!wait_task_inactive(child, __TASK_TRACED))

this is not right either. wait_task_inactive() can loop "forever" doing
schedule_hrtimeout() if the actual debugger stops/resumes the tracee
continuously. This is purely theoretical, but still.

And this also means that the code above needs some changes too, we can rely on
wake_up_process(current->parent).

OK, let me think about it. Thanks!

Oleg.
Oleg Nesterov April 14, 2022, 6:34 p.m. UTC | #8
On 04/14, Oleg Nesterov wrote:
>
> OK, let me think about it. Thanks!

So. what do you think about the patch below?

If it can work, then 1/5 needs some changes, I think. In particular,
it should not introduce JOBCTL_TRACED_FROZEN until 5/5, and perhaps
we can avoid this flag altogether...

This is how ptrace_check_attach() looks with the patch applied:

	static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
	{
		int traced;
		/*
		 * We take the read lock around doing both checks to close a
		 * possible race where someone else attaches or detaches our
		 * natural child.
		 */
		read_lock(&tasklist_lock);
		traced = child->ptrace && child->parent == current;
		read_unlock(&tasklist_lock);

		if (!traced)
			return -ESRCH;

		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
		if (ignore_state)
			return 0;

		for (;;) {
			if (fatal_signal_pending(current))
				return -EINTR;
			set_current_state(TASK_KILLABLE);
			if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {
				__set_current_state(TASK_RUNNING);
				break;
			}
			schedule();
		}

		if (!wait_task_inactive(child, TASK_TRACED) ||
		    !ptrace_freeze_traced(child))
			return -ESRCH;

		return 0;
	}

Oleg.
---

diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index ec8b312f7506..1b5a57048e13 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -22,7 +22,8 @@ struct task_struct;
 
 #define JOBCTL_STOPPED_BIT	24
 #define JOBCTL_TRACED_BIT	25
-#define JOBCTL_TRACED_FROZEN_BIT 26
+#define JOBCTL_TRACED_XXX_BIT	25
+#define JOBCTL_TRACED_FROZEN_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -35,6 +36,7 @@ struct task_struct;
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_XXX	(1UL << JOBCTL_TRACED_XXX_BIT)
 #define JOBCTL_TRACED_FROZEN	(1UL << JOBCTL_TRACED_FROZEN_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 626f96d275c7..5a03ae5cb0c0 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,20 +193,22 @@ static bool looks_like_a_spurious_pid(struct task_struct *task)
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
+	if (!lock_task_sighand(task, &flags))
+		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
 	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
 		task->jobctl |= JOBCTL_TRACED_FROZEN;
 		WRITE_ONCE(task->__state, __TASK_TRACED);
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
@@ -253,40 +255,39 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
-
+	int traced;
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks.  After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
+	if (!traced)
+		return -ESRCH;
+
+	WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+	if (ignore_state)
+		return 0;
+
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {
+			__set_current_state(TASK_RUNNING);
+			break;
 		}
+		schedule();
 	}
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
diff --git a/kernel/signal.c b/kernel/signal.c
index 0aea3f0a8002..684f0a0e9c71 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2182,6 +2182,14 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_xxx(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	current->jobctl &= ~JOBCTL_TRACED_XXX;
+	spin_unlock_irq(&current->sighand->siglock);
+	wake_up_state(current->parent, TASK_KILLABLE);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2220,7 +2228,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_XXX;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2282,6 +2290,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		if (gstop_done && ptrace_reparented(current))
 			do_notify_parent_cldstop(current, false, why);
 
+		clear_traced_xxx();
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
@@ -2296,9 +2305,6 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		cgroup_leave_frozen(true);
 	} else {
 		/*
-		 * By the time we got the lock, our tracer went away.
-		 * Don't drop the lock yet, another tracer may come.
-		 *
 		 * If @gstop_done, the ptracer went away between group stop
 		 * completion and here.  During detach, it would have set
 		 * JOBCTL_STOP_PENDING on us and we'll re-enter
@@ -2307,13 +2313,14 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		 */
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
+		clear_traced_xxx();
+		read_unlock(&tasklist_lock);
 
-		/* tasklist protects us from ptrace_freeze_traced() */
+		/* JOBCTL_TRACED_XXX protects us from ptrace_freeze_traced() */
 		__set_current_state(TASK_RUNNING);
 		read_code = false;
 		if (clear_code)
 			exit_code = 0;
-		read_unlock(&tasklist_lock);
 	}
 
 	/*
Peter Zijlstra April 14, 2022, 10:45 p.m. UTC | #9
On Thu, Apr 14, 2022 at 08:34:33PM +0200, Oleg Nesterov wrote:
> On 04/14, Oleg Nesterov wrote:
> >
> > OK, let me think about it. Thanks!
> 
> So. what do you think about the patch below?

Might just work, will have to look at it again though ;-) Comments
below.

> If it can work, then 1/5 needs some changes, I think. In particular,
> it should not introduce JOBCTL_TRACED_FROZEN until 5/5, and perhaps

That TRACED_FROZEN was to distinguish the TASK_TRACED and __TASK_TRACED
state, and isn't related to the freezer.

> we can avoid this flag altogether...
> 
> This is how ptrace_check_attach() looks with the patch applied:
> 
> 	static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
> 	{
> 		int traced;
> 		/*
> 		 * We take the read lock around doing both checks to close a
> 		 * possible race where someone else attaches or detaches our
> 		 * natural child.
> 		 */
> 		read_lock(&tasklist_lock);
> 		traced = child->ptrace && child->parent == current;
> 		read_unlock(&tasklist_lock);
> 
> 		if (!traced)
> 			return -ESRCH;

The thing being, that if it is our ptrace child, it won't be going away
since we're running this code and not ptrace_detach().  Right?

I failed to realize this earlier and I think that's part of why my
solution is somewhat over complicated.

> 		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> 		if (ignore_state)
> 			return 0;
> 
> 		for (;;) {
> 			if (fatal_signal_pending(current))
> 				return -EINTR;

What if signal_wake_up(.resume=true) happens here? In that case we miss
the fatal pending, and task state isn't changed yet so we'll happily go
sleep.

> 			set_current_state(TASK_KILLABLE);
> 			if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {

  TRACED_XXX ?

> 				__set_current_state(TASK_RUNNING);
> 				break;
> 			}
> 			schedule();
> 		}

That is, shouldn't we write this like:

		for (;;) {
			set_current_state(TASK_KILLABLE);
			if (fatal_signal_pending(current)) {
				ret = -EINTR;
				break;
			}
			if (!(READ_ONCE(current->jobctl) & JOBCTL_TRACED_XXX)) {
				ret = 0;
				break;
			}
			schedule();
		}
		__set_current_state(TASK_RUNNING);
		if (ret)
			return ret;

> 		if (!wait_task_inactive(child, TASK_TRACED) ||
> 		    !ptrace_freeze_traced(child))
> 			return -ESRCH;
> 
> 		return 0;
> 	}
> 
> Oleg.
> ---
> 
> diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
> index ec8b312f7506..1b5a57048e13 100644
> --- a/include/linux/sched/jobctl.h
> +++ b/include/linux/sched/jobctl.h
> @@ -22,7 +22,8 @@ struct task_struct;
>  
>  #define JOBCTL_STOPPED_BIT	24
>  #define JOBCTL_TRACED_BIT	25
> -#define JOBCTL_TRACED_FROZEN_BIT 26
> +#define JOBCTL_TRACED_XXX_BIT	25
> +#define JOBCTL_TRACED_FROZEN_BIT 27
>  
>  #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
>  #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
> @@ -35,6 +36,7 @@ struct task_struct;
>  
>  #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
>  #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
> +#define JOBCTL_TRACED_XXX	(1UL << JOBCTL_TRACED_XXX_BIT)
>  #define JOBCTL_TRACED_FROZEN	(1UL << JOBCTL_TRACED_FROZEN_BIT)
>  
>  #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 626f96d275c7..5a03ae5cb0c0 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -193,20 +193,22 @@ static bool looks_like_a_spurious_pid(struct task_struct *task)
>   */
>  static bool ptrace_freeze_traced(struct task_struct *task)
>  {
> +	unsigned long flags;
>  	bool ret = false;
>  
>  	/* Lockless, nobody but us can set this flag */
>  	if (task->jobctl & JOBCTL_LISTENING)
>  		return ret;
> +	if (!lock_task_sighand(task, &flags))
> +		return ret;
>  
> -	spin_lock_irq(&task->sighand->siglock);
>  	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
>  	    !__fatal_signal_pending(task)) {
>  		task->jobctl |= JOBCTL_TRACED_FROZEN;
>  		WRITE_ONCE(task->__state, __TASK_TRACED);
>  		ret = true;
>  	}

I would feel much better if this were still a task_func_call()
validating !->on_rq && !->on_cpu.

> -	spin_unlock_irq(&task->sighand->siglock);
> +	unlock_task_sighand(task, &flags);
>  
>  	return ret;
>  }
> @@ -253,40 +255,39 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>   */
>  static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  {
> -	int ret = -ESRCH;
> -
> +	int traced;
>  	/*
>  	 * We take the read lock around doing both checks to close a
> -	 * possible race where someone else was tracing our child and
> -	 * detached between these two checks.  After this locked check,
> -	 * we are sure that this is our traced child and that can only
> -	 * be changed by us so it's not changing right after this.
> +	 * possible race where someone else attaches or detaches our
> +	 * natural child.
>  	 */
>  	read_lock(&tasklist_lock);
> -	if (child->ptrace && child->parent == current) {
> -		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> -		/*
> -		 * child->sighand can't be NULL, release_task()
> -		 * does ptrace_unlink() before __exit_signal().
> -		 */
> -		if (ignore_state || ptrace_freeze_traced(child))
> -			ret = 0;
> -	}
> +	traced = child->ptrace && child->parent == current;
>  	read_unlock(&tasklist_lock);
>  
> -	if (!ret && !ignore_state) {
> -		if (!wait_task_inactive(child, __TASK_TRACED)) {
> -			/*
> -			 * This can only happen if may_ptrace_stop() fails and
> -			 * ptrace_stop() changes ->state back to TASK_RUNNING,
> -			 * so we should not worry about leaking __TASK_TRACED.
> -			 */
> -			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> -			ret = -ESRCH;
> +	if (!traced)
> +		return -ESRCH;
> +
> +	WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> +	if (ignore_state)
> +		return 0;
> +
> +	for (;;) {
> +		if (fatal_signal_pending(current))
> +			return -EINTR;
> +		set_current_state(TASK_KILLABLE);
> +		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {
> +			__set_current_state(TASK_RUNNING);
> +			break;
>  		}
> +		schedule();
>  	}
>  
> -	return ret;
> +	if (!wait_task_inactive(child, TASK_TRACED) ||
> +	    !ptrace_freeze_traced(child))
> +		return -ESRCH;
> +
> +	return 0;
>  }
>  
>  static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 0aea3f0a8002..684f0a0e9c71 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2182,6 +2182,14 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
>  	spin_unlock_irqrestore(&sighand->siglock, flags);
>  }
>  
> +static void clear_traced_xxx(void)
> +{
> +	spin_lock_irq(&current->sighand->siglock);
> +	current->jobctl &= ~JOBCTL_TRACED_XXX;
> +	spin_unlock_irq(&current->sighand->siglock);
> +	wake_up_state(current->parent, TASK_KILLABLE);
> +}
> +
>  /*
>   * This must be called with current->sighand->siglock held.
>   *
> @@ -2220,7 +2228,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  	 * schedule() will not sleep if there is a pending signal that
>  	 * can awaken the task.
>  	 */
> -	current->jobctl |= JOBCTL_TRACED;
> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_XXX;
>  	set_special_state(TASK_TRACED);
>  
>  	/*
> @@ -2282,6 +2290,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		if (gstop_done && ptrace_reparented(current))
>  			do_notify_parent_cldstop(current, false, why);
>  
> +		clear_traced_xxx();
>  		/*
>  		 * Don't want to allow preemption here, because
>  		 * sys_ptrace() needs this task to be inactive.
> @@ -2296,9 +2305,6 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		cgroup_leave_frozen(true);
>  	} else {
>  		/*
> -		 * By the time we got the lock, our tracer went away.
> -		 * Don't drop the lock yet, another tracer may come.
> -		 *
>  		 * If @gstop_done, the ptracer went away between group stop
>  		 * completion and here.  During detach, it would have set
>  		 * JOBCTL_STOP_PENDING on us and we'll re-enter
> @@ -2307,13 +2313,14 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		 */
>  		if (gstop_done)
>  			do_notify_parent_cldstop(current, false, why);
> +		clear_traced_xxx();
> +		read_unlock(&tasklist_lock);
>  
> -		/* tasklist protects us from ptrace_freeze_traced() */
> +		/* JOBCTL_TRACED_XXX protects us from ptrace_freeze_traced() */

But... TRACED_XXX has just been cleared ?!

>  		__set_current_state(TASK_RUNNING);
>  		read_code = false;
>  		if (clear_code)
>  			exit_code = 0;
> -		read_unlock(&tasklist_lock);
>  	}
>  
>  	/*
>
Oleg Nesterov April 15, 2022, 10:16 a.m. UTC | #10
On 04/15, Peter Zijlstra wrote:
>
> On Thu, Apr 14, 2022 at 08:34:33PM +0200, Oleg Nesterov wrote:
>
> > If it can work, then 1/5 needs some changes, I think. In particular,
> > it should not introduce JOBCTL_TRACED_FROZEN until 5/5, and perhaps
>
> That TRACED_FROZEN was to distinguish the TASK_TRACED and __TASK_TRACED
> state, and isn't related to the freezer.

Lets forget about 3-5 which I didn't read carefully yet. So why do we
need TRACED_FROZEN?
Oleg Nesterov April 15, 2022, 10:57 a.m. UTC | #11
On 04/15, Oleg Nesterov wrote:
>
> OK, so far it seems that this patch needs a couple of simple fixes you
> pointed out, but before I send V2:
>
> 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
>
> 	- will you agree if I change ptrace_freeze_traced() to rely
> 	  on __state == TASK_TRACED rather than task_is_traced() ?
>

Forgot to say, I think 1/5 needs some changes in any case...

ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
too. Perhaps I missed something, I'll reread 1/5 again, but the main
question to me is whether 1-2 actually need the JOBCTL_TRACED_FROZEN flag.

Oleg.
Peter Zijlstra April 15, 2022, noon UTC | #12
On Fri, Apr 15, 2022 at 12:16:44PM +0200, Oleg Nesterov wrote:
> On 04/15, Peter Zijlstra wrote:
> >
> > On Thu, Apr 14, 2022 at 08:34:33PM +0200, Oleg Nesterov wrote:
> >
> > > If it can work, then 1/5 needs some changes, I think. In particular,
> > > it should not introduce JOBCTL_TRACED_FROZEN until 5/5, and perhaps
> >
> > That TRACED_FROZEN was to distinguish the TASK_TRACED and __TASK_TRACED
> > state, and isn't related to the freezer.
> 
> Lets forget about 3-5 which I didn't read carefully yet. So why do we
> need TRACED_FROZEN?

The purpose of 1/5 was to not have any unique state in __state. To at
all times be able to reconstruct __state from outside information (where
needed).

Agreed that this particular piece of state isn't needed until 5/5, but
the concept is independent (also 5/5 is insanely large already).

> From 1/5:
> 
> 	 static inline void signal_wake_up(struct task_struct *t, bool resume)
> 	 {
> 	+	lockdep_assert_held(&t->sighand->siglock);
> 	+
> 	+	if (resume && !(t->jobctl & JOBCTL_TRACED_FROZEN))
> 	+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
> 	+
> 		signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
> 	 }
> 	+
> 	 static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
> 	 {
> 	+	lockdep_assert_held(&t->sighand->siglock);
> 	+
> 	+	if (resume)
> 	+		t->jobctl &= ~JOBCTL_TRACED;
> 	+
> 		signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
> 	 }
> 
> Can't we simply change signal_wake_up_state(),
> 
> 	void signal_wake_up_state(struct task_struct *t, unsigned int state)
> 	{
> 		set_tsk_thread_flag(t, TIF_SIGPENDING);
> 		/*
> 		 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
> 		 * case. We don't check t->state here because there is a race with it
> 		 * executing another processor and just now entering stopped state.
> 		 * By using wake_up_state, we ensure the process will wake up and
> 		 * handle its death signal.
> 		 */
> 		if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
> 			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
> 		else
> 			kick_process(t);
> 	}
> 
> ?

This would be broken when we do signal_wake_up_state() when the state
doesn't match. Does that happen? I'm thinking siglock protects us from
the most obvious races, but still.

If not broken, then it needs at least a comment explaining why not etc.
I'm sure I won't remember many of these details.

Also, signal_wake_up_state() really can do with that
lockdep_assert_held() as well ;-)

> > > 		/*
> > > 		 * We take the read lock around doing both checks to close a
> > > 		 * possible race where someone else attaches or detaches our
> > > 		 * natural child.
> > > 		 */
> > > 		read_lock(&tasklist_lock);
> > > 		traced = child->ptrace && child->parent == current;
> > > 		read_unlock(&tasklist_lock);
> > >
> > > 		if (!traced)
> > > 			return -ESRCH;
> >
> > The thing being, that if it is our ptrace child, it won't be going away
> > since we're running this code and not ptrace_detach().  Right?
> 
> Yes. and nobody else can detach it.
> 
> Another tracer can't attach until child->ptrace is cleared, but this can
> only happen if a) this child is killed and b) another thread does wait()
> and reaps it; but after that attach() is obviously impossible.
> 
> But since this child can go away, the patch changes ptrace_freeze_traced()
> to use lock_task_sighand().

Right.

> > > 		for (;;) {
> > > 			if (fatal_signal_pending(current))
> > > 				return -EINTR;
> >
> > What if signal_wake_up(.resume=true) happens here? In that case we miss
> > the fatal pending, and task state isn't changed yet so we'll happily go
> > sleep.
> 
> No, it won't sleep, see the signal_pending_state() check in schedule().

Urgh, forgot about that one ;-)
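
For reference, the check being referred to: schedule() refuses to block a
task whose state/pending-signal combination says it should keep running
(roughly, from include/linux/sched/signal.h and __schedule(); TASK_KILLABLE
is TASK_WAKEKILL | TASK_UNINTERRUPTIBLE, so a pending fatal signal trips it):

	static inline int signal_pending_state(unsigned int state,
					       struct task_struct *p)
	{
		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
			return 0;
		if (!signal_pending(p))
			return 0;

		return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
	}

	/* and in __schedule(): */
	if (signal_pending_state(prev_state, prev)) {
		WRITE_ONCE(prev->__state, TASK_RUNNING);	/* don't sleep */
	} else {
		deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
		...
	}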

> > > 			set_current_state(TASK_KILLABLE);
> 
> And let me explain TASK_KILLABLE just in case... We could just use
> TASK_UNINTERRUPTIBLE and avoid the signal_pending() check, but KILLABLE
> looks "safer" to me. If the tracer hangs because of some bug, at least
> it can be killed from userspace.

Agreed.

> 
> > > 			if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED)) {
> >
> >   TRACED_XXX ?
> 
> oops ;)
> 
> > > -	spin_lock_irq(&task->sighand->siglock);
> > >  	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
> > >  	    !__fatal_signal_pending(task)) {
> > >  		task->jobctl |= JOBCTL_TRACED_FROZEN;
> > >  		WRITE_ONCE(task->__state, __TASK_TRACED);
> > >  		ret = true;
> > >  	}
> >
> > I would feel much better if this were still a task_func_call()
> > validating !->on_rq && !->on_cpu.
> 
> Well, but "on_rq || on_cpu" would mean that wait_task_inactive() is buggy ?

Yes, but I'm starting to feel a little paranoid here. Better safe than
sorry etc..

> But! I forgot to make anothet change in this code. I do not think it should
> rely on task_is_traced(). We are going to abuse task->__state, so I think
> it should check task->__state == TASK_TRACED directly. Say,
> 
> 	if (READ_ONCE(task->__state) == TASK_TRACED && ...) {
> 		WRITE_ONCE(task->__state, __TASK_TRACED);
> 		WARN_ON_ONCE(!task_is_traced(task));
> 		ret = true;
> 	}
> 
> looks cleaner to me. What do you think?

Agreed on this.

> > > @@ -2307,13 +2313,14 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
> > >  		 */
> > >  		if (gstop_done)
> > >  			do_notify_parent_cldstop(current, false, why);
> > > +		clear_traced_xxx();
> > > +		read_unlock(&tasklist_lock);
> > >
> > > -		/* tasklist protects us from ptrace_freeze_traced() */
> > > +		/* JOBCTL_TRACED_XXX protects us from ptrace_freeze_traced() */
> >
> > But... TRACED_XXX has just been cleared ?!
> 
> Cough ;) OK, I'll move __set_current_state() back under tasklist.
> 
> And in this case we do not need wake_up(parent), so we can shift it from
> clear_traced_xxx() into another branch.
> 
> OK, so far it seems that this patch needs a couple of simple fixes you
> pointed out, but before I send V2:
> 
> 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?

We can for the sake of 2 avoid TRACED_FROZEN, but as explained at the
start, the point of 1 was to ensure there is no unique state in __state,
and I think in that respect we can keep it, hmm?

> 	- will you agree if I change ptrace_freeze_traced() to rely
> 	  on __state == TASK_TRACED rather than task_is_traced() ?

Yes.
Peter Zijlstra April 15, 2022, 12:01 p.m. UTC | #13
On Fri, Apr 15, 2022 at 12:57:56PM +0200, Oleg Nesterov wrote:
> On 04/15, Oleg Nesterov wrote:
> >
> > OK, so far it seems that this patch needs a couple of simple fixes you
> > pointed out, but before I send V2:
> >
> > 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
> >
> > 	- will you agree if I change ptrace_freeze_traced() to rely
> > 	  on __state == TASK_TRACED rather than task_is_traced() ?
> >
> 
> Forgot to say, I think 1/5 needs some changes in any case...
> 
> ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
> clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
> too.

Urgh, yes, I seemed to have missed that one :-(
Oleg Nesterov April 15, 2022, 12:56 p.m. UTC | #14
On 04/15, Peter Zijlstra wrote:
>
> On Fri, Apr 15, 2022 at 12:16:44PM +0200, Oleg Nesterov wrote:
> >
> > Lets forget about 3-5 which I didn't read carefully yet. So why do we
> > need TRACED_FROZEN?
>
> The purpose of 1/5 was to not have any unique state in __state. To at
> all times be able to reconstruct __state from outside information (where
> needed).
>
> Agreed that this particular piece of state isn't needed until 5/5, but
> the concept is independent (also 5/5 is insanely large already).

OK, so in my opinion it would be cleaner if TRACED_FROZEN came in a
separate (and simple) patch before 5/5.

I won't really argue, but to me this flag looks confusing and unnecessary
in 1-2 (which btw look like -stable material to me).

> > Can't we simply change signal_wake_up_state(),
> >
> > 	void signal_wake_up_state(struct task_struct *t, unsigned int state)
> > 	{
> > 		set_tsk_thread_flag(t, TIF_SIGPENDING);
> > 		/*
> > 		 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
> > 		 * case. We don't check t->state here because there is a race with it
> > 		 * executing another processor and just now entering stopped state.
> > 		 * By using wake_up_state, we ensure the process will wake up and
> > 		 * handle its death signal.
> > 		 */
> > 		if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
> > 			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
> > 		else
> > 			kick_process(t);
> > 	}
> >
> > ?
>
> This would be broken when we so signal_wake_up_state() when state
> doesn't match. Does that happen? I'm thikning siglock protects us from
> the most obvious races, but still.

Yes, even set_tsk_thread_flag(TIF_SIGPENDING) is not safe without siglock.

> Also, signal_wake_up_state() really can do with that
> lockdep_assert_held() as well ;-)

OK, lets add lockdep_assert_held() at the start of signal_wake_up_state ?

Agreed, this probably needs a comment, but this looks much simpler and
more understandable than 2 additional "if (resume)" checks in
signal_wake_up() and ptrace_signal_wake_up().

> > > > -	spin_lock_irq(&task->sighand->siglock);
> > > >  	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
> > > >  	    !__fatal_signal_pending(task)) {
> > > >  		task->jobctl |= JOBCTL_TRACED_FROZEN;
> > > >  		WRITE_ONCE(task->__state, __TASK_TRACED);
> > > >  		ret = true;
> > > >  	}
> > >
> > > I would feel much better if this were still a task_func_call()
> > > validating !->on_rq && !->on_cpu.
> >
> > Well, but "on_rq || on_cpu" would mean that wait_task_inactive() is buggy ?
>
> Yes, but I'm starting to feel a little paranoid here. Better safe than
> sorry etc..

OK, can we simply add

	WARN_ON_ONCE(ret && (task->on_rq || task->on_cpu));

after unlock_task_sighand() ? this is racy without rq_lock but should
catch the possible problems.

> > 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
>
> We can for the sake of 2 avoid TRACED_FROZEN, but as explained at the
> start, the point of 1 was to ensure there is no unique state in __state,
> and I think in that respect we can keep it, hmm?

See above... I understand the purpose of TRACED_FROZEN (I hope ;),
but not in 1-2, and unless I missed something the change in signal_wake_up
above simply looks better to me, but of course this is subjective.

> > 	- will you agree if I change ptrace_freeze_traced() to rely
> > 	  on __state == TASK_TRACED rather than task_is_traced() ?
>
> Yes.

Great, thanks. I'll return tomorrow.

Oleg.
Oleg Nesterov April 18, 2022, 5:01 p.m. UTC | #15
On 04/15, Peter Zijlstra wrote:
>
> On Fri, Apr 15, 2022 at 12:57:56PM +0200, Oleg Nesterov wrote:
> > On 04/15, Oleg Nesterov wrote:
> > >
> > > OK, so far it seems that this patch needs a couple of simple fixes you
> > > pointed out, but before I send V2:
> > >
> > > 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
> > >
> > > 	- will you agree if I change ptrace_freeze_traced() to rely
> > > 	  on __state == TASK_TRACED rather than task_is_traced() ?
> > >
> >
> > Forgot to say, I think 1/5 needs some changes in any case...
> >
> > ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
> > clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
> > too.
>
> Urgh, yes, I seemed to have missed that one :-(

OK, I'll wait for updated version and send V3 on top of it.

I think the fixes are simple. ptrace_resume() should simply remove the
minor "need_siglock" optimization and clear JOBCTL_TRACED. As for the
"else" branch in ptrace_stop(), perhaps 1/5 can introduce the trivial

	static void clear_traced_jobctl_flags(unsigned long flags)
	{
		spin_lock_irq(&current->sighand->siglock);
		current->jobctl &= ~flags;
		spin_unlock_irq(&current->sighand->siglock);
	}

which can be reused by 2/5 to clear JOBCTL_TRACED_XXX. And, Peter, feel
free to join 1/5 and this patch if you think this makes any sense.

Meanwhile see V2 on top of the current version.

Oleg.


diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index ec8b312f7506..1b5a57048e13 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -22,7 +22,8 @@ struct task_struct;
 
 #define JOBCTL_STOPPED_BIT	24
 #define JOBCTL_TRACED_BIT	25
-#define JOBCTL_TRACED_FROZEN_BIT 26
+#define JOBCTL_TRACED_XXX_BIT	25
+#define JOBCTL_TRACED_FROZEN_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -35,6 +36,7 @@ struct task_struct;
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_XXX	(1UL << JOBCTL_TRACED_XXX_BIT)
 #define JOBCTL_TRACED_FROZEN	(1UL << JOBCTL_TRACED_FROZEN_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 626f96d275c7..ec104bae4e21 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,20 +193,24 @@ static bool looks_like_a_spurious_pid(struct task_struct *task)
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
+	if (!lock_task_sighand(task, &flags))
+		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
+	if (READ_ONCE(task->__state) == TASK_TRACED &&
+	    !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
 		task->jobctl |= JOBCTL_TRACED_FROZEN;
 		WRITE_ONCE(task->__state, __TASK_TRACED);
+		WARN_ON_ONCE(!task_is_traced(task));
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
@@ -253,40 +257,42 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
-
+	int traced;
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks.  After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
+	if (!traced)
+		return -ESRCH;
+
+	WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+	if (ignore_state)
+		return 0;
+
+	if (!task_is_traced(child))
+		return -ESRCH;
+
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_XXX)) {
+			__set_current_state(TASK_RUNNING);
+			break;
 		}
+		schedule();
 	}
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
diff --git a/kernel/signal.c b/kernel/signal.c
index 0aea3f0a8002..c7a89904cc4a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2182,6 +2182,13 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_xxx(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	current->jobctl &= ~JOBCTL_TRACED_XXX;
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2220,7 +2227,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_XXX;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2282,6 +2289,8 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		if (gstop_done && ptrace_reparented(current))
 			do_notify_parent_cldstop(current, false, why);
 
+		clear_traced_xxx();
+		wake_up_state(current->parent, TASK_KILLABLE);
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
@@ -2297,8 +2306,12 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 	} else {
 		/*
 		 * By the time we got the lock, our tracer went away.
-		 * Don't drop the lock yet, another tracer may come.
-		 *
+		 * Don't drop the lock yet, another tracer may come,
+		 * tasklist protects us from ptrace_freeze_traced().
+		 */
+		__set_current_state(TASK_RUNNING);
+		clear_traced_xxx();
+		/*
 		 * If @gstop_done, the ptracer went away between group stop
 		 * completion and here.  During detach, it would have set
 		 * JOBCTL_STOP_PENDING on us and we'll re-enter
@@ -2307,13 +2320,11 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 		 */
 		if (gstop_done)
 			do_notify_parent_cldstop(current, false, why);
+		read_unlock(&tasklist_lock);
 
-		/* tasklist protects us from ptrace_freeze_traced() */
-		__set_current_state(TASK_RUNNING);
 		read_code = false;
 		if (clear_code)
 			exit_code = 0;
-		read_unlock(&tasklist_lock);
 	}
 
 	/*
Oleg Nesterov April 18, 2022, 5:19 p.m. UTC | #16
On 04/18, Oleg Nesterov wrote:
>
>  static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  {
...
> +	if (!traced)
> +		return -ESRCH;
> +
> +	WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> +	if (ignore_state)
> +		return 0;
> +
> +	if (!task_is_traced(child))
> +		return -ESRCH;

This is the new check V1 didn't have; we need it to ensure that check_attach
can't miss the change in child->jobctl and call wait_task_inactive() before
the child marks itself as "traced" and clears JOBCTL_TRACED_XXX.

Oleg.
Peter Zijlstra April 20, 2022, 10:20 a.m. UTC | #17
On Fri, Apr 15, 2022 at 12:57:56PM +0200, Oleg Nesterov wrote:
> On 04/15, Oleg Nesterov wrote:
> >
> > OK, so far it seems that this patch needs a couple of simple fixes you
> > pointed out, but before I send V2:
> >
> > 	- do you agree we can avoid JOBCTL_TRACED_FROZEN in 1-2 ?
> >
> > 	- will you agree if I change ptrace_freeze_traced() to rely
> > 	  on __state == TASK_TRACED rather than task_is_traced() ?
> >
> 
> Forgot to say, I think 1/5 needs some changes in any case...
> 
> ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
> clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
> too. Perhaps I missed something, I'll reread 1/5 again, but the main
> question to me is whether 1-2 actually need the JOBCTL_TRACED_FROZEN flag.

Ok, getting back to this. So I did the change to ptrace_resume(), but
I'm not entirely sure I understand the issue with the else branch of
ptrace_stop().

My understanding is that if we hit that else branch, we've raced with
__ptrace_unlink(), and that will have done:

  if (... || task_is_traced(child))
	ptrace_signal_wake_up(child, true);

Which will have done that wakeup and cleared both __state and jobctl.
Oleg Nesterov April 20, 2022, 11:35 a.m. UTC | #18
On 04/20, Peter Zijlstra wrote:
>
> On Fri, Apr 15, 2022 at 12:57:56PM +0200, Oleg Nesterov wrote:
> >
> > ptrace_resume() does  wake_up_state(child, __TASK_TRACED) but doesn't
> > clear JOBCTL_TRACED. The "else" branch in ptrace_stop() leaks this flag
> > too. Perhaps I missed something, I'll reread 1/5 again, but the main
> > question to me is whether 1-2 actually need the JOBCTL_TRACED_FROZEN flag.
>
> Ok, getting back to this. So I did the change to ptrace_resume(), but
> I'm not entirely sure I understand the issue with the else branch of
> ptrace_stop().
>
> My understanding is that if we hit that else branch, we've raced wth
> __ptrace_unlink(), and that will have done:

Yes, thanks for correcting me, I forgot that may_ptrace_stop() has gone.

Oleg.
Peter Zijlstra April 20, 2022, 1:17 p.m. UTC | #19
On Mon, Apr 18, 2022 at 07:01:05PM +0200, Oleg Nesterov wrote:

> diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
> index ec8b312f7506..1b5a57048e13 100644
> --- a/include/linux/sched/jobctl.h
> +++ b/include/linux/sched/jobctl.h
> @@ -22,7 +22,8 @@ struct task_struct;
>  
>  #define JOBCTL_STOPPED_BIT	24
>  #define JOBCTL_TRACED_BIT	25
> +#define JOBCTL_TRACED_XXX_BIT	25

26, also we must come up with a better name than triple-x. In my head
it's started to be called TRACED_OLEG, but that can't be right either
;-)

Does something like:

#define JOBCTL_TRACED_BIT		25
#define JOBCTL_TRACED_QUIESCE_BIT	26

work?

> diff --git a/kernel/signal.c b/kernel/signal.c
> index 0aea3f0a8002..c7a89904cc4a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2182,6 +2182,13 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
>  	spin_unlock_irqrestore(&sighand->siglock, flags);
>  }
>  
> +static void clear_traced_xxx(void)
> +{
> +	spin_lock_irq(&current->sighand->siglock);
> +	current->jobctl &= ~JOBCTL_TRACED_XXX;
> +	spin_unlock_irq(&current->sighand->siglock);
> +}
> +
>  /*
>   * This must be called with current->sighand->siglock held.
>   *
> @@ -2220,7 +2227,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  	 * schedule() will not sleep if there is a pending signal that
>  	 * can awaken the task.
>  	 */
> -	current->jobctl |= JOBCTL_TRACED;
> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_XXX;
>  	set_special_state(TASK_TRACED);
>  
>  	/*
> @@ -2282,6 +2289,8 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		if (gstop_done && ptrace_reparented(current))
>  			do_notify_parent_cldstop(current, false, why);
>  
> +		clear_traced_xxx();
> +		wake_up_state(current->parent, TASK_KILLABLE);
>  		/*
>  		 * Don't want to allow preemption here, because
>  		 * sys_ptrace() needs this task to be inactive.
> @@ -2297,8 +2306,12 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  	} else {
>  		/*
>  		 * By the time we got the lock, our tracer went away.
> -		 * Don't drop the lock yet, another tracer may come.
> -		 *
> +		 * Don't drop the lock yet, another tracer may come,
> +		 * tasklist protects us from ptrace_freeze_traced().
> +		 */
> +		__set_current_state(TASK_RUNNING);
> +		clear_traced_xxx();
> +		/*
>  		 * If @gstop_done, the ptracer went away between group stop
>  		 * completion and here.  During detach, it would have set
>  		 * JOBCTL_STOP_PENDING on us and we'll re-enter

This is that same else clause again; perhaps make signal_wake_up_state()
also clear TRACED_XXX instead?
Oleg Nesterov April 20, 2022, 6:03 p.m. UTC | #20
On 04/20, Peter Zijlstra wrote:
>
> Does something like:
>
> #define JOBCTL_TRACED_BIT		25
> #define JOBCTL_TRACED_QUIESCE_BIT	26
>
> work?

Agreed! ;)

> >  	} else {
> >  		/*
> >  		 * By the time we got the lock, our tracer went away.
> > -		 * Don't drop the lock yet, another tracer may come.
> > -		 *
> > +		 * Don't drop the lock yet, another tracer may come,
> > +		 * tasklist protects us from ptrace_freeze_traced().
> > +		 */
> > +		__set_current_state(TASK_RUNNING);
> > +		clear_traced_xxx();
> > +		/*
> >  		 * If @gstop_done, the ptracer went away between group stop
> >  		 * completion and here.  During detach, it would have set
> >  		 * JOBCTL_STOP_PENDING on us and we'll re-enter
>
> This is that same else clause again; perhaps make signal_wake_up_state()
> also clear TRACED_XXX instead?

I swear, I too thought about this after my last email. Yes, I think this
makes sense. Thanks,

Oleg.
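
Concretely, folding this into the signal_wake_up_state() variant proposed
earlier in the thread would look something like the sketch below (with
JOBCTL_TRACED_XXX being the bit about to be renamed JOBCTL_TRACED_QUIESCE):

	void signal_wake_up_state(struct task_struct *t, unsigned int state)
	{
		lockdep_assert_held(&t->sighand->siglock);

		set_tsk_thread_flag(t, TIF_SIGPENDING);
		/*
		 * TASK_WAKEKILL also means wake it up in the stopped/traced/
		 * killable case. If the wakeup really happened the task can
		 * no longer be blocked in ptrace_stop(), so drop the TRACED
		 * bits, including the quiesce bit, along with it.
		 */
		if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED |
				       JOBCTL_TRACED_XXX);
		else
			kick_process(t);
	}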
Peter Zijlstra April 21, 2022, 7:21 a.m. UTC | #21
On Wed, Apr 20, 2022 at 03:54:15PM -0500, Eric W. Biederman wrote:
> 
> I was thinking about this and I have an approach from a different
> direction.  In particular it removes the need for ptrace_freeze_traced()
> and ptrace_unfreeze_traced() to change __state.  Instead a jobctl
> bit is used to suppress waking up a process with TASK_WAKEKILL.
> 
> I think this would be a good technique to completely decouple
> PREEMPT_RT from the work that ptrace_freeze_traced() does.
> 
> Comments?
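
(The core of the technique Eric describes: while the new jobctl bit is set,
signal_wake_up() withholds TASK_WAKEKILL, so a fatal signal no longer wakes
the frozen tracee out of TASK_TRACED. A sketch of the idea, not his literal
patch:)

	static inline void signal_wake_up(struct task_struct *t, bool fatal)
	{
		unsigned int state = 0;

		/* tracee is frozen by the ptracer: suppress the kill-wakeup */
		if (fatal && !(t->jobctl & JOBCTL_DELAY_WAKEKILL))
			state = TASK_WAKEKILL;

		signal_wake_up_state(t, state);
	}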

On first read-through, I like it! A few comments down below..

> @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>  	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
>  	 * Recheck state under the lock to close this race.
>  	 */
> -	spin_lock_irq(&task->sighand->siglock);
> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
> -		if (__fatal_signal_pending(task))
> -			wake_up_state(task, __TASK_TRACED);
> -		else
> -			WRITE_ONCE(task->__state, TASK_TRACED);
> -	}
> +	spin_unlock_irq(&task->sighand->siglock);

  ^^^^ this should be spin_lock_irq(...)

> +	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
> +	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
> +	if (fatal_signal_pending(task))
> +		wake_up_state(task, TASK_WAKEKILL);
>  	spin_unlock_irq(&task->sighand->siglock);
>  }
>  
> @@ -256,7 +255,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  	 */
>  	read_lock(&tasklist_lock);
>  	if (child->ptrace && child->parent == current) {
> -		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> +		WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
>  		/*
>  		 * child->sighand can't be NULL, release_task()
>  		 * does ptrace_unlink() before __exit_signal().
> @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  	read_unlock(&tasklist_lock);
>  
>  	if (!ret && !ignore_state) {
> -		if (!wait_task_inactive(child, __TASK_TRACED)) {
> +		if (!wait_task_inactive(child, TASK_TRACED)) {

This is still very dubious: there are spinlocks between
set_current_state(TASK_TRACED) and schedule(), so wait_task_inactive()
can fail where we don't want it to, due to TASK_TRACED being temporarily
held in ->saved_state.
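
To spell out the hazard (illustrative sequence, not part of the patch):

	/* tracee, in ptrace_stop(), annotated for PREEMPT_RT */
	current->jobctl |= JOBCTL_TRACED;
	set_special_state(TASK_TRACED);	/* __state = TASK_TRACED */
	...
	read_unlock(&tasklist_lock);
	cgroup_enter_frozen();		/* sleeping lock on RT: moves
					 * TASK_TRACED into ->saved_state,
					 * leaves __state = TASK_RTLOCK_WAIT,
					 * so the tracer's
					 * wait_task_inactive(child,
					 * TASK_TRACED) fails */
	schedule();			/* the intended quiescent point */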

>  			/*
>  			 * This can only happen if may_ptrace_stop() fails and
>  			 * ptrace_stop() changes ->state back to TASK_RUNNING,
> -			 * so we should not worry about leaking __TASK_TRACED.
> +			 * so we should not worry about leaking JOBCTL_DELAY_WAKEKILL.
>  			 */
> +			WARN_ON(!(child->jobctl & JOBCTL_DELAY_WAKEKILL));
>  			ret = -ESRCH;
>  		}
>  	}
Oleg Nesterov April 21, 2022, 9:46 a.m. UTC | #22
On 04/20, Eric W. Biederman wrote:
>
> I was thinking about this and I have an approach from a different
> direction.  In particular it removes the need for ptrace_freeze_attach
> and ptrace_unfreeze_attach to change __state.  Instead a jobctl
> bit is used to suppress waking up a process with TASK_WAKEKILL.

I think this can work, but we still need something like 1/5 + 2/5?

> I think this would be a good technique to completely decouple
> PREEMPT_RT from the work that ptrace_freeze_attach does.

If CONFIG_PREEMPT_RT=y we can't rely on the ->__state check in task_is_traced(),
and wait_task_inactive() can wrongly fail if the tracee sleeps waiting
for tasklist_lock.

A couple of comments after a quick glance,

>  static void ptrace_unfreeze_traced(struct task_struct *task)
>  {
> -	if (READ_ONCE(task->__state) != __TASK_TRACED)
> +	if (!task_is_traced(task))
>  		return;
>
>  	WARN_ON(!task->ptrace || task->parent != current);
> @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>  	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
>  	 * Recheck state under the lock to close this race.
>  	 */
> -	spin_lock_irq(&task->sighand->siglock);
> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
> -		if (__fatal_signal_pending(task))
> -			wake_up_state(task, __TASK_TRACED);
> -		else
> -			WRITE_ONCE(task->__state, TASK_TRACED);
> -	}
> +	spin_unlock_irq(&task->sighand->siglock);
> +	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
> +	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;

We can't rely on the lockless task_is_traced() check above... probably this
is fine, but I need to re-check. But at least you need to remove the comment
about PTRACE_LISTEN above.
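
For reference, the now-stale comment in question:

	/*
	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
	 * Recheck state under the lock to close this race.
	 */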

Another problem is that WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL))
doesn't look right if ignore_state in ptrace_check_attach() was true; in
that case the tracee could stop before ptrace_unfreeze_traced().

> @@ -892,7 +891,6 @@ static int ptrace_resume(struct task_struct *child, long request,
>  	 * status and clears the code too; this can't race with the tracee, it
>  	 * takes siglock after resume.
>  	 */
> -	need_siglock = data && !thread_group_empty(current);
>  	if (need_siglock)
>  		spin_lock_irq(&child->sighand->siglock);

Hmm?

Oleg.
Peter Zijlstra April 21, 2022, 10:26 a.m. UTC | #23
On Thu, Apr 21, 2022 at 09:21:38AM +0200, Peter Zijlstra wrote:
> > @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
> >  	read_unlock(&tasklist_lock);
> >  
> >  	if (!ret && !ignore_state) {
> > -		if (!wait_task_inactive(child, __TASK_TRACED)) {
> > +		if (!wait_task_inactive(child, TASK_TRACED)) {
> 
> This is still very dubious: there are spinlocks between
> set_current_state(TASK_TRACED) and schedule(), so wait_task_inactive()
> can fail where we don't want it to, due to TASK_TRACED being temporarily
> held in ->saved_state.

As such, I've taken the liberty of munging yours and Oleg's approach
together. I've yet to actually test this but it looks feasible.

---
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,9 +19,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_DELAY_WAKEKILL_BIT 24	/* delay killable wakeups */
 
-#define JOBCTL_STOPPED_BIT	24
-#define JOBCTL_TRACED_BIT	25
+#define JOBCTL_STOPPED_BIT	25
+#define JOBCTL_TRACED_BIT	26
+#define JOBCTL_TRACED_QUIESCE_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -31,9 +33,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_DELAY_WAKEKILL	(1UL << JOBCTL_DELAY_WAKEKILL_BIT)
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_QUIESCE	(1UL << JOBCTL_TRACED_QUIESCE_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,26 +193,32 @@ static bool looks_like_a_spurious_pid(st
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
+	if (!lock_task_sighand(task, &flags))
+		return ret;
+
+	if (READ_ONCE(task->__state) == TASK_TRACED &&
+	    !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
-		WRITE_ONCE(task->__state, __TASK_TRACED);
+		WARN_ON_ONCE(!task_is_traced(task));
+		WARN_ON_ONCE(task->jobctl & JOBCTL_DELAY_WAKEKILL);
+		task->jobctl |= JOBCTL_DELAY_WAKEKILL;
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
 
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
-	if (READ_ONCE(task->__state) != __TASK_TRACED)
+	if (!task_is_traced(task))
 		return;
 
 	WARN_ON(!task->ptrace || task->parent != current);
@@ -222,12 +228,11 @@ static void ptrace_unfreeze_traced(struc
 	 * Recheck state under the lock to close this race.
 	 */
 	spin_lock_irq(&task->sighand->siglock);
-	if (READ_ONCE(task->__state) == __TASK_TRACED) {
-		if (__fatal_signal_pending(task)) {
-			task->jobctl &= ~JOBCTL_TRACED;
-			wake_up_state(task, __TASK_TRACED);
-		} else
-			WRITE_ONCE(task->__state, TASK_TRACED);
+//	WARN_ON_ONCE(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
+	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
+	if (__fatal_signal_pending(task)) {
+		task->jobctl &= ~JOBCTL_TRACED;
+		wake_up_state(task, TASK_WAKEKILL);
 	}
 	spin_unlock_irq(&task->sighand->siglock);
 }
@@ -251,40 +256,46 @@ static void ptrace_unfreeze_traced(struc
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
+	int traced;
 
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks.  After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
+	if (!traced)
+		return -ESRCH;
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
-		}
+	WARN_ON_ONCE(READ_ONCE(child->__state) == __TASK_TRACED);
+	WARN_ON_ONCE(READ_ONCE(child->jobctl) & JOBCTL_DELAY_WAKEKILL);
+
+	if (ignore_state)
+		return 0;
+
+	if (!task_is_traced(child))
+		return -ESRCH;
+
+	/* wait for JOBCTL_TRACED_QUIESCE to go away, see ptrace_stop() */
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_QUIESCE))
+			break;
+
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
@@ -1329,8 +1340,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
-	if (ret || request != PTRACE_DETACH)
-		ptrace_unfreeze_traced(child);
+	ptrace_unfreeze_traced(child);
 
  out_put_task_struct:
 	put_task_struct(child);
@@ -1472,8 +1482,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_lo
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (ret || request != PTRACE_DETACH)
-			ptrace_unfreeze_traced(child);
+		ptrace_unfreeze_traced(child);
 	}
 
  out_put_task_struct:
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -764,6 +764,10 @@ void signal_wake_up_state(struct task_st
 {
 	lockdep_assert_held(&t->sighand->siglock);
 
+	/* Suppress wakekill? */
+	if (t->jobctl & JOBCTL_DELAY_WAKEKILL)
+		state &= ~TASK_WAKEKILL;
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 
 	/*
@@ -774,7 +778,7 @@ void signal_wake_up_state(struct task_st
 	 * handle its death signal.
 	 */
 	if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
-		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE);
 	else
 		kick_process(t);
 }
@@ -2187,6 +2191,15 @@ static void do_notify_parent_cldstop(str
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_quiesce(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
+	current->jobctl &= ~JOBCTL_TRACED_QUIESCE;
+	wake_up_state(current->parent, TASK_KILLABLE);
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
-		 *
-		 * XXX: implement read_unlock_no_resched().
 		 */
 		preempt_disable();
 		read_unlock(&tasklist_lock);
-		cgroup_enter_frozen();
+		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
+
+		/*
+		 * JOBCTL_TRACED_QUIESCE bridges the gap between
+		 * set_special_state(TASK_TRACED) above and schedule() below.
+		 * There must not be any blocking (specifically anything that
+		 * touches ->saved_state on PREEMPT_RT) between here and
+		 * schedule().
+		 *
+		 * ptrace_check_attach() relies on this with its
+		 * wait_task_inactive() usage.
+		 */
+		clear_traced_quiesce();
+
 		preempt_enable_no_resched();
 		freezable_schedule();
+
 		cgroup_leave_frozen(true);
 	} else {
 		/*
@@ -2335,6 +2360,7 @@ static int ptrace_stop(int exit_code, in
 
 	/* LISTENING can be set only during STOP traps, clear it */
 	current->jobctl &= ~JOBCTL_LISTENING;
+	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 
 	/*
 	 * Queued signals ignored us while we were stopped for tracing.
Oleg Nesterov April 21, 2022, 10:49 a.m. UTC | #24
On 04/21, Peter Zijlstra wrote:
>
> As such, I've taken the liberty of munging yours and Oleg's approach
> together. I've yet to actually test this but it looks feasible.

Agreed, but I am a bit confused. Is it on top of 1/5? Doesn't look so,
and I can't apply it with or without 1/5. So I am not sure I understand
what exactly it does.

Oleg.
Peter Zijlstra April 21, 2022, 11:50 a.m. UTC | #25
On Thu, Apr 21, 2022 at 12:49:43PM +0200, Oleg Nesterov wrote:
> On 04/21, Peter Zijlstra wrote:
> >
> > As such, I've taken the liberty of munging yours and Oleg's approach
> > together. I've yet to actually test this but it looks feasible.
> 
> Agreed, but I am a bit confused. Is it on top of 1/5? Doesn't look so,
> and I can't apply it with or without 1/5. So I am not sure I understand
> what exactly it does.

Yes, it is on top of a modified 1/5; I mean to post an updated and tested
series later today.

Basically it does the TRACED_QUIESCE thing from your patch, and then
does the DELAY_WAKEKILL thing from Eric's patch.
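
Condensed, the tracer side of the combined approach looks like this
(sketch of the patch above; helper name is made up, error paths trimmed):

	static int check_attach_sketch(struct task_struct *child)
	{
		/*
		 * 1) wait for the tracee to clear TRACED_QUIESCE, i.e. to
		 *    have passed all blocking between set_special_state()
		 *    and schedule().
		 */
		for (;;) {
			if (fatal_signal_pending(current))
				return -EINTR;
			set_current_state(TASK_KILLABLE);
			if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_QUIESCE))
				break;
			schedule();
		}
		__set_current_state(TASK_RUNNING);

		/*
		 * 2) now the tracee can only be in schedule(), so this no
		 *    longer races with ->saved_state games.
		 */
		if (!wait_task_inactive(child, TASK_TRACED))
			return -ESRCH;

		/*
		 * 3) suppress TASK_WAKEKILL wakeups for the duration of the
		 *    ptrace operation; sets JOBCTL_DELAY_WAKEKILL.
		 */
		if (!ptrace_freeze_traced(child))
			return -ESRCH;

		return 0;
	}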
Eric W. Biederman April 21, 2022, 2:45 p.m. UTC | #26
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Apr 20, 2022 at 03:54:15PM -0500, Eric W. Biederman wrote:
>> 
>> I was thinking about this and I have an approach from a different
>> direction.  In particular it removes the need for ptrace_freeze_attach
>> and ptrace_unfreeze_attach to change __state.  Instead a jobctl
>> bit is used to suppress waking up a process with TASK_WAKEKILL.
>> 
>> I think this would be a good technique to completely decouple
>> PREEMPT_RT from the work that ptrace_freeze_attach does.
>> 
>> Comments?
>
> On first read-through, I like it! A few comments down below..
>
>> @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>>  	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
>>  	 * Recheck state under the lock to close this race.
>>  	 */
>> -	spin_lock_irq(&task->sighand->siglock);
>> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
>> -		if (__fatal_signal_pending(task))
>> -			wake_up_state(task, __TASK_TRACED);
>> -		else
>> -			WRITE_ONCE(task->__state, TASK_TRACED);
>> -	}
>> +	spin_unlock_irq(&task->sighand->siglock);
>
>   ^^^^ this should be spin_lock_irq(...)

Doh!

Thank you for spotting that.  That solves my nasty splat in
__send_signal.


>
>> +	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
>> +	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
>> +	if (fatal_signal_pending(task))
>> +		wake_up_state(task, TASK_WAKEKILL);
>>  	spin_unlock_irq(&task->sighand->siglock);
>>  }
>>  
>> @@ -256,7 +255,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>>  	 */
>>  	read_lock(&tasklist_lock);
>>  	if (child->ptrace && child->parent == current) {
>> -		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
>> +		WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
>>  		/*
>>  		 * child->sighand can't be NULL, release_task()
>>  		 * does ptrace_unlink() before __exit_signal().
>> @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>>  	read_unlock(&tasklist_lock);
>>  
>>  	if (!ret && !ignore_state) {
>> -		if (!wait_task_inactive(child, __TASK_TRACED)) {
>> +		if (!wait_task_inactive(child, TASK_TRACED)) {
>
> This is still very dubious: there are spinlocks between
> set_current_state(TASK_TRACED) and schedule(), so wait_task_inactive()
> can fail where we don't want it to, due to TASK_TRACED being temporarily
> held in ->saved_state.

When it comes to PREEMPT_RT, yes.

I think we might be able to remove the wait_task_inactive(); I am
not certain what it gets us.

All this change gets us is the removal of playing with __state.
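
That is, the freeze side goes from rewriting __state to merely setting a
jobctl bit (the two relevant lines, before and after):

	/* before: */
	WRITE_ONCE(task->__state, __TASK_TRACED);

	/* after: */
	task->jobctl |= JOBCTL_DELAY_WAKEKILL;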

Eric
Eric W. Biederman April 21, 2022, 3:01 p.m. UTC | #27
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/20, Eric W. Biederman wrote:
>> @@ -892,7 +891,6 @@ static int ptrace_resume(struct task_struct *child, long request,
>>  	 * status and clears the code too; this can't race with the tracee, it
>>  	 * takes siglock after resume.
>>  	 */
>> -	need_siglock = data && !thread_group_empty(current);
>>  	if (need_siglock)
>>  		spin_lock_irq(&child->sighand->siglock);
>
> Hmm?

A half backed-out change in ptrace_resume() (I thought it would need to
clear JOBCTL_DELAY_WAKEKILL); I somehow failed to restore the
need_siglock line.
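
For reference, the dropped line as it reads in the current tree:

	/* in ptrace_resume(): */
	need_siglock = data && !thread_group_empty(current);
	if (need_siglock)
		spin_lock_irq(&child->sighand->siglock);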

Eric
diff mbox series

Patch

--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -186,31 +186,13 @@  static bool looks_like_a_spurious_pid(st
 }
 
 /*
- * Ensure that nothing can wake it up, even SIGKILL
+ * Ptrace operation is complete, re-allow TASK_WAKEKILL.
  *
- * A task is switched to this state while a ptrace operation is in progress;
- * such that the ptrace operation is uninterruptible.
+ * Unfreeze is easy, since ptrace_check_attach() guarantees the task is off the
+ * runqueue and __TASK_TRACED. If it's still __TASK_TRACED holding
+ * sighand->siglock serializes against ptrace_signal_wake_up() and we can
+ * simply put TASK_WAKEKILL back (or wake because there's a pending fatal).
  */
-static bool ptrace_freeze_traced(struct task_struct *task)
-{
-	bool ret = false;
-
-	/* Lockless, nobody but us can set this flag */
-	if (task->jobctl & JOBCTL_LISTENING)
-		return ret;
-
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
-	    !__fatal_signal_pending(task)) {
-		task->jobctl |= JOBCTL_TRACED_FROZEN;
-		WRITE_ONCE(task->__state, __TASK_TRACED);
-		ret = true;
-	}
-	spin_unlock_irq(&task->sighand->siglock);
-
-	return ret;
-}
-
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
 	if (READ_ONCE(task->__state) != __TASK_TRACED)
@@ -234,6 +216,94 @@  static void ptrace_unfreeze_traced(struc
 	spin_unlock_irq(&task->sighand->siglock);
 }
 
+/*
+ * In order to start a ptrace operation the victim task must be off the
+ * runqueue in state __TASK_TRACED.
+ */
+static int __ptrace_freeze_cond(struct task_struct *p)
+{
+	if (!task_is_traced(p))
+		return -ESRCH;
+
+	if (task_curr(p))
+		return -EINPROGRESS;
+
+	if (p->on_rq)
+		return -EWOULDBLOCK;
+
+	/*
+	 * Due to PREEMPT_RT it is possible the task is blocked on a spinlock
+	 * in state TASK_RTLOCK_WAIT, if so, gotta wait until that goes away.
+	 */
+	if (!(READ_ONCE(p->__state) & __TASK_TRACED))
+		return -EWOULDBLOCK;
+
+	return 0;
+}
+
+/*
+ * Returns:
+ *  0:		  task is off runqueue in TASK_TRACED
+ *  -ESRCH:	  not traced
+ *  -EINPROGRESS: still running
+ *  -EWOULDBLOCK: not running
+ */
+static int __ptrace_pre_freeze(struct task_struct *p, void *arg)
+{
+	int ret;
+
+	ret = __ptrace_freeze_cond(p);
+	if (ret)
+		return ret;
+
+	*(unsigned long *)arg = p->nvcsw;
+
+	return 0;
+}
+
+/*
+ * Returns:
+ *  0:		  task is off runqueue, now __TASK_TRACED
+ *  -ESRCH:	  not traced, or scheduled since pre_freeze
+ *  -EINPROGRESS: still running
+ *  -EWOULDBLOCK: not running
+ */
+static int __ptrace_freeze(struct task_struct *p, void *arg)
+{
+	int ret;
+
+	ret = __ptrace_freeze_cond(p);
+	if (ret)
+		return ret;
+
+	/*
+	 * Task scheduled between __ptrace_pre_freeze() and here, not our task
+	 * anymore.
+	 */
+	if (*(unsigned long *)arg != p->nvcsw)
+		return -ESRCH;
+
+	if (looks_like_a_spurious_pid(p))
+		return -ESRCH;
+
+	if (__fatal_signal_pending(p))
+		return -ESRCH;
+
+	/*
+	 * we hold:
+	 *
+	 *  - tasklist_lock       (avoids ptrace_detach)
+	 *  - p->sighand->siglock (avoids ptrace_signal_wake_up)
+	 *  - p->pi_lock          (avoids anything scheduler)
+	 *
+	 * task is absolutely frozen, now we can safely take out
+	 * TASK_WAKEKILL.
+	 */
+	p->jobctl |= JOBCTL_TRACED_FROZEN;
+	WRITE_ONCE(p->__state, __TASK_TRACED);
+	return 0;
+}
+
 /**
  * ptrace_check_attach - check whether ptracee is ready for ptrace operation
  * @child: ptracee to check for
@@ -253,7 +323,36 @@  static void ptrace_unfreeze_traced(struc
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
+	ktime_t to = TICK_NSEC;
+	unsigned long nvcsw;
+	int ret;
+
+	if (child == current)
+		return -ESRCH;
+
+	if (!ignore_state) for (;;) {
+		/*
+		 * Ensure this child is a quiescent TASK_TRACED task.
+		 */
+		ret = task_call_func(child, __ptrace_pre_freeze, &nvcsw);
+		switch (ret) {
+		case 0:
+			break;
+		case -ESRCH:
+			return ret;
+		case -EINPROGRESS:
+			while (task_curr(child))
+				cpu_relax();
+			continue;
+		case -EWOULDBLOCK:
+			/*
+			 * XXX horrible, randomly wait 1 tick and try again.
+			 */
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
+			continue;
+		}
+		break;
+	}
 
 	/*
 	 * We take the read lock around doing both checks to close a
@@ -262,29 +361,35 @@  static int ptrace_check_attach(struct ta
 	 * we are sure that this is our traced child and that can only
 	 * be changed by us so it's not changing right after this.
 	 */
+	ret = -ESRCH;
 	read_lock(&tasklist_lock);
 	if (child->ptrace && child->parent == current) {
 		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+		if (ignore_state) {
+			ret = 0;
+			goto unlock;
+		}
+
+		if (child->jobctl & JOBCTL_LISTENING)
+			goto unlock;
+
 		/*
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
 		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
-	read_unlock(&tasklist_lock);
-
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
+		spin_lock_irq(&child->sighand->siglock);
+		ret = task_call_func(child, __ptrace_freeze, &nvcsw);
+		if (ret) {
 			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
+			 * If *anything* changed since __ptrace_pre_freeze()
+			 * this fails.
 			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
 			ret = -ESRCH;
 		}
+		spin_unlock_irq(&child->sighand->siglock);
 	}
+unlock:
+	read_unlock(&tasklist_lock);
 
 	return ret;
 }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6310,10 +6310,7 @@  static void __sched notrace __schedule(u
 
 	/*
 	 * We must load prev->state once (task_struct::state is volatile), such
-	 * that:
-	 *
-	 *  - we form a control dependency vs deactivate_task() below.
-	 *  - ptrace_{,un}freeze_traced() can change ->state underneath us.
+	 * that we form a control dependency vs deactivate_task() below.
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {