sched: Fix race against ptrace_freeze_trace()
author Peter Zijlstra <peterz@infradead.org>
Mon, 20 Jul 2020 15:20:21 +0000 (17:20 +0200)
committer Peter Zijlstra <peterz@infradead.org>
Wed, 22 Jul 2020 08:22:00 +0000 (10:22 +0200)
There is apparently one site that violates the rule that only current
and ttwu() will modify task->state; namely, ptrace_{,un}freeze_traced()
will change task->state for a remote task.

Oleg explains:

  "TASK_TRACED/TASK_STOPPED was always protected by siglock. In
particular, ttwu(__TASK_TRACED) must be always called with siglock
held. That is why ptrace_freeze_traced() assumes it can safely do
s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock)."
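
For reference, a sketch of the offending site (simplified from
kernel/ptrace.c of that era; not the verbatim source):

  static bool ptrace_freeze_traced(struct task_struct *task)
  {
          bool ret = false;

          spin_lock_irq(&task->sighand->siglock);
          if (task_is_traced(task) && !__fatal_signal_pending(task)) {
                  /* a remote ->state write, racing with __schedule() */
                  task->state = __TASK_TRACED;
                  ret = true;
          }
          spin_unlock_irq(&task->sighand->siglock);

          return ret;
  }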

This breaks the ordering scheme introduced by commit:

  dbfb089d360b ("sched: Fix loadavg accounting race")

Specifically, the reloaded prev->state not matching prev_state no
longer implies that we don't have to block: the mismatch can now come
from ptrace rather than from a concurrent ttwu().

Simplify things by noting that what we need is a LOAD->STORE ordering
and this can be provided by a control dependency.

So replace:

  prev_state = prev->state;
  raw_spin_lock(&rq->lock);
  smp_mb__after_spinlock(); /* SMP-MB */
  if (... && prev_state && prev_state == prev->state)
          deactivate_task();

with:

  prev_state = prev->state;
  if (... && prev_state) /* CTRL-DEP */
          deactivate_task();

That already implies the 'prev->state' load must be complete before
allowing the 'prev->on_rq = 0' store to become visible.
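
This is the canonical control-dependency pattern from
Documentation/memory-barriers.txt (a sketch; that file covers the
compiler caveats): because CPUs do not speculate stores, a store that
is only reached by branching on a loaded value cannot become visible
before that load completes, with no barrier instruction needed:

  q = READ_ONCE(a);         /* LOAD                                */
  if (q)                    /* branch depends on the loaded value  */
          WRITE_ONCE(b, 1); /* STORE ordered after the LOAD of 'a' */

Here the LOAD is of prev->state (volatile, hence read directly) and
the STORE is the p->on_rq = 0 done inside deactivate_task().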

Fixes: dbfb089d360b ("sched: Fix loadavg accounting race")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e15543c..5dece9b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4119,9 +4119,6 @@ static void __sched notrace __schedule(bool preempt)
        local_irq_disable();
        rcu_note_context_switch(preempt);
 
-       /* See deactivate_task() below. */
-       prev_state = prev->state;
-
        /*
         * Make sure that signal_pending_state()->signal_pending() below
         * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
@@ -4145,11 +4142,16 @@ static void __sched notrace __schedule(bool preempt)
        update_rq_clock(rq);
 
        switch_count = &prev->nivcsw;
+
        /*
-        * We must re-load prev->state in case ttwu_remote() changed it
-        * before we acquired rq->lock.
+        * We must load prev->state once (task_struct::state is volatile), such
+        * that:
+        *
+        *  - we form a control dependency vs deactivate_task() below.
+        *  - ptrace_{,un}freeze_traced() can change ->state underneath us.
         */
-       if (!preempt && prev_state && prev_state == prev->state) {
+       prev_state = prev->state;
+       if (!preempt && prev_state) {
                if (signal_pending_state(prev_state, prev)) {
                        prev->state = TASK_RUNNING;
                } else {
@@ -4163,10 +4165,12 @@ static void __sched notrace __schedule(bool preempt)
 
                        /*
                         * __schedule()                 ttwu()
-                        *   prev_state = prev->state;    if (READ_ONCE(p->on_rq) && ...)
-                        *   LOCK rq->lock                  goto out;
-                        *   smp_mb__after_spinlock();    smp_acquire__after_ctrl_dep();
-                        *   p->on_rq = 0;                p->state = TASK_WAKING;
+                        *   prev_state = prev->state;    if (p->on_rq && ...)
+                        *   if (prev_state)                goto out;
+                        *     p->on_rq = 0;              smp_acquire__after_ctrl_dep();
+                        *                                p->state = TASK_WAKING
+                        *
+                        * Where __schedule() and ttwu() have matching control dependencies.
                         *
                         * After this, schedule() must not care about p->state any more.
                         */