fs/namespace: Boost the mount_lock.lock owner instead of spinning on PREEMPT_RT.
authorSebastian Andrzej Siewior <bigeasy@linutronix.de>
Thu, 25 Nov 2021 12:07:11 +0000 (13:07 +0100)
committerChristian Brauner <christian.brauner@ubuntu.com>
Fri, 26 Nov 2021 11:09:09 +0000 (12:09 +0100)
The MNT_WRITE_HOLD flag is used to hold back any new writers while the
mount point is about to be made read-only. __mnt_want_write() then loops
with disabled preemption until this flag disappears. Callers of
mnt_hold_writers() (which sets the flag) hold the spinlock_t of
mount_lock (seqlock_t) which disables preemption on !PREEMPT_RT and
ensures the task is not scheduled away so that the spinning side spins
for a long time.

On PREEMPT_RT the spinlock_t does not disable preemption and so it is
possible that the task setting MNT_WRITE_HOLD is preempted by task with
higher priority which then spins infinitely waiting for MNT_WRITE_HOLD
to get removed.

Acquire mount_lock::lock which is held by setter of MNT_WRITE_HOLD. This
will PI-boost the owner and wait until the lock is dropped and which
means that MNT_WRITE_HOLD is cleared again.

Link: https://lore.kernel.org/r/20211025152218.opvcqfku2lhqvp4o@linutronix.de
Link: https://lore.kernel.org/r/20211125120711.dgbsienyrsxfzpoi@linutronix.de
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
fs/namespace.c

index 659a8f3..3ab45b4 100644 (file)
@@ -343,8 +343,24 @@ int __mnt_want_write(struct vfsmount *m)
         * incremented count after it has set MNT_WRITE_HOLD.
         */
        smp_mb();
-       while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD)
-               cpu_relax();
+       might_lock(&mount_lock.lock);
+       while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) {
+               if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+                       cpu_relax();
+               } else {
+                       /*
+                        * This prevents priority inversion, if the task
+                        * setting MNT_WRITE_HOLD got preempted on a remote
+                        * CPU, and it prevents life lock if the task setting
+                        * MNT_WRITE_HOLD has a lower priority and is bound to
+                        * the same CPU as the task that is spinning here.
+                        */
+                       preempt_enable();
+                       lock_mount_hash();
+                       unlock_mount_hash();
+                       preempt_disable();
+               }
+       }
        /*
         * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will
         * be set to match its requirements. So we must not load that until