Documentation/admin-guide/hw-vuln/core-scheduling.rst

.. SPDX-License-Identifier: GPL-2.0

===============
Core Scheduling
===============
Core scheduling support allows userspace to define groups of tasks that can
share a core. These groups can be specified either for security use cases (one
group of tasks doesn't trust another) or for performance use cases (some
workloads may benefit from running on the same core because they don't need the
same hardware resources of the shared core, while others may prefer different
cores because they do share hardware resource needs). This document only
describes the security use case.

Security use case
-----------------
A cross-HT attack involves the attacker and victim running on different Hyper
Threads of the same core. MDS and L1TF are examples of such attacks. The only
full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
scheduling is a scheduler feature that can mitigate some (not all) cross-HT
attacks. It allows HT to be turned on safely by ensuring that only tasks in a
user-designated trusted group can share a core. This increase in core sharing
can also improve performance. An improvement is not guaranteed, though it has
been observed with a number of real world workloads. In theory, core scheduling
aims to perform at least as well as when Hyper Threading is disabled. In
practice, this is mostly but not always the case, because synchronizing
scheduling decisions across 2 or more CPUs in a core involves additional
overhead, especially when the system is lightly loaded. When
``total_threads <= N_CPUS/2``, where ``N_CPUS`` is the total number of CPUs, the
extra overhead may cause core scheduling to perform worse than with SMT
disabled. Always measure the performance of your workloads.

Usage
-----
Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
Using this feature, userspace defines groups of tasks that can be co-scheduled
on the same core. The core scheduler uses this information to make sure that
tasks that are not in the same group never run simultaneously on a core, while
doing its best to satisfy the system's scheduling requirements.

Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
This interface provides support for the creation of core scheduling groups, as
well as admission and removal of tasks from created groups::

    #include <sys/prctl.h>

    int prctl(int option, unsigned long arg2, unsigned long arg3,
            unsigned long arg4, unsigned long arg5);

option:
    ``PR_SCHED_CORE``

arg2:
    Command for operation, must be one of:

    - ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``.
    - ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``.
    - ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``.
    - ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``.

arg3:
    ``pid`` of the task for which the operation applies.

arg4:
    ``pid_type`` for which the operation applies. It is of type ``enum pid_type``.
    For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command
    will be performed for all tasks in the thread group of ``pid``.

arg5:
    userspace pointer to an unsigned long for storing the cookie returned by
    the ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.

In order for a process to push a cookie to, or pull a cookie from, another
process, it is required to have the ptrace access mode
``PTRACE_MODE_READ_REALCREDS`` to that process.
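
As a minimal sketch of the interface described above, the following C helpers
wrap the create and get commands. The fallback ``#define`` values are assumed to
match ``include/uapi/linux/prctl.h``. ``enum pid_type`` is not exported to
userspace, so the numeric ``CS_PIDTYPE_*`` constants, like the helper names, are
assumptions made for this example and not part of the interface.

```c
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for older userspace headers; values assumed to match
 * include/uapi/linux/prctl.h. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE           62
#define PR_SCHED_CORE_GET       0
#define PR_SCHED_CORE_CREATE    1
#endif

/* Hypothetical names: numeric values assumed to mirror the kernel's
 * enum pid_type, which has no userspace header. */
#define CS_PIDTYPE_PID  0
#define CS_PIDTYPE_TGID 1

/* Create a new unique cookie for pid (arg3) with the given pid_type (arg4).
 * Returns 0 on success, -1 with errno set on failure. */
static int core_sched_create(pid_t pid, int pid_type)
{
    return prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
                 (unsigned long)pid, (unsigned long)pid_type, 0UL);
}

/* Read the cookie of a single task into *cookie (arg5).
 * Returns 0 on success, -1 with errno set on failure. */
static int core_sched_get(pid_t pid, unsigned long *cookie)
{
    return prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET,
                 (unsigned long)pid, (unsigned long)CS_PIDTYPE_PID,
                 (unsigned long)cookie);
}
```

A caller would typically invoke ``core_sched_create(pid, CS_PIDTYPE_TGID)`` to
tag a whole process and then ``core_sched_get()`` to confirm that a non-zero
cookie is set. Both calls fail with -1 on kernels built without
``CONFIG_SCHED_CORE``, so the return value should always be checked.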

Building hierarchies of tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The simplest way to build hierarchies of threads/processes which share a
cookie and thus a core is to rely on the fact that the core-sched cookie is
inherited across forks/clones and execs. Setting a cookie for the
'initial' script/executable/daemon therefore places every spawned child in the
same core-sched group.
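
The inheritance rule above can be sketched as a small launcher: a forked child
tags itself with a fresh cookie and then execs the target program, so that
everything the program spawns later lands in the same group. The helper name
``run_in_new_core_group()``, the exit codes 126 and 127, the fallback constant
values, and the use of 0 as "the calling task" for ``pid`` are assumptions of
this example.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older userspace headers; values assumed to match
 * include/uapi/linux/prctl.h. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE           62
#define PR_SCHED_CORE_CREATE    1
#endif

/* Hypothetical name: value assumed to mirror the kernel's PIDTYPE_PID. */
#define CS_PIDTYPE_PID 0

/* Run argv in a child that carries a new core-sched cookie.
 * Returns the child's exit status, or -1 on fork/wait failure. */
static int run_in_new_core_group(char *const argv[])
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;

    if (child == 0) {
        /* pid 0 is assumed to address the calling task. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0UL,
                  (unsigned long)CS_PIDTYPE_PID, 0UL) == -1)
            _exit(126);         /* fail closed: never run untagged */
        execvp(argv[0], argv);  /* the cookie is kept across exec */
        _exit(127);             /* exec itself failed */
    }

    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Exiting instead of continuing when the cookie cannot be created is a deliberate
choice in this sketch: a workload that silently runs untagged would share a core
with every other cookie-0 task.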

Cookie Transferral
~~~~~~~~~~~~~~~~~~
Transferring a cookie between the current and other tasks is possible using
``PR_SCHED_CORE_SHARE_FROM`` to inherit a cookie from a specified task and
``PR_SCHED_CORE_SHARE_TO`` to share a cookie with a task. In combination this
allows a simple helper program to pull a cookie from a task in an existing core
scheduling group and share it with already running tasks.
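
The helper program mentioned above could look roughly like the following
sketch. It first pulls the cookie of ``src`` onto the calling task and then
pushes that cookie to each target. The function name
``core_sched_copy_cookie()``, the fallback constant values, and the numeric
``CS_PIDTYPE_PID`` are assumptions of this example.

```c
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for older userspace headers; values assumed to match
 * include/uapi/linux/prctl.h. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE               62
#define PR_SCHED_CORE_SHARE_TO      2
#define PR_SCHED_CORE_SHARE_FROM    3
#endif

/* Hypothetical name: value assumed to mirror the kernel's PIDTYPE_PID. */
#define CS_PIDTYPE_PID 0

/* Copy the cookie of src to every pid in dst[0..n-1].
 * Returns the number of targets that accepted the cookie,
 * or -1 if the cookie could not be pulled from src. */
static int core_sched_copy_cookie(pid_t src, const pid_t *dst, size_t n)
{
    int shared = 0;

    /* Step 1: the calling task inherits the cookie of src. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM,
              (unsigned long)src, (unsigned long)CS_PIDTYPE_PID, 0UL) == -1)
        return -1;

    /* Step 2: push the calling task's cookie to each target. */
    for (size_t i = 0; i < n; i++) {
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO,
                  (unsigned long)dst[i], (unsigned long)CS_PIDTYPE_PID,
                  0UL) == 0)
            shared++;
    }
    return shared;
}
```

Note that the helper itself joins the group as a side effect of the pull, and
that each call is subject to the ptrace access check described in the usage
section, so a return value smaller than ``n`` usually indicates a permission
problem for some of the targets.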

Design/Implementation
---------------------
Each task that is tagged is assigned a cookie internally in the kernel. As
mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
each other and share a core.

The basic idea is that every schedule event tries to select tasks for all the
siblings of a core such that all the selected tasks running on a core are
trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
The idle task is considered special, as it trusts everything and everything
trusts it.

During a schedule() event on any sibling of a core, the highest priority task on
the sibling's core is picked and assigned to the sibling calling schedule(), if
the sibling has the task enqueued. For the rest of the siblings in the core, the
highest priority task with the same cookie is selected if there is one runnable
in their individual run queues. If a task with the same cookie is not available,
the idle task is selected. The idle task is globally trusted.
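
The selection steps described above can be illustrated with a small userspace
model. This is not the kernel's implementation: the ``struct task`` and
``struct runqueue`` layouts, the fixed ``MAX_TASKS`` bound, the function names
and the convention that a larger ``prio`` means higher priority are all
assumptions made for this sketch.

```c
#include <stddef.h>

#define MAX_TASKS 8

struct task {
    int prio;               /* larger value = higher priority (assumed) */
    unsigned long cookie;
};

struct runqueue {
    struct task tasks[MAX_TASKS];
    size_t nr;              /* number of runnable tasks */
};

/* Index of the highest priority task on rq, optionally restricted to
 * tasks carrying the given cookie. Returns -1 if there is none. */
static int pick(const struct runqueue *rq, int match, unsigned long cookie)
{
    int best = -1;

    for (size_t i = 0; i < rq->nr; i++) {
        if (match && rq->tasks[i].cookie != cookie)
            continue;
        if (best < 0 || rq->tasks[i].prio > rq->tasks[best].prio)
            best = (int)i;
    }
    return best;
}

/* Core-wide pick: picks[s] receives a task index for sibling s, or -1
 * when the sibling must run idle. Returns the number of siblings that
 * are forced idle, i.e. idle although their runqueue is not empty. */
static int core_pick(const struct runqueue *rqs, size_t nr_siblings,
                     int *picks)
{
    const struct task *max = NULL;
    int forced_idle = 0;

    /* Step 1: find the highest priority task across the whole core. */
    for (size_t s = 0; s < nr_siblings; s++) {
        int i = pick(&rqs[s], 0, 0);

        if (i >= 0 && (!max || rqs[s].tasks[i].prio > max->prio))
            max = &rqs[s].tasks[i];
    }

    /* Step 2: every sibling runs its best task with that cookie,
     * or idle if it has no such task. */
    for (size_t s = 0; s < nr_siblings; s++) {
        picks[s] = max ? pick(&rqs[s], 1, max->cookie) : -1;
        if (picks[s] < 0 && rqs[s].nr > 0)
            forced_idle++;
    }
    return forced_idle;
}
```

For example, if sibling 0 holds the core-wide highest priority task with cookie
1 and sibling 1 only has cookie-2 tasks queued, ``core_pick()`` assigns idle to
sibling 1 and reports one forced-idle sibling.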

Once a task has been selected for all the siblings in the core, an IPI is sent to
siblings for whom a new task was selected. Siblings on receiving the IPI will
switch to the new task immediately. If an idle task is selected for a sibling,
then the sibling is considered to be in a *forced idle* state, i.e., it may
have tasks on its own runqueue to run, but it will still have to run idle.
More on this in the next section.

Forced-idling of hyperthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The scheduler tries its best to find tasks that trust each other such that all
tasks selected to be scheduled are of the highest priority in a core. However,
it is possible that some runqueues had tasks that were incompatible with the
highest priority ones in the core. Favoring security over fairness, one or more
siblings could be forced to select a lower priority task if the highest
priority task is not trusted with respect to the core wide highest priority
task. If a sibling does not have a trusted task to run, it will be forced idle
by the scheduler (the idle thread is scheduled to run).

When the highest priority task is selected to run, a reschedule-IPI is sent to
the sibling to force it into idle. This results in 4 cases which need to be
considered depending on whether a VM or a regular usermode process was running
on either HT::

          HT1 (attack)            HT2 (victim)
   A      idle -> user space      user space -> idle
   B      idle -> user space      guest -> idle
   C      idle -> guest           user space -> idle
   D      idle -> guest           guest -> idle

Note that for better performance, we do not wait for the destination CPU
(victim) to enter idle mode. This is because the sending of the IPI would bring
the destination CPU immediately into kernel mode from user space, or VMEXIT
in the case of guests. At most, this would only leak some scheduler metadata
which may not be worth protecting. It is also possible that the IPI is received
too late on some architectures, but this has not been observed in the case of
x86.

Trust model
~~~~~~~~~~~
Core scheduling maintains trust relationships amongst groups of tasks by
assigning them a tag that is the same cookie value.
When a system with core scheduling boots, all tasks are considered to trust
each other. This is because the core scheduler does not have information about
trust relationships until userspace uses the above mentioned interfaces to
communicate them. In other words, all tasks have a default cookie value of 0
and are considered system-wide trusted. The forced-idling of siblings running
cookie-0 tasks is also avoided.

Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
within such groups are considered to trust each other, but do not trust those
outside. Tasks outside the group also don't trust tasks within.
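
The trust rules above reduce to a symmetric comparison of cookies, with the idle
task as the only exception. The following sketch expresses that as a predicate.
The ``struct task`` layout, the ``is_idle`` flag and the function name
``can_share_core()`` are assumptions made for illustration and do not mirror
kernel data structures.

```c
#include <stdbool.h>

struct task {
    unsigned long cookie;   /* 0 = default, untagged */
    bool is_idle;           /* true for the idle task */
};

/* Two tasks may run concurrently on sibling hyperthreads if either one
 * is the idle task, or if both carry the same cookie. Untagged tasks all
 * carry cookie 0 and therefore trust each other. */
static bool can_share_core(const struct task *a, const struct task *b)
{
    if (a->is_idle || b->is_idle)
        return true;
    return a->cookie == b->cookie;
}
```

Because the check is a plain equality test, a tagged task and an untagged
(cookie 0) task are mutually untrusted, which matches the statement that tasks
outside a group do not trust tasks within it.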

Limitations of core-scheduling
------------------------------
Core scheduling tries to guarantee that only trusted tasks run concurrently on a
core. But there could be a small window of time during which untrusted tasks run
concurrently, or the kernel could be running concurrently with a task not
trusted by the kernel.

IPI processing delays
~~~~~~~~~~~~~~~~~~~~~
Core scheduling selects only trusted tasks to run together. An IPI is used to
notify the siblings to switch to the new task. But there could be hardware
delays in receiving the IPI on some architectures (on x86, this has not been
observed). This may cause an attacker task to start running on a CPU before its
siblings receive the IPI. Even though the cache is flushed on entry to user
mode, victim tasks on siblings may populate data in the cache and
microarchitectural buffers after the attacker starts to run, which opens a
possibility for a data leak.

Open cross-HT issues that core scheduling does not solve
--------------------------------------------------------
1. For MDS
~~~~~~~~~~
Core scheduling cannot protect against MDS attacks between siblings running in
user mode and siblings running in kernel mode. Even though all
siblings run tasks which trust each other, when the kernel is executing
code on behalf of a task, it cannot trust the code running in the
sibling. Such attacks are possible for any combination of sibling CPU modes
(host or guest mode).

2. For L1TF
~~~~~~~~~~~
Core scheduling cannot protect against an L1TF guest attacker exploiting a
guest or host victim. This is because the guest attacker can craft invalid
PTEs which are not inverted due to a vulnerable guest kernel. The only
solution is to disable EPT (Extended Page Tables).

For both MDS and L1TF, if the guest vCPUs are configured to not trust each
other (by tagging them separately), then the guest to guest attacks would go
away. Or it could be a system admin policy which considers guest to guest
attacks as a guest problem.

Another approach to resolve these would be to make every untrusted task on the
system not trust every other untrusted task. While this could reduce
parallelism of the untrusted tasks, it would still solve the above issues while
allowing system processes (trusted tasks) to share a core.

3. Protecting the kernel (IRQ, syscall, VMEXIT)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unfortunately, core scheduling does not protect kernel contexts running on
sibling hyperthreads from one another. Prototypes of mitigations have been posted
to LKML to solve this, but it is debatable whether such windows are practically
exploitable, and whether the performance overhead of the prototypes is worth
it (not to mention the added code complexity).

Other Use cases
---------------
The main use case for core scheduling is mitigating the cross-HT vulnerabilities
with SMT enabled. There are other use cases where this feature could be used:

- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
  that use SIMD instructions etc.
- Gang scheduling: Requirements for a group of tasks that need to be scheduled
  together could also be realized using core scheduling. One example is vCPUs of
  a VM.