Documentation/scheduler/sched-util-clamp.rst

   1 .. SPDX-License-Identifier: GPL-2.0
   2
   3 ====================
   4 Utilization Clamping
   5 ====================
   6
   7 1. Introduction
   8 ===============
   9
Utilization clamping, also known as util clamp or uclamp, is a scheduler
feature that allows user space to help manage the performance requirements
of tasks. It was introduced in the v5.3 release. Cgroup support was merged in
v5.4.
  14
Uclamp is a hinting mechanism that allows the scheduler to understand the
performance requirements and restrictions of tasks, and thus helps it make
better decisions. When the schedutil cpufreq governor is used, util clamp
influences CPU frequency selection as well.
  19
  20 Since the scheduler and schedutil are both driven by PELT (util_avg) signals,
  21 util clamp acts on that to achieve its goal by clamping the signal to a certain
  22 point; hence the name. That is, by clamping utilization we are making the
  23 system run at a certain performance point.
  24
The right way to view util clamp is as a mechanism to make requests or hints
about performance constraints. It consists of two tunables:
  27
  28         * UCLAMP_MIN, which sets the lower bound.
  29         * UCLAMP_MAX, which sets the upper bound.
  30
  31 These two bounds will ensure a task will operate within this performance range
  32 of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies
  33 capping a task.
  34
One can tell the system (scheduler) that some tasks require a minimum
performance point to operate at to deliver the desired user experience. Or one
can tell the system that some tasks should be restricted from consuming too
many resources and should not go above a specific performance point. Viewing
the uclamp values as performance points rather than utilization is a better
abstraction from the user space point of view.
  41
  42 As an example, a game can use util clamp to form a feedback loop with its
  43 perceived Frames Per Second (FPS). It can dynamically increase the minimum
  44 performance point required by its display pipeline to ensure no frame is
  45 dropped. It can also dynamically 'prime' up these tasks if it knows in the
  46 coming few hundred milliseconds a computationally intensive scene is about to
  47 happen.
  48
  49 On mobile hardware where the capability of the devices varies a lot, this
  50 dynamic feedback loop offers a great flexibility to ensure best user experience
  51 given the capabilities of any system.
  52
  53 Of course a static configuration is possible too. The exact usage will depend
  54 on the system, application and the desired outcome.
  55
Another example is in Android where tasks are classified as background,
foreground, top-app, etc. Util clamp can be used to constrain how many
resources background tasks consume by capping the performance point they
can run at. This constraint helps reserve resources for important tasks, like
the ones belonging to the currently active app (top-app group). It also
helps limit how much power background tasks consume. This can be more obvious
in heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias
the background tasks to stay on the little cores, which will ensure that:

        1. The big cores are free to run top-app tasks immediately. top-app
           tasks are the tasks the user is currently interacting with, hence
           the most important tasks in the system.
        2. Background tasks don't run on a power hungry core and drain the
           battery even if they are CPU intensive tasks.
  70
  71 .. note::
  72   **little cores**:
  73     CPUs with capacity < 1024
  74
  75   **big cores**:
  76     CPUs with capacity = 1024
  77
  78 By making these uclamp performance requests, or rather hints, user space can
  79 ensure system resources are used optimally to deliver the best possible user
  80 experience.
  81
Another use case is to help with **overcoming the ramp up latency inherent in
how the scheduler utilization signal is calculated**.

For instance, a busy task that needs to run at the maximum performance point
will suffer a delay of ~200ms (PELT HALFLIFE = 32ms) before the scheduler
realizes that. This is known to affect workloads like gaming on mobile
devices, where frames will drop due to the slow response time to select the
higher frequency required for the tasks to finish their work in time. Setting
UCLAMP_MIN=1024 will ensure such tasks always see the highest performance
level when they start running.
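
To get a feel for where the ~200ms figure comes from, the ramp up can be
approximated with a simplified model: for a task that runs continuously, the
gap between its utilization and 1024 halves every half-life. The sketch below
assumes this idealized geometric behavior and a starting utilization of 0. The
function name is hypothetical and the real PELT code works on much finer
grained periods.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024u
#define PELT_HALFLIFE_MS     32u

/*
 * Approximate utilization of an always-running task that started at 0,
 * after n whole half-lives (n * PELT_HALFLIFE_MS milliseconds).
 */
unsigned int pelt_util_after_halflives(unsigned int n)
{
	unsigned int gap;

	/* After more than 10 halvings the remaining gap rounds down to 0. */
	gap = n > 10 ? 0 : SCHED_CAPACITY_SCALE >> n;
	assert(gap <= SCHED_CAPACITY_SCALE);

	return SCHED_CAPACITY_SCALE - gap;
}
```

In this model, 6 half-lives (192ms) give 1024 - 16 = 1008, so it takes roughly
200ms before the task is seen as needing close to the maximum performance
point.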
  92
  93 The overall visible effect goes beyond better perceived user
  94 experience/performance and stretches to help achieve a better overall
  95 performance/watt if used effectively.
  96
  97 User space can form a feedback loop with the thermal subsystem too to ensure
  98 the device doesn't heat up to the point where it will throttle.
  99
 100 Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints.
 101
In the SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any
performance point rather than being tied to MAX frequency all the time. This
can be useful on general purpose systems that run on battery powered devices.

Note that by design RT tasks don't have a per-task PELT signal and must always
run at a constant frequency to combat non-deterministic DVFS ramp up delays.

Note that using schedutil always implies a single delay to modify the frequency
when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only
helps pick what frequency to request instead of schedutil always requesting
MAX for all RT tasks.
 113
 114 See :ref:`section 3.4 <uclamp-default-values>` for default values and
 115 :ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change RT tasks
 116 default value.
 117
 118 2. Design
 119 =========
 120
 121 Util clamp is a property of every task in the system. It sets the boundaries of
 122 its utilization signal; acting as a bias mechanism that influences certain
 123 decisions within the scheduler.
 124
The actual utilization signal of a task is never clamped in reality. If you
inspect PELT signals at any point in time you should continue to see them
intact. Clamping happens only when needed, e.g. when a task wakes up
and the scheduler needs to select a suitable CPU for it to run on.
 129
 130 Since the goal of util clamp is to allow requesting a minimum and maximum
 131 performance point for a task to run on, it must be able to influence the
 132 frequency selection as well as task placement to be most effective. Both of
 133 which have implications on the utilization value at CPU runqueue (rq for short)
 134 level, which brings us to the main design challenge.
 135
When a task wakes up on an rq, the utilization signal of the rq will be
affected by the uclamp settings of all the tasks enqueued on it. For example,
if a task requests to run at UCLAMP_MIN = 512, then the util signal of the rq
needs to respect this request as well as all other requests from all of the
enqueued tasks.
 141
 142 To be able to aggregate the util clamp value of all the tasks attached to the
 143 rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the
 144 scheduler hot path. Hence care must be taken since any slow down will have
 145 significant impact on a lot of use cases and could hinder its usability in
 146 practice.
 147
 148 The way this is handled is by dividing the utilization range into buckets
 149 (struct uclamp_bucket) which allows us to reduce the search space from every
 150 task on the rq to only a subset of tasks on the top-most bucket.
 151
 152 When a task is enqueued, the counter in the matching bucket is incremented,
 153 and on dequeue it is decremented. This makes keeping track of the effective
 154 uclamp value at rq level a lot easier.
 155
 156 As tasks are enqueued and dequeued, we keep track of the current effective
 157 uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on
 158 how this works.
 159
Later, any path that wants to identify the effective uclamp value of the rq
simply needs to read this value at the exact moment it needs to make
a decision.

For the task placement case, only Energy Aware and Capacity Aware Scheduling
(EAS/CAS) make use of uclamp for now, which implies that it is applied on
heterogeneous systems only.
When a task wakes up, the scheduler will look at the current effective uclamp
value of every rq and compare it with the potential new value if the task were
to be enqueued there, favoring the rq that will end up with the most energy
efficient combination.
 171
 172 Similarly in schedutil, when it needs to make a frequency update it will look
 173 at the current effective uclamp value of the rq which is influenced by the set
 174 of tasks currently enqueued there and select the appropriate frequency that
 175 will satisfy constraints from requests.
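
As a rough illustration of the frequency selection described above, the sketch
below clamps the rq utilization with the rq's effective uclamp values and maps
the result linearly onto the frequency range. The function names and the plain
linear mapping are assumptions made for illustration; schedutil itself applies
additional headroom and other adjustments that are left out here.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Clamp the rq utilization to [uclamp_min, uclamp_max]. */
unsigned long uclamp_rq_util(unsigned long util, unsigned long uclamp_min,
			     unsigned long uclamp_max)
{
	assert(uclamp_min <= SCHED_CAPACITY_SCALE);
	assert(uclamp_max <= SCHED_CAPACITY_SCALE);

	if (util < uclamp_min)
		util = uclamp_min;
	if (util > uclamp_max)
		util = uclamp_max;
	return util;
}

/* Map a clamped utilization linearly onto [0, max_freq_khz]. */
unsigned long util_to_freq_khz(unsigned long util, unsigned long max_freq_khz)
{
	assert(util <= SCHED_CAPACITY_SCALE);
	return max_freq_khz * util / SCHED_CAPACITY_SCALE;
}
```

With these helpers, a small task (util 100) boosted with UCLAMP_MIN = 512 is
treated as if the rq utilization were 512, and a busy task (util 900) capped
with UCLAMP_MAX = 512 is treated the same way.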
 176
 177 Other paths like setting overutilization state (which effectively disables EAS)
 178 make use of uclamp as well. Such cases are considered necessary housekeeping to
 179 allow the 2 main use cases above and will not be covered in detail here as they
 180 could change with implementation details.
 181
 182 .. _uclamp-buckets:
 183
 184 2.1. Buckets
 185 ------------
 186
 187 ::
 188
 189                            [struct rq]
 190
 191   (bottom)                                                    (top)
 192
 193     0                                                          1024
 194     |                                                           |
 195     +-----------+-----------+-----------+----   ----+-----------+
 196     |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N |
 197     +-----------+-----------+-----------+----   ----+-----------+
 198        :           :                                   :
 199        +- p0       +- p3                               +- p4
 200        :                                               :
 201        +- p1                                           +- p5
 202        :
 203        +- p2
 204
 205
 206 .. note::
 207   The diagram above is an illustration rather than a true depiction of the
 208   internal data structure.
 209
 210 To reduce the search space when trying to decide the effective uclamp value of
 211 an rq as tasks are enqueued/dequeued, the whole utilization range is divided
 212 into N buckets where N is configured at compile time by setting
 213 CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5.
 214
The rq has a set of buckets for each uclamp_id tunable: [UCLAMP_MIN, UCLAMP_MAX].
 216
 217 The range of each bucket is 1024/N. For example, for the default value of
 218 5 there will be 5 buckets, each of which will cover the following range:
 219
 220 ::
 221
 222         DELTA = round_closest(1024/5) = 204.8 = 205
 223
 224         Bucket 0: [0:204]
 225         Bucket 1: [205:409]
 226         Bucket 2: [410:614]
 227         Bucket 3: [615:819]
 228         Bucket 4: [820:1024]
 229
When a task p with the following tunable parameters
 231
 232 ::
 233
 234         p->uclamp[UCLAMP_MIN] = 300
 235         p->uclamp[UCLAMP_MAX] = 1024
 236
 237 is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket
 238 4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in
 239 this range.
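
The mapping from a clamp value to its bucket can be sketched as an integer
division by the bucket width. The snippet below is a minimal illustration that
assumes the default of 5 buckets; the names are chosen for this example and the
last bucket absorbs the values left over by rounding.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024u
#define UCLAMP_BUCKETS       5u	/* stands in for CONFIG_UCLAMP_BUCKETS_COUNT */

/* Integer equivalent of round_closest(1024 / N): 205 for N = 5. */
#define UCLAMP_BUCKET_DELTA \
	((SCHED_CAPACITY_SCALE + UCLAMP_BUCKETS / 2) / UCLAMP_BUCKETS)

/* Return the index of the bucket that covers clamp_value. */
unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	unsigned int id;

	assert(clamp_value <= SCHED_CAPACITY_SCALE);
	id = clamp_value / UCLAMP_BUCKET_DELTA;

	/* Because of rounding, the last bucket also takes the top values. */
	return id < UCLAMP_BUCKETS ? id : UCLAMP_BUCKETS - 1;
}
```

For the task above, a UCLAMP_MIN of 300 lands in bucket 1 and a UCLAMP_MAX of
1024 lands in bucket 4, matching the ranges listed earlier.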
 240
 241 The rq then keeps track of its current effective uclamp value for each
 242 uclamp_id.
 243
 244 When a task p is enqueued, the rq value changes to:
 245
 246 ::
 247
 248         // update bucket logic goes here
 249         rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN])
 250         // repeat for UCLAMP_MAX
 251
 252 Similarly, when p is dequeued the rq value changes to:
 253
 254 ::
 255
 256         // update bucket logic goes here
 257         rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value()
 258         // repeat for UCLAMP_MAX
 259
 260 When all buckets are empty, the rq uclamp values are reset to system defaults.
 261 See :ref:`section 3.4 <uclamp-default-values>` for details on default values.
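
The enqueue and dequeue steps above can be modelled for a single uclamp_id as
follows. This is a simplified sketch and not the kernel data structure: the
struct and function names are hypothetical, the bucket width is fixed to the
default 5 bucket layout, and an empty rq simply falls back to a default value
passed in by the caller.

```c
#include <assert.h>

#define NR_BUCKETS   5u
#define BUCKET_DELTA 205u

struct bucket {
	unsigned int tasks;	/* number of enqueued tasks in this bucket */
	unsigned int value;	/* highest clamp value seen while non-empty */
};

struct rq_clamp {
	struct bucket bucket[NR_BUCKETS];
	unsigned int nr_tasks;	/* total tasks tracked on this rq */
	unsigned int value;	/* current effective clamp value of the rq */
	unsigned int dflt;	/* fallback value when all buckets are empty */
};

static unsigned int bucket_id(unsigned int v)
{
	unsigned int id = v / BUCKET_DELTA;

	return id < NR_BUCKETS ? id : NR_BUCKETS - 1;
}

void rq_clamp_init(struct rq_clamp *rq, unsigned int dflt)
{
	for (unsigned int i = 0; i < NR_BUCKETS; i++) {
		rq->bucket[i].tasks = 0;
		rq->bucket[i].value = 0;
	}
	rq->nr_tasks = 0;
	rq->value = dflt;
	rq->dflt = dflt;
}

void rq_clamp_enqueue(struct rq_clamp *rq, unsigned int v)
{
	struct bucket *b = &rq->bucket[bucket_id(v)];

	if (b->tasks++ == 0 || v > b->value)
		b->value = v;
	/* The first task replaces the default, later ones are max-aggregated. */
	if (rq->nr_tasks++ == 0 || v > rq->value)
		rq->value = v;
}

void rq_clamp_dequeue(struct rq_clamp *rq, unsigned int v)
{
	struct bucket *b = &rq->bucket[bucket_id(v)];

	assert(b->tasks > 0);
	b->tasks--;
	rq->nr_tasks--;

	/* Scan from the top-most bucket down to the first non-empty one. */
	for (unsigned int i = NR_BUCKETS; i-- > 0; ) {
		if (rq->bucket[i].tasks) {
			rq->value = rq->bucket[i].value;
			return;
		}
	}
	rq->value = rq->dflt;	/* all buckets empty */
}
```

Note that in this sketch a bucket only remembers the highest value it has seen
since it last became non-empty, so the result is an estimate within that
bucket's range rather than an exact per-task maximum. That is the price paid
for keeping the dequeue search limited to a handful of buckets.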
 262
 263
 264 2.2. Max aggregation
 265 --------------------
 266
 267 Util clamp is tuned to honour the request for the task that requires the
 268 highest performance point.
 269
 270 When multiple tasks are attached to the same rq, then util clamp must make sure
 271 the task that needs the highest performance point gets it even if there's
 272 another task that doesn't need it or is disallowed from reaching this point.
 273
 274 For example, if there are multiple tasks attached to an rq with the following
 275 values:
 276
 277 ::
 278
 279         p0->uclamp[UCLAMP_MIN] = 300
 280         p0->uclamp[UCLAMP_MAX] = 900
 281
 282         p1->uclamp[UCLAMP_MIN] = 500
 283         p1->uclamp[UCLAMP_MAX] = 500
 284
 285 then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN
 286 and UCLAMP_MAX become:
 287
 288 ::
 289
 290         rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500
 291         rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900
 292
As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max
aggregation is the cause of one of the limitations when using util clamp, in
particular for the UCLAMP_MAX hint when user space would like to save power.
 296
 297 2.3. Hierarchical aggregation
 298 -----------------------------
 299
 300 As stated earlier, util clamp is a property of every task in the system. But
 301 the actual applied (effective) value can be influenced by more than just the
 302 request made by the task or another actor on its behalf (middleware library).
 303
 304 The effective util clamp value of any task is restricted as follows:
 305
 306   1. By the uclamp settings defined by the cgroup CPU controller it is attached
 307      to, if any.
 308   2. The restricted value in (1) is then further restricted by the system wide
 309      uclamp settings.
 310
 311 :ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand
 312 further on that.
 313
 314 For now suffice to say that if a task makes a request, its actual effective
 315 value will have to adhere to some restrictions imposed by cgroup and system
 316 wide settings.
 317
The system will still accept the request even if it is effectively beyond the
constraints, but as soon as the task moves to a different cgroup or a sysadmin
modifies the system settings, the request will be satisfied only if it is
within the new constraints.
 322
 323 In other words, this aggregation will not cause an error when a task changes
 324 its uclamp values, but rather the system may not be able to satisfy requests
 325 based on those factors.
 326
 327 2.4. Range
 328 ----------
 329
 330 Uclamp performance request has the range of 0 to 1024 inclusive.
 331
 332 For cgroup interface percentage is used (that is 0 to 100 inclusive).
 333 Just like other cgroup interfaces, you can use 'max' instead of 100.
 334
 335 .. _uclamp-interfaces:
 336
 337 3. Interfaces
 338 =============
 339
 340 3.1. Per task interface
 341 -----------------------
 342
The sched_setattr() syscall was extended to accept two new fields:

* sched_util_min: requests the minimum performance point the system should run
  at when this task is running, i.e. the lower performance bound.
* sched_util_max: requests the maximum performance point the system should run
  at when this task is running, i.e. the upper performance bound.

For example, the following scenario has 40% to 80% utilization constraints:
 351
 352 ::
 353
 354         attr->sched_util_min = 40% * 1024;
 355         attr->sched_util_max = 80% * 1024;
 356
 357 When task @p is running, **the scheduler should try its best to ensure it
 358 starts at 40% performance level**. If the task runs for a long enough time so
 359 that its actual utilization goes above 80%, the utilization, or performance
 360 level, will be capped.
 361
 362 The special value -1 is used to reset the uclamp settings to the system
 363 default.
 364
 365 Note that resetting the uclamp value to system default using -1 is not the same
 366 as manually setting uclamp value to system default. This distinction is
 367 important because as we shall see in system interfaces, the default value for
 368 RT could be changed. SCHED_NORMAL/OTHER might gain similar knobs too in the
 369 future.
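
A user space request along the lines of the 40% to 80% example could be
sketched as below. The struct is a local copy of the sched_attr layout and the
flag values mirror the kernel UAPI header; the helper names, the renamed
struct and the integer rounding of the percentages are assumptions made for
this example. The call is expected to fail on kernels built without uclamp
support.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Flag values as defined in the kernel UAPI header linux/sched.h. */
#define UCLAMP_FLAG_KEEP_ALL	0x18	/* SCHED_FLAG_KEEP_POLICY | _KEEP_PARAMS */
#define UCLAMP_FLAG_MIN		0x20	/* SCHED_FLAG_UTIL_CLAMP_MIN */
#define UCLAMP_FLAG_MAX		0x40	/* SCHED_FLAG_UTIL_CLAMP_MAX */

/* Local copy of the struct sched_attr layout, renamed to avoid clashes. */
struct uclamp_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Convert a percentage (0..100) to the 0..1024 uclamp range. */
uint32_t uclamp_from_percent(unsigned int pct)
{
	assert(pct <= 100);
	return pct * 1024u / 100u;
}

/* Request [min_pct, max_pct] for pid (0 = calling thread); 0 on success. */
int set_uclamp_percent(pid_t pid, unsigned int min_pct, unsigned int max_pct)
{
	struct uclamp_sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	/* Only touch the clamps; keep the current policy and parameters. */
	attr.sched_flags = UCLAMP_FLAG_KEEP_ALL | UCLAMP_FLAG_MIN |
			   UCLAMP_FLAG_MAX;
	attr.sched_util_min = uclamp_from_percent(min_pct);
	attr.sched_util_max = uclamp_from_percent(max_pct);

	return (int)syscall(SYS_sched_setattr, pid, &attr, 0);
}
```

Calling ``set_uclamp_percent(0, 40, 80)`` would then request the constraints
from the example for the calling thread, with 40% and 80% mapping to 409 and
819 respectively under the rounding used here.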
 370
 371 3.2. cgroup interface
 372 ---------------------
 373
 374 There are two uclamp related values in the CPU cgroup controller:
 375
 376 * cpu.uclamp.min
 377 * cpu.uclamp.max
 378
 379 When a task is attached to a CPU controller, its uclamp values will be impacted
 380 as follows:
 381
 382 * cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup
 383   v2 documentation <cgroupv2-protections-distributor>`.
 384
 385   If a task uclamp_min value is lower than cpu.uclamp.min, then the task will
 386   inherit the cgroup cpu.uclamp.min value.
 387
 388   In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child,
 389   parent).
 390
 391 * cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2
 392   documentation <cgroupv2-limits-distributor>`.
 393
 394   If a task uclamp_max value is higher than cpu.uclamp.max, then the task will
 395   inherit the cgroup cpu.uclamp.max value.
 396
 397   In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child,
 398   parent).
 399
 400 For example, given following parameters:
 401
 402 ::
 403
 404         p0->uclamp[UCLAMP_MIN] = // system default;
 405         p0->uclamp[UCLAMP_MAX] = // system default;
 406
 407         p1->uclamp[UCLAMP_MIN] = 40% * 1024;
 408         p1->uclamp[UCLAMP_MAX] = 50% * 1024;
 409
 410         cgroup0->cpu.uclamp.min = 20% * 1024;
 411         cgroup0->cpu.uclamp.max = 60% * 1024;
 412
 413         cgroup1->cpu.uclamp.min = 60% * 1024;
 414         cgroup1->cpu.uclamp.max = 100% * 1024;
 415
 416 when p0 and p1 are attached to cgroup0, the values become:
 417
 418 ::
 419
 420         p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024;
 421         p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024;
 422
 423         p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact
 424         p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
 425
 426 when p0 and p1 are attached to cgroup1, these instead become:
 427
 428 ::
 429
 430         p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
 431         p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024;
 432
 433         p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024;
 434         p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact
 435
Note that the cgroup interface allows the cpu.uclamp.max value to be lower
than cpu.uclamp.min. Other interfaces don't allow that.
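
The two rules used in the cgroup0/cgroup1 example above can be written down as
a small helper. This sketch follows the rules exactly as stated in this
section and works in percent to avoid rounding; the type and function names
are hypothetical, and p0's system defaults are assumed to be 0% and 100%.

```c
#include <assert.h>

struct uclamp_pct {
	unsigned int min;	/* percent, 0..100 */
	unsigned int max;	/* percent, 0..100 */
};

/*
 * cpu.uclamp.min raises a lower task minimum (protection),
 * cpu.uclamp.max lowers a higher task maximum (limit).
 */
struct uclamp_pct cgroup_restrict(struct uclamp_pct task, struct uclamp_pct cg)
{
	assert(task.min <= 100 && task.max <= 100);
	assert(cg.min <= 100 && cg.max <= 100);

	if (task.min < cg.min)
		task.min = cg.min;
	if (task.max > cg.max)
		task.max = cg.max;
	return task;
}
```

Feeding p1 (40%, 50%) and cgroup1 (60%, 100%) into this helper reproduces the
last case of the example, where the restricted minimum of 60% ends up above
the task's own maximum of 50%.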
 438
 439 3.3. System interface
 440 ---------------------
 441
 442 3.3.1 sched_util_clamp_min
 443 --------------------------
 444
 445 System wide limit of allowed UCLAMP_MIN range. By default it is set to 1024,
 446 which means that permitted effective UCLAMP_MIN range for tasks is [0:1024].
 447 By changing it to 512 for example the range reduces to [0:512]. This is useful
 448 to restrict how much boosting tasks are allowed to acquire.
 449
Requests from tasks to go above this knob value will still succeed, but
they won't be satisfied until the knob is raised to at least
p->uclamp[UCLAMP_MIN].
 452
 453 The value must be smaller than or equal to sched_util_clamp_max.
 454
 455 3.3.2 sched_util_clamp_max
 456 --------------------------
 457
 458 System wide limit of allowed UCLAMP_MAX range. By default it is set to 1024,
 459 which means that permitted effective UCLAMP_MAX range for tasks is [0:1024].
 460
By changing it to 512 for example the effective allowed range reduces to
[0:512]. This means that no task can run above 512, which implies that all
rqs are restricted too. In other words, the whole system is capped to half its
performance capacity.
 465
 466 This is useful to restrict the overall maximum performance point of the system.
 467 For example, it can be handy to limit performance when running low on battery
 468 or when the system wants to limit access to more energy hungry performance
 469 levels when it's in idle state or screen is off.
 470
Requests from tasks to go above this knob value will still succeed, but they
won't be satisfied until the knob is raised to at least p->uclamp[UCLAMP_MAX].
 473
 474 The value must be greater than or equal to sched_util_clamp_min.
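
The effect of the two system wide knobs on a single request can be sketched as
a simple cap. The helper below is an illustration with a hypothetical name; it
only models the effective value and leaves the stored request untouched, which
is why a request above the knob still succeeds.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024u

/* Effective value of a task request under a system wide knob. */
unsigned int uclamp_system_restrict(unsigned int request, unsigned int knob)
{
	assert(request <= SCHED_CAPACITY_SCALE);
	assert(knob <= SCHED_CAPACITY_SCALE);

	/* The request itself is kept; only its effective value is limited. */
	return request > knob ? knob : request;
}
```

With a knob value of 512, a request of 800 is effectively limited to 512,
while a request of 300 passes through unchanged. Once the knob is raised back
to 1024, the same stored request of 800 becomes effective again.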
 475
 476 .. _uclamp-default-values:
 477
 478 3.4. Default values
 479 -------------------
 480
 481 By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to:
 482
 483 ::
 484
 485         p_fair->uclamp[UCLAMP_MIN] = 0
 486         p_fair->uclamp[UCLAMP_MAX] = 1024
 487
That is, by default they are neither boosted nor capped. Unlike RT tasks,
these defaults cannot be changed at boot or runtime. No argument has been made
yet as to why such a knob should be provided, but one can be added in the
future.
 491
 492 For SCHED_FIFO/SCHED_RR tasks:
 493
 494 ::
 495
 496         p_rt->uclamp[UCLAMP_MIN] = 1024
 497         p_rt->uclamp[UCLAMP_MAX] = 1024
 498
That is, by default they're boosted to run at the maximum performance point of
the system, which retains the historical behavior of RT tasks.

The default uclamp_min value of RT tasks can be modified at boot or runtime
via sysctl. See the section below.
 504
 505 .. _sched-util-clamp-min-rt-default:
 506
 507 3.4.1 sched_util_clamp_min_rt_default
 508 -------------------------------------
 509
Running RT tasks at the maximum performance point is expensive on battery
powered devices and not necessary. To allow system developers to offer good
performance guarantees for these tasks without pushing them all the way to
the maximum performance point, this sysctl knob allows tuning the boost value
to address the system requirements without burning power running at the
maximum performance point all the time.

Application developers are encouraged to use the per task util clamp interface
to ensure their tasks are performance and power aware. Ideally system
designers should set this knob to 0 and leave the task of managing performance
requirements to the apps.
 521
 522 4. How to use util clamp
 523 ========================
 524
Util clamp promotes the concept of user space assisted power and performance
management. On its own, the scheduler does not have the information required
to make the best decision. However, with util clamp user space can hint to the
scheduler to make better decisions about task placement and frequency
selection.

Best results are achieved by not making any assumptions about the system the
application is running on and by using util clamp in conjunction with a
feedback loop to dynamically monitor and adjust. Ultimately this will allow
for a better user experience at a better perf/watt.
 534
 535 For some systems and use cases, static setup will help to achieve good results.
 536 Portability will be a problem in this case. How much work one can do at 100,
 537 200 or 1024 is different for each system. Unless there's a specific target
 538 system, static setup should be avoided.
 539
There are enough possibilities to create a whole framework based on util clamp
or a self contained app that makes use of it directly.
 542
 543 4.1. Boost important and DVFS-latency-sensitive tasks
 544 -----------------------------------------------------
 545
A GUI task might not be busy enough to warrant driving the frequency high when
it wakes up. However, it needs to finish its work within a specific time
window to deliver the desired user experience. The right frequency it requires
at wakeup will be system dependent. On some underpowered systems it will be
high, on other overpowered ones it will be low or 0.

This task can increase its UCLAMP_MIN value every time it misses the deadline
to ensure that on the next wake up it runs at a higher performance point. It
should try to approach the lowest UCLAMP_MIN value that allows it to meet its
deadline on any particular system to achieve the best possible perf/watt for
that system.
 556
 557 On heterogeneous systems, it might be important for this task to run on
 558 a faster CPU.
 559
 560 **Generally it is advised to perceive the input as performance level or point
 561 which will imply both task placement and frequency selection**.
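
One possible shape for such a feedback loop is sketched below: raise
UCLAMP_MIN quickly on a missed deadline and lower it slowly otherwise, so that
the value settles near the lowest boost that still meets the deadline. The
step sizes and the function name are hypothetical tuning choices for this
example, not recommended values.

```c
#include <assert.h>
#include <stdbool.h>

#define UCLAMP_SCALE	1024u
#define STEP_UP		64u	/* hypothetical boost increment on a miss */
#define STEP_DOWN	8u	/* hypothetical decay when the deadline is met */

/* Return the UCLAMP_MIN value to request for the next activation. */
unsigned int next_uclamp_min(unsigned int cur, bool missed_deadline)
{
	assert(cur <= UCLAMP_SCALE);

	if (missed_deadline)
		return cur + STEP_UP > UCLAMP_SCALE ? UCLAMP_SCALE
						    : cur + STEP_UP;

	/* Probe downwards slowly to find the lowest value that still works. */
	return cur > STEP_DOWN ? cur - STEP_DOWN : 0;
}
```

The returned value would then be passed to the per task interface before the
next activation. Using a larger step up than step down is one way to favor
meeting the deadline over saving power while the loop is still converging.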
 562
 563 4.2. Cap background tasks
 564 -------------------------
 565
As explained for the Android case in the introduction, any app can lower
UCLAMP_MAX for some background tasks that don't care about performance but
could end up being busy and consuming unnecessary system resources.
 569
 570 4.3. Powersave mode
 571 -------------------
 572
 573 sched_util_clamp_max system wide interface can be used to limit all tasks from
 574 operating at the higher performance points which are usually energy
 575 inefficient.
 576
 577 This is not unique to uclamp as one can achieve the same by reducing max
 578 frequency of the cpufreq governor. It can be considered a more convenient
 579 alternative interface.
 580
 581 4.4. Per-app performance restriction
 582 ------------------------------------
 583
 584 Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX for an
 585 app every time it is executed to guarantee a minimum performance point and/or
 586 limit it from draining system power at the cost of reduced performance for
 587 these apps.
 588
If you want to prevent your laptop from heating up while compiling the kernel
on the go and are happy to sacrifice performance to save power, but would
still like to keep your browser performance intact, uclamp makes it possible.
 593
 594 5. Limitations
 595 ==============
 596
 597 .. _uclamp-capping-fail:
 598
 599 5.1. Capping frequency with uclamp_max fails under certain conditions
 600 ---------------------------------------------------------------------
 601
 602 If task p0 is capped to run at 512:
 603
 604 ::
 605
 606         p0->uclamp[UCLAMP_MAX] = 512
 607
 608 and it shares the rq with p1 which is free to run at any performance point:
 609
 610 ::
 611
 612         p1->uclamp[UCLAMP_MAX] = 1024
 613
 614 then due to max aggregation the rq will be allowed to reach max performance
 615 point:
 616
 617 ::
 618
 619         rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024
 620
 621 Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for
 622 the rq will depend on the actual utilization value of the tasks.
 623
If p1 is a small task but p0 is a CPU intensive task, then because both are
running on the same rq, p1 will cause the frequency capping to be lifted
from the rq even though p1, which is allowed to run at any performance point,
doesn't actually need to run at that frequency.
 628
 629 5.2. UCLAMP_MAX can break PELT (util_avg) signal
 630 ------------------------------------------------
 631
PELT assumes that frequency will always increase as the signals grow to ensure
there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency
increase will be prevented, which can lead to no idle time in some
circumstances. When there's no idle time, a task will be stuck in a busy loop,
which would result in util_avg being 1024.

Combined with the max aggregation behavior illustrated below, this can lead to
unwanted frequency spikes when severely capped tasks share the rq with a small
non capped task.

As an example, if task p0, which has:
 642
 643 ::
 644
 645         p0->util_avg = 300
 646         p0->uclamp[UCLAMP_MAX] = 0
 647
 648 wakes up on an idle CPU, then it will run at min frequency (Fmin) this
 649 CPU is capable of. The max CPU frequency (Fmax) matters here as well,
 650 since it designates the shortest computational time to finish the task's
 651 work on this CPU.
 652
 653 ::
 654
 655         rq->uclamp[UCLAMP_MAX] = 0
 656
If the ratio of Fmax/Fmin is 3, then the maximum value util_avg could reach is:
 658
 659 ::
 660
 661         300 * (Fmax/Fmin) = 900
 662
which indicates the CPU will still see idle time since 900 is < 1024. The
*actual* util_avg will not be 900 though, but somewhere between 300 and 900. As
long as there's idle time, p0->util_avg updates will be off by some margin,
but not proportional to Fmax/Fmin.
 667
 668 ::
 669
 670         p0->util_avg = 300 + small_error
 671
 672 Now if the ratio of Fmax/Fmin is 4, the maximum value becomes:
 673
 674 ::
 675
 676         300 * (Fmax/Fmin) = 1200
 677
which is higher than 1024 and indicates that the CPU has no idle time. When
this happens, the *actual* util_avg will become:
 680
 681 ::
 682
 683         p0->util_avg = 1024
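
The arithmetic of the two cases above can be captured in a couple of helpers.
This is only a sketch of the worked example with hypothetical function names:
it scales the utilization by Fmax/Fmin, saturates it at 1024, and treats
anything below 1024 as leaving some idle time on the CPU.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024u

/*
 * Upper bound on util_avg for a task with utilization `util` when it is
 * forced to run at Fmin on a CPU that could otherwise reach Fmax.
 */
unsigned int util_at_fmin(unsigned int util, unsigned int fmax,
			  unsigned int fmin)
{
	unsigned long long scaled;

	assert(fmin > 0 && fmax >= fmin);
	scaled = (unsigned long long)util * fmax / fmin;

	return scaled > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE
					     : (unsigned int)scaled;
}

/* The CPU still sees idle time only while the bound stays below 1024. */
bool has_idle_time(unsigned int util, unsigned int fmax, unsigned int fmin)
{
	return util_at_fmin(util, fmax, fmin) < SCHED_CAPACITY_SCALE;
}
```

For a utilization of 300, a ratio of 3 gives a bound of 900 with idle time
left, while a ratio of 4 saturates at 1024 with no idle time, which is the
situation where util_avg stops reflecting the real demand of the task.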
 684
If task p1 wakes up on this CPU with:
 686
 687 ::
 688
 689         p1->util_avg = 200
 690         p1->uclamp[UCLAMP_MAX] = 1024
 691
then the effective UCLAMP_MAX for the CPU will be 1024 according to the max
aggregation rule. But since the capped p0 task was running and severely
throttled, the rq->util_avg will be:
 695
 696 ::
 697
 698         p0->util_avg = 1024
 699         p1->util_avg = 200
 700
 701         rq->util_avg = 1024
 702         rq->uclamp[UCLAMP_MAX] = 1024
 703
This leads to a frequency spike, since if p0 wasn't throttled we would get:
 705
 706 ::
 707
 708         p0->util_avg = 300
 709         p1->util_avg = 200
 710
 711         rq->util_avg = 500
 712
 713 and run somewhere near mid performance point of that CPU, not the Fmax we get.
 714
 715 5.3. Schedutil response time issues
 716 -----------------------------------
 717
 718 schedutil has three limitations:
 719
        1. Hardware takes non-zero time to respond to any frequency change
           request. On some platforms this can be in the order of a few ms.
        2. Non fast-switch systems require a worker deadline thread to wake up
           and perform the frequency change, which adds measurable overhead.
        3. schedutil rate_limit_us drops any requests during this rate_limit_us
           window.
 726
If a relatively small task is doing a critical job and requires a certain
performance point when it wakes up and starts running, then all these
limitations will prevent it from getting what it wants in the time scale it
expects.

This limitation is not specific to uclamp, but it will be more prevalent when
using it as we no longer gradually ramp up or down. We could easily be
jumping between frequencies depending on the order tasks wake up, and their
respective uclamp values.
 736
 737 We regard that as a limitation of the capabilities of the underlying system
 738 itself.
 739
There is room to improve the behavior of schedutil rate_limit_us, but not much
can be done for 1 or 2. They are considered hard limitations of the system.