Documentation/filesystems/idmappings.rst

   1 .. SPDX-License-Identifier: GPL-2.0
   2
   3 Idmappings
   4 ==========
   5
   6 Most filesystem developers will have encountered idmappings. They are used when
   7 reading from or writing ownership to disk, reporting ownership to userspace, or
   8 for permission checking. This document is aimed at filesystem developers that
   9 want to know how idmappings work.
  10
  11 Formal notes
  12 ------------
  13
  14 An idmapping is essentially a translation of a range of ids into another or the
  15 same range of ids. The notational convention for idmappings that is widely used
  16 in userspace is::
  17
  18  u:k:r
  19
  20 ``u`` indicates the first element in the upper idmapset ``U`` and ``k``
  21 indicates the first element in the lower idmapset ``K``. The ``r`` parameter
  22 indicates the range of the idmapping, i.e. how many ids are mapped. From now
  23 on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
  24 we're talking about an id in the upper or lower idmapset.
  25
  26 To see what this looks like in practice, let's take the following idmapping::
  27
  28  u22:k10000:r3
  29
  30 and write down the mappings it will generate::
  31
  32  u22 -> k10000
  33  u23 -> k10001
  34  u24 -> k10002
  35
  36 From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
  37 idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
  38 order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
  39 the set of all possible ids useable on a given system.
  40
  41 Looking at this mathematically briefly will help us highlight some properties
  42 that make it easier to understand how we can translate between idmappings. For
  43 example, we know that the inverse idmapping is an order isomorphism as well::
  44
  45  k10000 -> u22
  46  k10001 -> u23
  47  k10002 -> u24
  48
  49 Given that we are dealing with order isomorphisms plus the fact that we're
  50 dealing with subsets we can embed idmappings into each other, i.e. we can
  51 sensibly translate between different idmappings. For example, assume we've been
  52 given the three idmappings::
  53
  54  1. u0:k10000:r10000
  55  2. u0:k20000:r10000
  56  3. u0:k30000:r10000
  57
  58 and id ``k11000`` which has been generated by the first idmapping by mapping
  59 ``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.
  60
  61 Because we're dealing with order isomorphic subsets it is meaningful to ask
  62 what id ``k11000`` corresponds to in the second or third idmapping. The
  63 straightforward algorithm to use is to apply the inverse of the first idmapping,
  64 mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
  65 either the second or the third idmapping. The second idmapping would map
  66 ``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
  67 ``k31000``.
  68
  69 If we were given the same task for the following three idmappings::
  70
  71  1. u0:k10000:r10000
  72  2. u0:k20000:r200
  73  3. u0:k30000:r300
  74
  75 we would fail to translate as the sets aren't order isomorphic over the full
  76 range of the first idmapping anymore (however, they are order isomorphic over
  77 the full range of the second idmapping). Neither the second nor the third idmapping
  78 contains ``u1000`` in the upper idmapset ``U``. This is equivalent to not having
  79 an id mapped. We can simply say that ``u1000`` is unmapped in the second and
  80 third idmapping. The kernel represents unmapped ids as ``(uid_t)-1`` or ``(gid_t)-1`` and
  81 reports them to userspace as the overflowuid or overflowgid (65534 by default).
  82
  83 The algorithm to calculate what a given id maps to is pretty simple. First, we
  84 need to verify that the range can contain our target id. We will skip this step
  85 for simplicity. After that if we want to know what ``id`` maps to we can do
  86 simple calculations:
  87
  88 - If we want to map from left to right::
  89
  90    u:k:r
  91    id - u + k = n
  92
  93 - If we want to map from right to left::
  94
  95    u:k:r
  96    id - k + u = n
  97
  98 Instead of "left to right" we can also say "down" and instead of "right to
  99 left" we can also say "up". Obviously mapping down and up invert each other.
 100
 101 To see whether the simple formulas above work, consider the following two
 102 idmappings::
 103
 104  1. u0:k20000:r10000
 105  2. u500:k30000:r10000
 106
 107 Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
 108 want to know what id this was mapped from in the upper idmapset of the first
 109 idmapping. So we're mapping up in the first idmapping::
 110
 111  id     - k      + u  = n
 112  k21000 - k20000 + u0 = u1000
 113
 114 Now assume we are given the id ``u1100`` in the upper idmapset of the second
 115 idmapping and we want to know what this id maps down to in the lower idmapset
 116 of the second idmapping. This means we're mapping down in the second
 117 idmapping::
 118
 119  id    - u    + k      = n
 120  u1100 - u500 + k30000 = k30600
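
The two formulas, together with the range check that was skipped above, can be
modelled in a few lines of Python. This is only an illustrative sketch: the
tuple layout ``(u, k, r)`` and the helper names ``map_down()`` and ``map_up()``
are assumptions made for this example and are not kernel interfaces. An
unmapped id is represented as ``None``.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (u, k, r): illustrative model of u:k:r


def map_down(m: Idmap, uid: int) -> Optional[int]:
    """Map an id from the upper idmapset down; None means unmapped."""
    u, k, r = m
    if not u <= uid < u + r:  # the range check skipped in the text
        return None
    return uid - u + k


def map_up(m: Idmap, kid: int) -> Optional[int]:
    """Map an id from the lower idmapset up; None means unmapped."""
    u, k, r = m
    if not k <= kid < k + r:
        return None
    return kid - k + u


first = (0, 20000, 10000)     # u0:k20000:r10000
second = (500, 30000, 10000)  # u500:k30000:r10000

print(map_up(first, 21000))               # k21000 -> u1000
print(map_down(second, 1100))             # u1100  -> k30600
print(map_down((0, 20000, 200), 1000))    # u1000 is unmapped in u0:k20000:r200
```

Returning ``None`` rather than a sentinel number keeps the model from confusing
an unmapped id with a valid one.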
 121
 122 General notes
 123 -------------
 124
 125 In the context of the kernel an idmapping can be interpreted as mapping a range
 126 of userspace ids into a range of kernel ids::
 127
 128  userspace-id:kernel-id:range
 129
 130 A userspace id is always an element in the upper idmapset of an idmapping of
 131 type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
 132 idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
 133 "userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
 134 types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.
 135
 136 The kernel is mostly concerned with kernel ids. They are used when performing
 137 permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` fields.
 138 A userspace id on the other hand is an id that is reported to userspace by the
 139 kernel, or is passed by userspace to the kernel, or a raw device id that is
 140 written or read from disk.
 141
 142 Note that we are only concerned with idmappings as the kernel stores them, not
 143 how userspace would specify them.
 144
 145 For the rest of this document we will prefix all userspace ids with ``u`` and
 146 all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
 147 an idmapping will be written as ``u0:k10000:r10000``.
 148
 149 For example, within this idmapping the id ``u1000`` is an id in the upper idmapset or
 150 "userspace idmapset" starting with ``u0``. And it is mapped to ``k11000`` which is a
 151 kernel id in the lower idmapset or "kernel idmapset" starting with ``k10000``.
 152
 153 A kernel id is always created by an idmapping. Such idmappings are associated
 154 with user namespaces. Since we mainly care about how idmappings work we're not
 155 going to be concerned with how idmappings are created nor how they are used
 156 outside of the filesystem context. This is best left to an explanation of user
 157 namespaces.
 158
 159 The initial user namespace is special. It always has an idmapping of the
 160 following form::
 161
 162  u0:k0:r4294967295
 163
 164 which is an identity idmapping over the full range of ids available on this
 165 system.
 166
 167 Other user namespaces usually have non-identity idmappings such as::
 168
 169  u0:k10000:r10000
 170
 171 When a process creates or wants to change ownership of a file, or when the
 172 ownership of a file is read from disk by a filesystem, the userspace id is
 173 immediately translated into a kernel id according to the idmapping associated
 174 with the relevant user namespace.
 175
 176 For instance, consider a file that is stored on disk by a filesystem as being
 177 owned by ``u1000``:
 178
 179 - If a filesystem were to be mounted in the initial user namespace (as most
 180   filesystems are) then the initial idmapping will be used. As we saw this is
 181   simply the identity idmapping. This would mean id ``u1000`` read from disk
 182   would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` fields
 183   would contain ``k1000``.
 184
 185 - If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
 186   then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
 187   ``i_uid`` and ``i_gid`` would contain ``k11000``.
 188
 189 Translation algorithms
 190 ----------------------
 191
 192 We've already seen briefly that it is possible to translate between different
 193 idmappings. We'll now take a closer look at how that works.
 194
 195 Crossmapping
 196 ~~~~~~~~~~~~
 197
 198 This translation algorithm is used by the kernel in quite a few places. For
 199 example, it is used when reporting back the ownership of a file to userspace
 200 via the ``stat()`` system call family.
 201
 202 If we've been given ``k11000`` from one idmapping we can map that id up in
 203 another idmapping. In order for this to work both idmappings need to contain
 204 the same kernel id in their kernel idmapsets. For example, consider the
 205 following idmappings::
 206
 207  1. u0:k10000:r10000
 208  2. u20000:k10000:r10000
 209
 210 and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
 211 then translate ``k11000`` into a userspace id in the second idmapping using the
 212 kernel idmapset of the second idmapping::
 213
 214  /* Map the kernel id up into a userspace id in the second idmapping. */
 215  from_kuid(u20000:k10000:r10000, k11000) = u21000
 216
 217 Note how we can get back to the kernel id in the first idmapping by inverting
 218 the algorithm::
 219
 220  /* Map the userspace id down into a kernel id in the second idmapping. */
 221  make_kuid(u20000:k10000:r10000, u21000) = k11000
 222
 223  /* Map the kernel id up into a userspace id in the first idmapping. */
 224  from_kuid(u0:k10000:r10000, k11000) = u1000
 225
 226 This algorithm allows us to answer the question what userspace id a given
 227 kernel id corresponds to in a given idmapping. In order to be able to answer
 228 this question both idmappings need to contain the same kernel id in their
 229 respective kernel idmapsets.
 230
 231 For example, when the kernel reads a raw userspace id from disk it maps it down
 232 into a kernel id according to the idmapping associated with the filesystem.
 233 Let's assume the filesystem was mounted with an idmapping of
 234 ``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
 235 means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
 236 the inode's ``i_uid`` and ``i_gid`` fields.
 237
 238 When someone in userspace calls ``stat()`` or a related function to get
 239 ownership information about the file the kernel can't simply map the id back up
 240 according to the filesystem's idmapping as this would give the wrong owner if
 241 the caller is using an idmapping.
 242
 243 So the kernel will map the id back up in the idmapping of the caller. Let's
 244 assume the caller has the slightly unconventional idmapping
 245 ``u3000:k20000:r10000``; then ``k21000`` would map back up to ``u4000``.
 246 Consequently the user would see that this file is owned by ``u4000``.
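
The ``stat()`` scenario above can be replayed with a small Python model. This
is a sketch under stated assumptions: idmappings are ``(u, k, r)`` tuples, and
``map_down()``, ``map_up()`` and ``crossmap()`` are names invented for this
example, standing in for the roles that ``make_kuid()`` and ``from_kuid()``
play in the text.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (u, k, r): illustrative model of u:k:r


def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
    """Userspace id -> kernel id; None means unmapped."""
    u, k, r = m
    if uid is None or not u <= uid < u + r:
        return None
    return uid - u + k


def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
    """Kernel id -> userspace id; None means unmapped."""
    u, k, r = m
    if kid is None or not k <= kid < k + r:
        return None
    return kid - k + u


def crossmap(fs: Idmap, caller: Idmap, disk_uid: int) -> Optional[int]:
    """Map down in the filesystem's idmapping, then up in the caller's."""
    return map_up(caller, map_down(fs, disk_uid))


fs = (0, 20000, 10000)         # u0:k20000:r10000
caller = (3000, 20000, 10000)  # u3000:k20000:r10000

print(map_down(fs, 1000))          # k21000 is stored in the inode
print(crossmap(fs, caller, 1000))  # the caller sees u4000
```

The two idmappings only meet in the kernel id space, which is why both must
contain ``k21000`` for the result to be meaningful.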
 247
 248 Remapping
 249 ~~~~~~~~~
 250
 251 It is possible to translate a kernel id from one idmapping to another one via
 252 the userspace idmapset of the two idmappings. This is equivalent to remapping
 253 a kernel id.
 254
 255 Let's look at an example. We are given the following two idmappings::
 256
 257  1. u0:k10000:r10000
 258  2. u0:k20000:r10000
 259
 260 and we are given ``k11000`` in the first idmapping. In order to translate this
 261 kernel id in the first idmapping into a kernel id in the second idmapping we
 262 need to perform two steps:
 263
 264 1. Map the kernel id up into a userspace id in the first idmapping::
 265
 266     /* Map the kernel id up into a userspace id in the first idmapping. */
 267     from_kuid(u0:k10000:r10000, k11000) = u1000
 268
 269 2. Map the userspace id down into a kernel id in the second idmapping::
 270
 271     /* Map the userspace id down into a kernel id in the second idmapping. */
 272     make_kuid(u0:k20000:r10000, u1000) = k21000
 273
 274 As you can see we used the userspace idmapset in both idmappings to translate
 275 the kernel id in one idmapping to a kernel id in another idmapping.
 276
 277 This allows us to answer the question what kernel id we would need to use to
 278 get the same userspace id in another idmapping. In order to be able to answer
 279 this question both idmappings need to contain the same userspace id in their
 280 respective userspace idmapsets.
 281
 282 Note how we can easily get back to the kernel id in the first idmapping by
 283 inverting the algorithm:
 284
 285 1. Map the kernel id up into a userspace id in the second idmapping::
 286
 287     /* Map the kernel id up into a userspace id in the second idmapping. */
 288     from_kuid(u0:k20000:r10000, k21000) = u1000
 289
 290 2. Map the userspace id down into a kernel id in the first idmapping::
 291
 292     /* Map the userspace id down into a kernel id in the first idmapping. */
 293     make_kuid(u0:k10000:r10000, u1000) = k11000
 294
 295 Another way to look at this translation is to treat it as inverting one
 296 idmapping and applying another idmapping if both idmappings have the relevant
 297 userspace id mapped. This will come in handy when working with idmapped mounts.
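
As a rough illustration, remapping is the composition "map up in one
idmapping, map down in the other". The Python sketch below models this; the
``(u, k, r)`` tuples and the names ``map_down()``, ``map_up()`` and ``remap()``
are hypothetical helpers for this example, not kernel functions.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (u, k, r): illustrative model of u:k:r


def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
    """Userspace id -> kernel id; None means unmapped."""
    u, k, r = m
    if uid is None or not u <= uid < u + r:
        return None
    return uid - u + k


def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
    """Kernel id -> userspace id; None means unmapped."""
    u, k, r = m
    if kid is None or not k <= kid < k + r:
        return None
    return kid - k + u


def remap(src: Idmap, dst: Idmap, kid: int) -> Optional[int]:
    """Translate a kernel id of src into the kernel id of dst that
    corresponds to the same userspace id."""
    return map_down(dst, map_up(src, kid))


first = (0, 10000, 10000)   # u0:k10000:r10000
second = (0, 20000, 10000)  # u0:k20000:r10000

print(remap(first, second, 11000))  # k11000 -> u1000 -> k21000
print(remap(second, first, 21000))  # k21000 -> u1000 -> k11000
```

If the intermediate userspace id is not mapped in the destination idmapping,
the model returns ``None``, matching the requirement that both idmappings
contain the same userspace id.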
 298
 299 Invalid translations
 300 ~~~~~~~~~~~~~~~~~~~~
 301
 302 It is never valid to use an id in the kernel idmapset of one idmapping as the
 303 id in the userspace idmapset of another or the same idmapping. The kernel
 304 idmapset always contains ids in the kernel id space whereas the userspace
 305 idmapset always contains userspace ids. So the following translations are forbidden::
 306
 307  /* Map the userspace id down into a kernel id in the first idmapping. */
 308  make_kuid(u0:k10000:r10000, u1000) = k11000
 309
 310  /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
 311  make_kuid(u10000:k20000:r10000, k11000) = k21000
 312                                  ~~~~~~
 313
 314 and equally wrong::
 315
 316  /* Map the kernel id up into a userspace id in the first idmapping. */
 317  from_kuid(u0:k10000:r10000, k11000) = u1000
 318
 319  /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
 320  from_kuid(u20000:k0:r10000, u1000) = u21000
 321                              ~~~~~
 322
 323 Idmappings when creating filesystem objects
 324 -------------------------------------------
 325
 326 The concepts of mapping an id down or mapping an id up are expressed in the two
 327 kernel functions filesystem developers are rather familiar with and which we've
 328 already used in this document::
 329
 330  /* Map the userspace id down into a kernel id. */
 331  make_kuid(idmapping, uid)
 332
 333  /* Map the kernel id up into a userspace id. */
 334  from_kuid(idmapping, kuid)
 335
 336 We will take an abbreviated look into how idmappings figure into creating
 337 filesystem objects. For simplicity we will only look at what happens when the
 338 VFS has already completed path lookup right before it calls into the filesystem
 339 itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
 340 called. We will also assume that the directory we're creating filesystem
 341 objects in is readable and writable for everyone.
 342
 343 When creating a filesystem object the kernel will look at the caller's
 344 filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
 345 but they are exclusively used when determining file ownership which is why they
 346 are called "filesystem ids". They are usually identical to the uid and gid of
 347 the caller but can differ. We will just assume they are always identical to not
 348 get lost in too many details.
 349
 350 When the caller enters the kernel two things happen:
 351
 352 1. Map the caller's userspace ids down into kernel ids in the caller's
 353    idmapping.
 354    (To be precise, the kernel will simply look at the kernel ids stashed in the
 355    credentials of the current task but for our education we'll pretend this
 356    translation happens just in time.)
 357 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
 358    filesystem's idmapping.
 359
 360 The second step is important as regular filesystems will ultimately need to map
 361 the kernel id back up into a userspace id when writing to disk.
 362 So with the second step the kernel guarantees that a valid userspace id can be
 363 written to disk. If it can't the kernel will refuse the creation request to not
 364 even remotely risk filesystem corruption.
 365
 366 The astute reader will have realized that this is simply a variation of the
 367 crossmapping algorithm we mentioned above in a previous section. First, the
 368 kernel maps the caller's userspace id down into a kernel id according to the
 369 caller's idmapping and then maps that kernel id up according to the
 370 filesystem's idmapping.
 371
 372 Let's see some examples with caller/filesystem idmapping but without mount
 373 idmappings. This will exhibit some problems we can hit. After that we will
 374 revisit/reconsider these examples, this time using mount idmappings, to see how
 375 they can solve the problems we observed before.
 376
 377 Example 1
 378 ~~~~~~~~~
 379
 380 ::
 381
 382  caller id:            u1000
 383  caller idmapping:     u0:k0:r4294967295
 384  filesystem idmapping: u0:k0:r4294967295
 385
 386 Both the caller and the filesystem use the identity idmapping:
 387
 388 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
 389
 390     make_kuid(u0:k0:r4294967295, u1000) = k1000
 391
 392 2. Verify that the caller's kernel ids can be mapped to userspace ids in the
 393    filesystem's idmapping.
 394
 395    For this second step the kernel will call the function
 396    ``fsuidgid_has_mapping()`` which ultimately boils down to calling
 397    ``from_kuid()``::
 398
 399     from_kuid(u0:k0:r4294967295, k1000) = u1000
 400
 401 In this example both idmappings are the same so there's nothing exciting going
 402 on. Ultimately the userspace id that lands on disk will be ``u1000``.
 403
 404 Example 2
 405 ~~~~~~~~~
 406
 407 ::
 408
 409  caller id:            u1000
 410  caller idmapping:     u0:k10000:r10000
 411  filesystem idmapping: u0:k20000:r10000
 412
 413 1. Map the caller's userspace ids down into kernel ids in the caller's
 414    idmapping::
 415
 416     make_kuid(u0:k10000:r10000, u1000) = k11000
 417
 418 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
 419    filesystem's idmapping::
 420
 421     from_kuid(u0:k20000:r10000, k11000) = u-1
 422
 423 It's immediately clear that while the caller's userspace id could be
 424 successfully mapped down into kernel ids in the caller's idmapping the kernel
 425 ids could not be mapped up according to the filesystem's idmapping. So the
 426 kernel will deny this creation request.
 427
 428 Note that while this example is less common, because most filesystems can't be
 429 mounted with non-initial idmappings, this is a general problem as we can see in
 430 the next examples.
 431
 432 Example 3
 433 ~~~~~~~~~
 434
 435 ::
 436
 437  caller id:            u1000
 438  caller idmapping:     u0:k10000:r10000
 439  filesystem idmapping: u0:k0:r4294967295
 440
 441 1. Map the caller's userspace ids down into kernel ids in the caller's
 442    idmapping::
 443
 444     make_kuid(u0:k10000:r10000, u1000) = k11000
 445
 446 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
 447    filesystem's idmapping::
 448
 449     from_kuid(u0:k0:r4294967295, k11000) = u11000
 450
 451 We can see that the translation always succeeds. The userspace id that the
 452 filesystem will ultimately put to disk will always be identical to the value of
 453 the kernel id that was created in the caller's idmapping. This has mainly two
 454 consequences.
 455
 456 First, that we can't allow a caller to ultimately write to disk with another
 457 userspace id. We could only do this if we were to mount the whole filesystem
 458 with the caller's or another idmapping. But that solution is limited to a few
 459 filesystems and not very flexible. But this is a use-case that is pretty
 460 important in containerized workloads.
 461
 462 Second, the caller will usually not be able to create any files or access
 463 directories that have stricter permissions because none of the filesystem's
 464 kernel ids map up into valid userspace ids in the caller's idmapping:
 465
 466 1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::
 467
 468     make_kuid(u0:k0:r4294967295, u1000) = k1000
 469
 470 2. Map kernel ids up to userspace ids in the caller's idmapping::
 471
 472     from_kuid(u0:k10000:r10000, k1000) = u-1
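
The creation checks from Examples 1 to 3 can be replayed with a small Python
model. This is a sketch only: the ``(u, k, r)`` tuples and the helpers
``map_down()``, ``map_up()`` and ``id_written_to_disk()`` are names assumed for
this example. The real check lives in ``fsuidgid_has_mapping()`` as described
above.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (u, k, r): illustrative model of u:k:r
INITIAL = (0, 0, 4294967295)  # u0:k0:r4294967295


def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
    """Userspace id -> kernel id; None means unmapped."""
    u, k, r = m
    if uid is None or not u <= uid < u + r:
        return None
    return uid - u + k


def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
    """Kernel id -> userspace id; None means unmapped."""
    u, k, r = m
    if kid is None or not k <= kid < k + r:
        return None
    return kid - k + u


def id_written_to_disk(caller: Idmap, fs: Idmap, uid: int) -> Optional[int]:
    """Return the userspace id that would land on disk, or None if the
    creation request would be refused."""
    kid = map_down(caller, uid)  # step 1: caller's idmapping
    return map_up(fs, kid)       # step 2: filesystem's idmapping


container = (0, 10000, 10000)  # u0:k10000:r10000

print(id_written_to_disk(INITIAL, INITIAL, 1000))               # Example 1
print(id_written_to_disk(container, (0, 20000, 10000), 1000))   # Example 2
print(id_written_to_disk(container, INITIAL, 1000))             # Example 3
```

In this model Example 1 yields ``1000``, Example 2 yields ``None`` (the request
is denied) and Example 3 yields ``11000``, the numeric value of the caller's
kernel id.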
 473
 474 Example 4
 475 ~~~~~~~~~
 476
 477 ::
 478
 479  file id:              u1000
 480  caller idmapping:     u0:k10000:r10000
 481  filesystem idmapping: u0:k0:r4294967295
 482
 483 In order to report ownership to userspace the kernel uses the crossmapping
 484 algorithm introduced in a previous section:
 485
 486 1. Map the userspace id on disk down into a kernel id in the filesystem's
 487    idmapping::
 488
 489     make_kuid(u0:k0:r4294967295, u1000) = k1000
 490
 491 2. Map the kernel id up into a userspace id in the caller's idmapping::
 492
 493     from_kuid(u0:k10000:r10000, k1000) = u-1
 494
 495 The crossmapping algorithm fails in this case because the kernel id in the
 496 filesystem idmapping cannot be mapped up to a userspace id in the caller's
 497 idmapping. Thus, the kernel will report the ownership of this file as the
 498 overflowid.
 499
 500 Example 5
 501 ~~~~~~~~~
 502
 503 ::
 504
 505  file id:              u1000
 506  caller idmapping:     u0:k10000:r10000
 507  filesystem idmapping: u0:k20000:r10000
 508
 509 In order to report ownership to userspace the kernel uses the crossmapping
 510 algorithm introduced in a previous section:
 511
 512 1. Map the userspace id on disk down into a kernel id in the filesystem's
 513    idmapping::
 514
 515     make_kuid(u0:k20000:r10000, u1000) = k21000
 516
 517 2. Map the kernel id up into a userspace id in the caller's idmapping::
 518
 519     from_kuid(u0:k10000:r10000, k21000) = u-1
 520
 521 Again, the crossmapping algorithm fails in this case because the kernel id in
 522 the filesystem idmapping cannot be mapped to a userspace id in the caller's
 523 idmapping. Thus, the kernel will report the ownership of this file as the
 524 overflowid.
 525
 526 Note how in the last two examples things would be simple if the caller were
 527 using the initial idmapping. For a filesystem mounted with the initial
 528 idmapping it would be trivial. So we only consider a filesystem with an
 529 idmapping of ``u0:k20000:r10000``:
 530
 531 1. Map the userspace id on disk down into a kernel id in the filesystem's
 532    idmapping::
 533
 534     make_kuid(u0:k20000:r10000, u1000) = k21000
 535
 536 2. Map the kernel id up into a userspace id in the caller's idmapping::
 537
 538     from_kuid(u0:k0:r4294967295, k21000) = u21000
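
Examples 4 and 5 and the variation with the initial idmapping can be modelled
as follows. The sketch assumes that an unmapped result is reported as an
overflow id of 65534, which is the usual default but is configurable. The
constant ``OVERFLOW_ID`` and the helper names are made up for this example.

```python
from typing import Optional, Tuple

Idmap = Tuple[int, int, int]  # (u, k, r): illustrative model of u:k:r
INITIAL = (0, 0, 4294967295)  # u0:k0:r4294967295
OVERFLOW_ID = 65534           # assumed default overflow id


def map_down(m: Idmap, uid: Optional[int]) -> Optional[int]:
    """Userspace id -> kernel id; None means unmapped."""
    u, k, r = m
    if uid is None or not u <= uid < u + r:
        return None
    return uid - u + k


def map_up(m: Idmap, kid: Optional[int]) -> Optional[int]:
    """Kernel id -> userspace id; None means unmapped."""
    u, k, r = m
    if kid is None or not k <= kid < k + r:
        return None
    return kid - k + u


def reported_owner(fs: Idmap, caller: Idmap, disk_uid: int) -> int:
    """Crossmap the on-disk id and fall back to the overflow id."""
    uid = map_up(caller, map_down(fs, disk_uid))
    return OVERFLOW_ID if uid is None else uid


caller = (0, 10000, 10000)  # u0:k10000:r10000

print(reported_owner(INITIAL, caller, 1000))             # Example 4
print(reported_owner((0, 20000, 10000), caller, 1000))   # Example 5
print(reported_owner((0, 20000, 10000), INITIAL, 1000))  # initial caller
```

The first two calls fall back to the overflow id, while a caller in the initial
idmapping sees ``21000``.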
 539
 540 Idmappings on idmapped mounts
 541 -----------------------------
 542
 543 The examples we've seen in the previous section where the caller's idmapping
 544 and the filesystem's idmapping are incompatible cause various issues for
 545 workloads. For a more complex but common example, consider two containers
 546 started on the host. To completely prevent the two containers from affecting
 547 each other, an administrator may often use different non-overlapping idmappings
 548 for the two containers::
 549
 550  container1 idmapping:  u0:k10000:r10000
 551  container2 idmapping:  u0:k20000:r10000
 552  filesystem idmapping:  u0:k30000:r10000
 553
 554 An administrator wanting to provide easy read-write access to the following set
 555 of files::
 556
 557  dir id:       u0
 558  dir/file1 id: u1000
 559  dir/file2 id: u2000
 560
 561 to both containers currently can't.
 562
 563 Of course the administrator has the option to recursively change ownership via
 564 ``chown()``. For example, they could change ownership so that ``dir`` and all
 565 files below it can be crossmapped from the filesystem's into the container's
 566 idmapping. Let's assume they change ownership so it is compatible with the
 567 first container's idmapping::
 568
 569  dir id:       u10000
 570  dir/file1 id: u11000
 571  dir/file2 id: u12000
 572
 573 This would still leave ``dir`` rather useless to the second container. In fact,
 574 ``dir`` and all files below it would continue to appear owned by the overflowid
 575 for the second container.
 576
 577 Or consider another increasingly popular example. Some service managers such as
 578 systemd implement a concept called "portable home directories". A user may want
 579 to use their home directories on different machines where they are assigned
 580 different login userspace ids. Most users will have ``u1000`` as the login id
 581 on their machine at home and all files in their home directory will usually be
 582 owned by ``u1000``. At uni or at work they may have another login id such as
 583 ``u1125``. This makes it rather difficult to interact with their home directory
 584 on their work machine.
 585
 586 In both cases changing ownership recursively has grave implications. The most
 587 obvious one is that ownership is changed globally and permanently. In the home
 588 directory case this change in ownership would even need to happen every time the
 589 user switches from their home to their work machine. For really large sets of
 590 files this becomes increasingly costly.
 591
 592 If the user is lucky, they are dealing with a filesystem that is mountable
 593 inside user namespaces. But this would also change ownership globally and the
 594 change in ownership is tied to the lifetime of the filesystem mount, i.e. the
 595 superblock. The only way to change ownership is to completely unmount the
 596 filesystem and mount it again in another user namespace. This is usually
 597 impossible because it would mean that all users currently accessing the
 598 filesystem can't anymore. And it means that ``dir`` still can't be shared
 599 between two containers with different idmappings.
 600 But usually the user doesn't even have this option since most filesystems
 601 aren't mountable inside containers. And not having them mountable might be
 602 desirable as it doesn't require the filesystem to deal with malicious
 603 filesystem images.
 604
 605 But the use cases mentioned above and more can be handled by idmapped mounts.
 606 They allow exposing the same set of dentries with different ownership at
 607 different mounts. This is achieved by marking the mounts with a user namespace
 608 through the ``mount_setattr()`` system call. The idmapping associated with it
 609 is then used to translate from the caller's idmapping to the filesystem's
 610 idmapping and vice versa using the remapping algorithm we introduced above.
 611
 612 Idmapped mounts make it possible to change ownership in a temporary and
 613 localized way. The ownership changes are restricted to a specific mount and the
 614 ownership changes are tied to the lifetime of the mount. All other users and
 615 locations where the filesystem is exposed are unaffected.
 616
 617 Filesystems that support idmapped mounts don't have any real reason to support
 618 being mountable inside user namespaces. A filesystem could be exposed
 619 completely under an idmapped mount to get the same effect. This has the
 620 advantage that filesystems can leave the creation of the superblock to
 621 privileged users in the initial user namespace.
 622
 623 However, it is perfectly possible to combine idmapped mounts with filesystems
 624 mountable inside user namespaces. We will touch on this further below.
 625
 626 Remapping helpers
 627 ~~~~~~~~~~~~~~~~~
 628
 629 Idmapping functions were added that translate between idmappings. They make use
 630 of the remapping algorithm we've introduced earlier. We're going to look at
 631 two:
 632
 633 - ``i_uid_into_mnt()`` and ``i_gid_into_mnt()``
 634
 635   The ``i_*id_into_mnt()`` functions translate filesystem's kernel ids into
 636   kernel ids in the mount's idmapping::
 637
 638    /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
 639    from_kuid(filesystem, kid) = uid
 640
 641    /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */
 642    make_kuid(mount, uid) = kuid
 643
 644 - ``mapped_fsuid()`` and ``mapped_fsgid()``
 645
 646   The ``mapped_fs*id()`` functions translate the caller's kernel ids into
 647   kernel ids in the filesystem's idmapping. This translation is achieved by
 648   remapping the caller's kernel ids using the mount's idmapping::
 649
 650    /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */
 651    from_kuid(mount, kid) = uid
 652
 653    /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
 654    make_kuid(filesystem, uid) = kuid
 655
 656 Note that these two functions invert each other. Consider the following
 657 idmappings::
 658
 659  caller idmapping:     u0:k10000:r10000
 660  filesystem idmapping: u0:k20000:r10000
 661  mount idmapping:      u0:k10000:r10000
 662
 663 Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
 664 to ``k21000`` according to its idmapping. This is what is stored in the
 665 inode's ``i_uid`` and ``i_gid`` fields.
 666
 667 When the caller queries the ownership of this file via ``stat()`` the kernel
 668 would usually simply use the crossmapping algorithm and map the filesystem's
 669 kernel id up to a userspace id in the caller's idmapping.
 670
 671 But when the caller is accessing the file on an idmapped mount the kernel will
 672 first call ``i_uid_into_mnt()`` thereby translating the filesystem's kernel id
 673 into a kernel id in the mount's idmapping::
 674
 675  i_uid_into_mnt(k21000):
 676    /* Map the filesystem's kernel id up into a userspace id. */
 677    from_kuid(u0:k20000:r10000, k21000) = u1000
 678
 679    /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */
 680    make_kuid(u0:k10000:r10000, u1000) = k11000
 681
 682 Finally, when the kernel reports the owner to the caller it will turn the
 683 kernel id in the mount's idmapping into a userspace id in the caller's
 684 idmapping::
 685
 686   from_kuid(u0:k10000:r10000, k11000) = u1000
 687
 688 We can test whether this algorithm really works by verifying what happens when
 689 we create a new file. Let's say the user is creating a file with ``u1000``.
 690
 691 The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
 692 kernel would now apply the crossmapping, verifying that ``k11000`` can be
 693 mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
 694 be mapped up in the filesystem's idmapping directly this creation request
 695 fails.
 696
 697 But when the caller is accessing the file on an idmapped mount the kernel will
 698 first call ``mapped_fs*id()`` thereby translating the caller's kernel id into
 699 a kernel id according to the mount's idmapping::
 700
 701  mapped_fsuid(k11000):
 702     /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */
 703     from_kuid(u0:k10000:r10000, k11000) = u1000
 704
 705     /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
 706     make_kuid(u0:k20000:r10000, u1000) = k21000
 707
 708 When finally writing to disk the kernel will then map ``k21000`` up into a
 709 userspace id in the filesystem's idmapping::
 710
 711    from_kuid(u0:k20000:r10000, k21000) = u1000
 712
 713 As we can see, we end up with an invertible and therefore information
 714 preserving algorithm. A file created from ``u1000`` on an idmapped mount will
 715 also be reported as being owned by ``u1000`` and vice versa.
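
The invertibility claim can be checked mechanically. The following userspace
sketch assumes single-extent idmappings and uses hypothetical helper names
(``map_down()``, ``map_up()``, ``mnt_kid_to_fs_kid()``, ``fs_kid_to_mnt_kid()``
and ``round_trips()`` are not kernel functions). It models ``mapped_fsuid()``
and its reverse and tests whether every kernel id in the mount's idmapping
survives the trip into the filesystem's idmapping and back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_ID UINT32_MAX /* "no mapping exists" in this sketch */

struct idmap {
    uint32_t u, k, r; /* the u:k:r notation used in this document */
};

/* Models make_kuid(): userspace id down into a kernel id. */
static uint32_t map_down(struct idmap m, uint32_t uid)
{
    assert(m.r > 0);
    return (uid < m.u || uid - m.u >= m.r) ? NO_ID : m.k + (uid - m.u);
}

/* Models from_kuid(): kernel id up into a userspace id. */
static uint32_t map_up(struct idmap m, uint32_t kid)
{
    assert(m.r > 0);
    return (kid < m.k || kid - m.k >= m.r) ? NO_ID : m.u + (kid - m.k);
}

/* Models mapped_fsuid(): caller kernel id -> filesystem kernel id. */
static uint32_t mnt_kid_to_fs_kid(struct idmap mnt, struct idmap fs,
                                  uint32_t kid)
{
    uint32_t uid = map_up(mnt, kid);

    return uid == NO_ID ? NO_ID : map_down(fs, uid);
}

/* The reverse direction, as used when reporting ownership. */
static uint32_t fs_kid_to_mnt_kid(struct idmap fs, struct idmap mnt,
                                  uint32_t kid)
{
    uint32_t uid = map_up(fs, kid);

    return uid == NO_ID ? NO_ID : map_down(mnt, uid);
}

/* True if every kernel id of the mount's idmapping maps into the
 * filesystem's idmapping and back to itself. */
static bool round_trips(struct idmap mnt, struct idmap fs)
{
    for (uint32_t i = 0; i < mnt.r; i++) {
        uint32_t kid = mnt.k + i;
        uint32_t fs_kid = mnt_kid_to_fs_kid(mnt, fs, kid);

        if (fs_kid == NO_ID || fs_kid_to_mnt_kid(fs, mnt, fs_kid) != kid)
            return false;
    }
    return true;
}
```

For the idmappings used above, ``round_trips()`` holds. Without the mount's
idmapping, ``map_up()`` applied directly to ``11000`` in the filesystem's
idmapping yields ``NO_ID``, which corresponds to the failing creation request.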
 716
 717 Let's now briefly reconsider the failing examples from earlier in the context
 718 of idmapped mounts.
 719
 720 Example 2 reconsidered
 721 ~~~~~~~~~~~~~~~~~~~~~~
 722
 723 ::
 724
 725  caller id:            u1000
 726  caller idmapping:     u0:k10000:r10000
 727  filesystem idmapping: u0:k20000:r10000
 728  mount idmapping:      u0:k10000:r10000
 729
 730 When the caller is using a non-initial idmapping the common case is to attach
 731 the same idmapping to the mount. We now perform three steps:
 732
 733 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
 734
 735     make_kuid(u0:k10000:r10000, u1000) = k11000
 736
 737 2. Translate the caller's kernel id into a kernel id in the filesystem's
 738    idmapping::
 739
 740     mapped_fsuid(k11000):
 741       /* Map the kernel id up into a userspace id in the mount's idmapping. */
 742       from_kuid(u0:k10000:r10000, k11000) = u1000
 743
 744       /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
 745       make_kuid(u0:k20000:r10000, u1000) = k21000
 746
 747 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
 748    filesystem's idmapping::
 749
 750     from_kuid(u0:k20000:r10000, k21000) = u1000
 751
 752 So the ownership that lands on disk will be ``u1000``.
 753
 754 Example 3 reconsidered
 755 ~~~~~~~~~~~~~~~~~~~~~~
 756
 757 ::
 758
 759  caller id:            u1000
 760  caller idmapping:     u0:k10000:r10000
 761  filesystem idmapping: u0:k0:r4294967295
 762  mount idmapping:      u0:k10000:r10000
 763
 764 The same translation algorithm works with the third example.
 765
 766 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
 767
 768     make_kuid(u0:k10000:r10000, u1000) = k11000
 769
 770 2. Translate the caller's kernel id into a kernel id in the filesystem's
 771    idmapping::
 772
 773     mapped_fsuid(k11000):
 774        /* Map the kernel id up into a userspace id in the mount's idmapping. */
 775        from_kuid(u0:k10000:r10000, k11000) = u1000
 776
 777        /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
 778        make_kuid(u0:k0:r4294967295, u1000) = k1000
 779
 780 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
 781    filesystem's idmapping::
 782
 783     from_kuid(u0:k0:r4294967295, k1000) = u1000
 784
 785 So the ownership that lands on disk will be ``u1000``.
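
The three creation steps of Examples 2 and 3 can be written as one function.
This is a userspace sketch, not kernel code: it assumes single-extent
idmappings, and ``owner_on_disk()``, ``map_down()``, ``map_up()`` and ``NO_ID``
are names invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NO_ID UINT32_MAX /* "no mapping exists" in this sketch */

struct idmap {
    uint32_t u, k, r; /* the u:k:r notation used in this document */
};

/* Models make_kuid(): userspace id down into a kernel id. */
static uint32_t map_down(struct idmap m, uint32_t uid)
{
    assert(m.r > 0);
    return (uid < m.u || uid - m.u >= m.r) ? NO_ID : m.k + (uid - m.u);
}

/* Models from_kuid(): kernel id up into a userspace id. */
static uint32_t map_up(struct idmap m, uint32_t kid)
{
    assert(m.r > 0);
    return (kid < m.k || kid - m.k >= m.r) ? NO_ID : m.u + (kid - m.k);
}

/* Returns the userspace id that lands on disk when a caller with
 * userspace id @uid creates a file, or NO_ID if a step has no mapping. */
static uint32_t owner_on_disk(struct idmap caller, struct idmap mnt,
                              struct idmap fs, uint32_t uid)
{
    /* 1. Caller's userspace id down into the caller's idmapping. */
    uint32_t kid = map_down(caller, uid);
    if (kid == NO_ID)
        return NO_ID;

    /* 2. Models mapped_fsuid(): up in the mount's idmapping,
     *    then down in the filesystem's idmapping. */
    uint32_t mnt_uid = map_up(mnt, kid);
    if (mnt_uid == NO_ID)
        return NO_ID;
    kid = map_down(fs, mnt_uid);
    if (kid == NO_ID)
        return NO_ID;

    /* 3. Up in the filesystem's idmapping: the id written to disk. */
    return map_up(fs, kid);
}
```

Called with the idmappings of Example 2 or Example 3 and the caller id
``1000``, ``owner_on_disk()`` returns ``1000`` in both cases. Only the
intermediate kernel id differs: ``21000`` in Example 2 and ``1000`` in
Example 3.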
 786
 787 Example 4 reconsidered
 788 ~~~~~~~~~~~~~~~~~~~~~~
 789
 790 ::
 791
 792  file id:              u1000
 793  caller idmapping:     u0:k10000:r10000
 794  filesystem idmapping: u0:k0:r4294967295
 795  mount idmapping:      u0:k10000:r10000
 796
 797 In order to report ownership to userspace the kernel now performs three steps
 798 the translation algorithm we introduced earlier:
 799
 800 1. Map the userspace id on disk down into a kernel id in the filesystem's
 801    idmapping::
 802
 803     make_kuid(u0:k0:r4294967295, u1000) = k1000
 804
 805 2. Translate the kernel id into a kernel id in the mount's idmapping::
 806
 807     i_uid_into_mnt(k1000):
 808       /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
 809       from_kuid(u0:k0:r4294967295, k1000) = u1000
 810
 811       /* Map the userspace id down into a kernel id in the mount's idmapping. */
 812       make_kuid(u0:k10000:r10000, u1000) = k11000
 813
 814 3. Map the kernel id up into a userspace id in the caller's idmapping::
 815
 816     from_kuid(u0:k10000:r10000, k11000) = u1000
 817
 818 Earlier, the file's kernel id couldn't be crossmapped into the caller's
 819 idmapping. With the idmapped mount in place it can now be crossmapped into the
 820 caller's idmapping via the mount's idmapping. The file is now reported as owned
 821 by ``u1000`` according to the mount's idmapping.
 822
 823 Example 5 reconsidered
 824 ~~~~~~~~~~~~~~~~~~~~~~
 825
 826 ::
 827
 828  file id:              u1000
 829  caller idmapping:     u0:k10000:r10000
 830  filesystem idmapping: u0:k20000:r10000
 831  mount idmapping:      u0:k10000:r10000
 832
 833 Again, in order to report ownership to userspace the kernel now performs three
 834 steps using the translation algorithm we introduced earlier:
 835
 836 1. Map the userspace id on disk down into a kernel id in the filesystem's
 837    idmapping::
 838
 839     make_kuid(u0:k20000:r10000, u1000) = k21000
 840
 841 2. Translate the kernel id into a kernel id in the mount's idmapping::
 842
 843     i_uid_into_mnt(k21000):
 844       /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
 845       from_kuid(u0:k20000:r10000, k21000) = u1000
 846
 847       /* Map the userspace id down into a kernel id in the mount's idmapping. */
 848       make_kuid(u0:k10000:r10000, u1000) = k11000
 849
 850 3. Map the kernel id up into a userspace id in the caller's idmapping::
 851
 852     from_kuid(u0:k10000:r10000, k11000) = u1000
 853
 854 Earlier, the file's kernel id couldn't be crossmapped into the caller's
 855 idmapping. With the idmapped mount in place it can now be crossmapped into the
 856 caller's idmapping via the mount's idmapping. The file is now owned by
 857 ``u1000`` according to the mount's idmapping.
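
The three reporting steps of Examples 4 and 5 mirror the creation path. The
sketch below is again a userspace model under the assumption of single-extent
idmappings; ``reported_owner()``, ``map_down()``, ``map_up()`` and ``NO_ID``
are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define NO_ID UINT32_MAX /* "no mapping exists" in this sketch */

struct idmap {
    uint32_t u, k, r; /* the u:k:r notation used in this document */
};

/* Models make_kuid(): userspace id down into a kernel id. */
static uint32_t map_down(struct idmap m, uint32_t uid)
{
    assert(m.r > 0);
    return (uid < m.u || uid - m.u >= m.r) ? NO_ID : m.k + (uid - m.u);
}

/* Models from_kuid(): kernel id up into a userspace id. */
static uint32_t map_up(struct idmap m, uint32_t kid)
{
    assert(m.r > 0);
    return (kid < m.k || kid - m.k >= m.r) ? NO_ID : m.u + (kid - m.k);
}

/* Returns the userspace id the caller sees for a file stored with
 * @disk_uid, or NO_ID if a step has no mapping. */
static uint32_t reported_owner(struct idmap caller, struct idmap mnt,
                               struct idmap fs, uint32_t disk_uid)
{
    /* 1. On-disk userspace id down into the filesystem's idmapping. */
    uint32_t kid = map_down(fs, disk_uid);
    if (kid == NO_ID)
        return NO_ID;

    /* 2. Models i_uid_into_mnt(): up in the filesystem's idmapping,
     *    then down in the mount's idmapping. */
    uint32_t fs_uid = map_up(fs, kid);
    if (fs_uid == NO_ID)
        return NO_ID;
    kid = map_down(mnt, fs_uid);
    if (kid == NO_ID)
        return NO_ID;

    /* 3. Up in the caller's idmapping: the id reported to userspace. */
    return map_up(caller, kid);
}
```

With the idmappings of Example 4 or Example 5, ``reported_owner()`` returns
``1000`` for a file stored as ``1000``. Skipping step 2, that is mapping the
filesystem's kernel id up in the caller's idmapping directly, yields ``NO_ID``
in both examples.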
 858
 859 Changing ownership on a home directory
 860 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 861
 862 We've seen above how idmapped mounts can be used to translate between
 863 idmappings when either the caller, the filesystem or both use a non-initial
 864 idmapping. A wide range of usecases exists when the caller is using
 865 a non-initial idmapping. This mostly happens in the context of containerized
 866 workloads. The consequence, as we have seen, is that for both filesystems
 867 mounted with the initial idmapping and filesystems mounted with non-initial
 868 idmappings, access to the filesystem fails because the kernel ids can't
 869 be crossmapped between the caller's and the filesystem's idmapping.
 870
 871 As we've seen above idmapped mounts provide a solution to this by remapping the
 872 caller's or filesystem's idmapping according to the mount's idmapping.
 873
 874 Aside from containerized workloads, idmapped mounts have the advantage that
 875 they also work when both the caller and the filesystem use the initial
 876 idmapping which means users on the host can change the ownership of directories
 877 and files on a per-mount basis.
 878
 879 Consider our previous example where a user has their home directory on portable
 880 storage. At home they have id ``u1000`` and all files in their home directory
 881 are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.
 882
 883 Taking their home directory with them becomes problematic. They can't easily
 884 access their files, they might not be able to write to disk without applying
 885 lax permissions or ACLs and even if they can, they will end up with an annoying
 886 mix of files and directories owned by ``u1000`` and ``u1125``.
 887
 888 Idmapped mounts solve this problem. A user can create an idmapped
 889 mount for their home directory on their work computer or their computer at home
 890 depending on what ownership they would prefer to end up on the portable storage
 891 itself.
 892
 893 Let's assume they want all files on disk to belong to ``u1000``. When the user
 894 plugs in their portable storage at their workstation they can set up a job that
 895 creates an idmapped mount with the minimal idmapping ``u1000:k1125:r1``. So now
 896 when they create a file the kernel performs the following steps we already know
 897 from above::
 898
 899  caller id:            u1125
 900  caller idmapping:     u0:k0:r4294967295
 901  filesystem idmapping: u0:k0:r4294967295
 902  mount idmapping:      u1000:k1125:r1
 903
 904 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
 905
 906     make_kuid(u0:k0:r4294967295, u1125) = k1125
 907
 908 2. Translate the caller's kernel id into a kernel id in the filesystem's
 909    idmapping::
 910
 911     mapped_fsuid(k1125):
 912       /* Map the kernel id up into a userspace id in the mount's idmapping. */
 913       from_kuid(u1000:k1125:r1, k1125) = u1000
 914
 915       /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
 916       make_kuid(u0:k0:r4294967295, u1000) = k1000
 917
 918 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
 919    filesystem's idmapping::
 920
 921     from_kuid(u0:k0:r4294967295, k1000) = u1000
 922
 923 So ultimately the file will be created with ``u1000`` on disk.
 924
 925 Now let's briefly look at what ownership the caller with id ``u1125`` will see
 926 on their work computer:
 927
 928 ::
 929
 930  file id:              u1000
 931  caller idmapping:     u0:k0:r4294967295
 932  filesystem idmapping: u0:k0:r4294967295
 933  mount idmapping:      u1000:k1125:r1
 934
 935 1. Map the userspace id on disk down into a kernel id in the filesystem's
 936    idmapping::
 937
 938     make_kuid(u0:k0:r4294967295, u1000) = k1000
 939
 940 2. Translate the kernel id into a kernel id in the mount's idmapping::
 941
 942     i_uid_into_mnt(k1000):
 943       /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
 944       from_kuid(u0:k0:r4294967295, k1000) = u1000
 945
 946       /* Map the userspace id down into a kernel id in the mount's idmapping. */
 947       make_kuid(u1000:k1125:r1, u1000) = k1125
 948
 949 3. Map the kernel id up into a userspace id in the caller's idmapping::
 950
 951     from_kuid(u0:k0:r4294967295, k1125) = u1125
 952
 953 Ultimately the kernel reports to the caller that the file belongs to ``u1125``
 954 which is the caller's userspace id on their workstation in our example.
 955
 956 The raw userspace id that is put on disk is ``u1000`` so when the user takes
 957 their home directory back to their home computer where they are assigned
 958 ``u1000`` using the initial idmapping and mount the filesystem with the initial
 959 idmapping they will see all those files owned by ``u1000``.
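
The whole home directory scenario, creation and reporting on the work computer
plus reporting on the home computer, fits into one small userspace model. The
sketch assumes single-extent idmappings and treats a mount without an
idmapping as if it carried the initial idmapping. ``struct mount_ctx``,
``created_as()``, ``seen_as()``, ``map_down()``, ``map_up()`` and ``NO_ID`` are
hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NO_ID UINT32_MAX /* "no mapping exists" in this sketch */

struct idmap {
    uint32_t u, k, r; /* the u:k:r notation used in this document */
};

/* The three idmappings involved in one access. */
struct mount_ctx {
    struct idmap caller;
    struct idmap mnt;
    struct idmap fs;
};

/* Models make_kuid(); NO_ID is passed through so calls can be chained. */
static uint32_t map_down(struct idmap m, uint32_t uid)
{
    assert(m.r > 0);
    if (uid == NO_ID || uid < m.u || uid - m.u >= m.r)
        return NO_ID;
    return m.k + (uid - m.u);
}

/* Models from_kuid(); NO_ID is passed through so calls can be chained. */
static uint32_t map_up(struct idmap m, uint32_t kid)
{
    assert(m.r > 0);
    if (kid == NO_ID || kid < m.k || kid - m.k >= m.r)
        return NO_ID;
    return m.u + (kid - m.k);
}

/* Userspace id written to disk when a caller with @uid creates a file. */
static uint32_t created_as(const struct mount_ctx *c, uint32_t uid)
{
    uint32_t fs_kid = map_down(c->fs,
                               map_up(c->mnt, map_down(c->caller, uid)));

    return map_up(c->fs, fs_kid);
}

/* Userspace id reported to the caller for a file stored as @disk_uid. */
static uint32_t seen_as(const struct mount_ctx *c, uint32_t disk_uid)
{
    uint32_t mnt_kid = map_down(c->mnt,
                                map_up(c->fs, map_down(c->fs, disk_uid)));

    return map_up(c->caller, mnt_kid);
}
```

With the mount idmapping ``u1000:k1125:r1`` on the work computer,
``created_as()`` maps the caller id ``1125`` to ``1000`` on disk and
``seen_as()`` maps ``1000`` on disk back to ``1125``. With the initial
idmapping everywhere, as on the home computer, ``seen_as()`` returns ``1000``.
Ids outside the mount's single-id extent have no mapping in this sketch and
yield ``NO_ID``.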