Documentation/filesystems/xfs-self-describing-metadata.rst

   1 .. SPDX-License-Identifier: GPL-2.0
   2
   3 ============================
   4 XFS Self Describing Metadata
   5 ============================
   6
   7 Introduction
   8 ============
   9
  10 The largest scalability problem facing XFS is not one of algorithmic
  11 scalability, but of verification of the filesystem structure. Scalability of the
  12 structures and indexes on disk and the algorithms for iterating them are
  13 adequate for supporting PB scale filesystems with billions of inodes; however, it
  14 is this very scalability that causes the verification problem.
  15
  16 Almost all metadata on XFS is dynamically allocated. The only fixed location
  17 metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
  18 other metadata structures need to be discovered by walking the filesystem
  19 structure in different ways. While this is already done by userspace tools for
  20 validating and repairing the structure, there are limits to what they can
  21 verify, and this in turn limits the supportable size of an XFS filesystem.
  22
  23 For example, it is entirely possible to manually use xfs_db and a bit of
  24 scripting to analyse the structure of a 100TB filesystem when trying to
  25 determine the root cause of a corruption problem, but it is still mainly a
  26 manual task of verifying that things like single bit errors or misplaced writes
  27 weren't the ultimate cause of a corruption event. It may take a few hours to a
  28 few days to perform such forensic analysis, so at this scale root cause
  29 analysis is entirely possible.
  30
  31 However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
  32 to analyse and so that analysis blows out towards weeks/months of forensic work.
  33 Most of the analysis work is slow and tedious, so as the amount of analysis goes
  34 up, it becomes more likely that the cause will be lost in the noise. Hence the
  35 primary concern for supporting PB scale filesystems is minimising the time and
  36 effort required for basic forensic analysis of the filesystem structure.
  37
  38
  39 Self Describing Metadata
  40 ========================
  41
  42 One of the problems with the current metadata format is that apart from the
  43 magic number in the metadata block, we have no other way of identifying what it
  44 is supposed to be. We can't even identify if it is in the right place. Put simply,
  45 you can't look at a single metadata block in isolation and say "yes, it is
  46 supposed to be there and the contents are valid".
  47
  48 Hence most of the time spent on forensic analysis is spent doing basic
  49 verification of metadata values, looking for values that are in range (and hence
  50 not detected by automated verification checks) but are not correct. Finding and
  51 understanding things like cross linked block lists (e.g. sibling pointers in a
  52 btree that end up with loops in them) is the key to understanding what went
  53 wrong, but it is impossible to tell after the fact what order the blocks were
  54 linked into each other or written to disk.
  55
  56 Hence we need to record more information into the metadata to allow us to
  57 quickly determine if the metadata is intact and can be ignored for the purpose
  58 of analysis. We can't protect against every possible type of error, but we can
  59 ensure that common types of errors are easily detectable.  Hence the concept of
  60 self describing metadata.
  61
  62 The first, fundamental requirement of self describing metadata is that the
  63 metadata object contains some form of unique identifier in a well known
  64 location. This allows us to identify the expected contents of the block and
  65 hence parse and verify the metadata object. If we can't independently identify
  66 the type of metadata in the object, then the metadata doesn't describe itself
  67 very well at all!
  68
  69 Luckily, almost all XFS metadata has magic numbers embedded already - only the
  70 AGFL, remote symlinks and remote attribute blocks do not contain identifying
  71 magic numbers. Hence we can change the on-disk format of all these objects to
  72 add more identifying information and detect this simply by changing the magic
  73 numbers in the metadata objects. That is, if it has the current magic number,
  74 the metadata isn't self identifying. If it contains a new magic number, it is
  75 self identifying and we can do much more expansive automated verification of the
  76 metadata object at runtime, during forensic analysis or repair.
  77
  78 As a primary concern, self describing metadata needs some form of overall
  79 integrity checking. We cannot trust the metadata if we cannot verify that it has
  80 not been changed as a result of external influences. Hence we need some form of
  81 integrity check, and this is done by adding CRC32c validation to the metadata
  82 block. If we can verify the block contains the metadata it was intended to
  83 contain, a large amount of the manual verification work can be skipped.
  84
  85 CRC32c was selected as metadata cannot be more than 64k in length in XFS and
  86 hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
  87 metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
  88 fast. So while CRC32c is not the strongest of possible integrity checks that
  89 could be used, it is more than sufficient for our needs and has relatively
  90 little overhead. Adding support for larger integrity fields and/or algorithms
  91 does not really provide any extra value over CRC32c, but it does add a lot of
  92 complexity and so there is no provision for changing the integrity checking
  93 mechanism.
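To make the integrity check concrete, the following is a minimal userspace sketch of a CRC32c calculation. It is a bit-at-a-time version written for readability, not the accelerated code the kernel uses, and the function name ``crc32c_bitwise`` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected (bit-reversed) form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/*
 * Bit-at-a-time CRC32c over len bytes. Slow but short; real code would
 * use a table-driven or hardware-accelerated implementation.
 */
static uint32_t crc32c_bitwise(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;         /* standard initial value */

    assert(p != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY_REFLECTED : 0u);
    }
    return crc ^ 0xFFFFFFFFu;           /* standard final inversion */
}
```

For the conventional check input "123456789" this returns 0xE3069283, and flipping any single bit of a block changes the result, which is the property the metadata verifiers rely on.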
  94
  95 Self describing metadata needs to contain enough information so that the
  96 metadata block can be verified as being in the correct place without needing to
  97 look at any other metadata. This means it needs to contain location information.
  98 Just adding a block number to the metadata is not sufficient to protect against
  99 mis-directed writes - a write might be misdirected to the wrong LUN and so be
 100 written to the "correct block" of the wrong filesystem. Hence location
 101 information must contain a filesystem identifier as well as a block number.
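The location check described above can be sketched in userspace as follows. The structure and function names (``struct loc_hdr``, ``location_matches``) are hypothetical and exist only for this illustration; the point is that both the filesystem identifier and the block number must match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of the location fields of a metadata header. */
struct loc_hdr {
    uint8_t  uuid[16];      /* filesystem identifier */
    uint64_t blkno;         /* block address the metadata was written for */
};

/*
 * Return true only if the block claims to belong to this filesystem AND
 * to the address it was actually read from.
 */
static bool location_matches(const struct loc_hdr *hdr,
                             const uint8_t fs_uuid[16], uint64_t read_blkno)
{
    assert(hdr != NULL && fs_uuid != NULL);
    if (memcmp(hdr->uuid, fs_uuid, 16) != 0)
        return false;       /* possibly the "correct block" of another fs */
    return hdr->blkno == read_blkno;
}
```

A block number on its own would accept a write that landed at the same address on the wrong LUN; comparing the UUID first rejects that case.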
 102
 103 Another key information point in forensic analysis is knowing who the metadata
 104 block belongs to. We already know the type, the location, whether it is valid
 105 or corrupted, and how long ago it was last modified. Knowing the owner
 106 of the block is important as it allows us to find other related metadata to
 107 determine the scope of the corruption. For example, if we have an extent btree
 108 object, we don't know what inode it belongs to and hence have to walk the entire
 109 filesystem to find the owner of the block. Worse, the corruption could mean that
 110 no owner can be found (i.e. it's an orphan block), and so without an owner field
 111 in the metadata we have no idea of the scope of the corruption. If we have an
 112 owner field in the metadata object, we can immediately do top down validation to
 113 determine the scope of the problem.
 114
 115 Different types of metadata have different owner identifiers. For example,
 116 directory, attribute and extent tree blocks are all owned by an inode, while
 117 freespace btree blocks are owned by an allocation group. Hence the size and
 118 contents of the owner field are determined by the type of metadata object we are
 119 looking at.  The owner information can also identify misplaced writes (e.g.
 120 freespace btree block written to the wrong AG).
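As a rough illustration of type-dependent ownership, the sketch below maps a few metadata types to the kind of owner they carry and shows the misplaced-write check for a free space btree block. The enum names and the field widths (4 bytes for an AG number, 8 bytes for an inode number) are assumptions made for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical metadata classes; the real set of types is larger. */
enum meta_type {
    META_DIR_BLOCK,
    META_ATTR_BLOCK,
    META_EXTENT_BTREE,
    META_FREESPACE_BTREE,
};

enum owner_kind { OWNER_INODE, OWNER_AG };

/* Map a metadata type to the kind of object that owns it. */
static enum owner_kind owner_kind_of(enum meta_type type)
{
    switch (type) {
    case META_FREESPACE_BTREE:
        return OWNER_AG;
    case META_DIR_BLOCK:
    case META_ATTR_BLOCK:
    case META_EXTENT_BTREE:
        return OWNER_INODE;
    }
    assert(!"unknown metadata type");
    return OWNER_INODE;
}

/* Assumed owner field width in bytes for each kind of owner. */
static unsigned int owner_width(enum owner_kind kind)
{
    return kind == OWNER_AG ? 4 : 8;
}

/* A free space btree block found outside the AG that owns it is misplaced. */
static bool freespace_block_misplaced(uint32_t owner_agno, uint32_t found_agno)
{
    return owner_agno != found_agno;
}
```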
 121
 122 Self describing metadata also needs to contain some indication of when it was
 123 written to the filesystem. One of the key information points when doing forensic
 124 analysis is how recently the block was modified. Correlation of a set of corrupted
 125 metadata blocks based on modification times is important as it can indicate
 126 whether the corruptions are related, whether there have been multiple corruption
 127 events that led to the eventual failure, and even whether there are corruptions
 128 present that the run-time verification is not detecting.
 129
 130 For example, if a metadata object is still referenced by its owner, we can
 131 determine whether it is supposed to be free space or still allocated by comparing
 132 when the free space btree block that contains the block was last written with
 133 when the metadata object itself was last written. If the free space
 134 block is more recent than the object and the object's owner, then there is a
 135 very good chance that the block should have been removed from the owner.
 136
 137 To provide this "written timestamp", each metadata block has the Log Sequence
 138 Number (LSN) of the most recent transaction that modified it written into it.
 139 This number will always increase over the life of the filesystem, and the only
 140 thing that resets it is running xfs_repair on the filesystem. Further, by use of
 141 the LSN we can tell if the corrupted metadata all belonged to the same log
 142 checkpoint and hence have some idea of how much modification occurred between
 143 the first and last instance of corrupt metadata on disk and, further, how much
 144 modification occurred between the corruption being written and when it was
 145 detected.
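The way LSNs are used during analysis can be sketched as below. This treats the LSN as an opaque, ever-increasing 64-bit value and assumes that blocks written in the same log checkpoint carry the same LSN; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;     /* hypothetical name; treated as an opaque counter */

/* Return <0, 0 or >0 if a was written before, together with, or after b. */
static int lsn_cmp(lsn_t a, lsn_t b)
{
    return (a > b) - (a < b);
}

/*
 * Heuristic from the text: if the free space btree block was written more
 * recently than both the object and its owner, the owner's reference to
 * the block is probably stale.
 */
static bool owner_reference_looks_stale(lsn_t freesp_lsn, lsn_t object_lsn,
                                        lsn_t owner_lsn)
{
    return lsn_cmp(freesp_lsn, object_lsn) > 0 &&
           lsn_cmp(freesp_lsn, owner_lsn) > 0;
}

/* True if every corrupt block carries the same LSN (same checkpoint). */
static bool all_same_lsn(const lsn_t *lsns, size_t n)
{
    assert(lsns != NULL && n > 0);
    for (size_t i = 1; i < n; i++)
        if (lsns[i] != lsns[0])
            return false;
    return true;
}
```

Sorting the corrupt blocks with ``lsn_cmp`` gives the order in which they were last written, which is the correlation the preceding paragraphs rely on.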
 146
 147 Runtime Validation
 148 ==================
 149
 150 Validation of self-describing metadata takes place at runtime in two places:
 151
 152         - immediately after a successful read from disk
 153         - immediately prior to write IO submission
 154
 155 The verification is completely stateless - it is done independently of the
 156 modification process, and seeks only to check that the metadata is what it says
 157 it is and that the metadata fields are within bounds and internally consistent.
 158 As such, we cannot catch all types of corruption that can occur within a block
 159 as there may be certain limitations that operational state enforces on the
 160 metadata, or there may be corruption of interblock relationships (e.g. corrupted
 161 sibling pointer lists). Hence we still need stateful checking in the main code
 162 body, but in general most of the per-field validation is handled by the
 163 verifiers.
 164
 165 For read verification, the caller needs to specify the expected type of metadata
 166 that it should see, and the IO completion process verifies that the metadata
 167 object matches what was expected. If the verification process fails, then it
 168 marks the object being read as EFSCORRUPTED. The caller needs to catch this
 169 error (same as for IO errors), and if it needs to take special action due to a
 170 verification error it can do so by catching the EFSCORRUPTED error value. If we
 171 need more discrimination of error type at higher levels, we can define new
 172 error numbers for different errors as necessary.
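The read-side contract can be modelled in userspace roughly as follows: the caller names the verifier it expects, and IO completion records a corruption error on the buffer if the contents do not match. Everything here (``struct sketch_buf``, ``read_complete``, the magic value and the error constant) is a simplified stand-in, not the kernel's buffer API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_EFSCORRUPTED = 1 };       /* placeholder error value */
#define SKETCH_FOO_MAGIC 0x464f4f31u    /* "FOO1", made up for the example */

struct sketch_buf;

/* Per-type verifier, chosen by the caller that issues the read. */
struct sketch_buf_ops {
    const char *name;
    bool (*verify)(const struct sketch_buf *bp);
};

struct sketch_buf {
    const uint8_t *data;
    size_t len;
    int error;                          /* 0 or SKETCH_EFSCORRUPTED */
    const struct sketch_buf_ops *ops;
};

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Minimal "foo" verifier: only the magic number is checked here. */
static bool foo_verify(const struct sketch_buf *bp)
{
    return bp->len >= 4 && get_be32(bp->data) == SKETCH_FOO_MAGIC;
}

static const struct sketch_buf_ops foo_ops = { "foo", foo_verify };

/* Models IO completion: run the expected verifier, record any failure. */
static int read_complete(struct sketch_buf *bp,
                         const struct sketch_buf_ops *expected)
{
    assert(bp != NULL && expected != NULL);
    bp->ops = expected;
    bp->error = expected->verify(bp) ? 0 : SKETCH_EFSCORRUPTED;
    return bp->error;
}
```

The caller then handles a non-zero ``error`` the same way it would handle an IO error, and can test for the corruption value specifically if it needs to react differently.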
 173
 174 The first step in read verification is checking the magic number and determining
 175 whether CRC validation is necessary. If it is, the CRC32c is calculated and
 176 compared against the value stored in the object itself. Once this is validated,
 177 further checks are made against the location information, followed by extensive
 178 object specific metadata validation. If any of these checks fail, then the
 179 buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
 180
 181 Write verification is the opposite of the read verification - first the object
 182 is extensively verified and if it is OK we then update the LSN from the last
 183 modification made to the object. After this, we calculate the CRC and insert it
 184 into the object. Once this is done the write IO is allowed to continue. If any
 185 error occurs during this process, the buffer is again marked with an EFSCORRUPTED
 186 error for the higher layers to catch.
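The ordering of the write and read paths can be demonstrated with a small userspace model. The header layout (byte offsets), the magic value and all function names below are assumptions for this sketch. The CRC is computed with the CRC field itself counted as zero, which is one common way to checksum a block that stores its own checksum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout for this sketch: byte offsets in the block. */
#define HDR_MAGIC_OFF   0       /* 4 bytes */
#define HDR_CRC_OFF     4       /* 4 bytes */
#define HDR_LSN_OFF     8       /* 8 bytes */
#define HDR_SIZE        16
#define SKETCH_MAGIC    0x464f4f31u     /* "FOO1", made up */

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void put_be32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (24 - 8 * i));
}

static void put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Feed len bytes into a running (uninverted) bitwise CRC32c. */
static uint32_t crc32c_feed(uint32_t crc, const uint8_t *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    }
    return crc;
}

/* CRC32c of the whole block with the CRC field itself counted as zeroes. */
static uint32_t block_cksum(const uint8_t *blk, size_t len)
{
    static const uint8_t zero[4];
    uint32_t crc = 0xFFFFFFFFu;

    crc = crc32c_feed(crc, blk, HDR_CRC_OFF);
    crc = crc32c_feed(crc, zero, sizeof(zero));
    crc = crc32c_feed(crc, blk + HDR_CRC_OFF + 4, len - HDR_CRC_OFF - 4);
    return crc ^ 0xFFFFFFFFu;
}

/* Stand-in for the object specific checks. */
static bool contents_ok(const uint8_t *blk, size_t len)
{
    return len >= HDR_SIZE && get_be32(blk + HDR_MAGIC_OFF) == SKETCH_MAGIC;
}

/* Write side: verify contents, then stamp the LSN, then compute the CRC. */
static bool write_verify(uint8_t *blk, size_t len, uint64_t lsn)
{
    assert(blk != NULL);
    if (!contents_ok(blk, len))
        return false;           /* caller would mark the buffer corrupt */
    put_be64(blk + HDR_LSN_OFF, lsn);
    put_be32(blk + HDR_CRC_OFF, block_cksum(blk, len));
    return true;
}

/* Read side, opposite order: check the CRC first, then the contents. */
static bool read_verify(const uint8_t *blk, size_t len)
{
    assert(blk != NULL);
    if (len < HDR_SIZE || get_be32(blk + HDR_CRC_OFF) != block_cksum(blk, len))
        return false;
    return contents_ok(blk, len);
}
```

The CRC has to be the last thing written because the LSN is part of the checksummed area; stamping the LSN afterwards would invalidate the checksum.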
 187
 188 Structures
 189 ==========
 190
 191 A typical on-disk structure needs to contain the following information::
 192
 193     struct xfs_ondisk_hdr {
 194             __be32  magic;              /* magic number */
 195             __be32  crc;                /* CRC, not logged */
 196             uuid_t  uuid;               /* filesystem identifier */
 197             __be64  owner;              /* parent object */
 198             __be64  blkno;              /* location on disk */
 199             __be64  lsn;                /* last modification in log, not logged */
 200     };
 201
 202 Depending on the metadata, this information may be part of a header structure
 203 separate to the metadata contents, or may be distributed through an existing
 204 structure. The latter occurs with metadata that already contains some of this
 205 information, such as the superblock and AG headers.
 206
 207 Other metadata may have different formats for the information, but the same
 208 level of information is generally provided. For example:
 209
 210         - short btree blocks have a 32 bit owner (ag number) and a 32 bit block
 211           number for location. The two of these combined provide the same
  212           information as @owner and @blkno in the above structure, but using 8
 213           bytes less space on disk.
 214
 215         - directory/attribute node blocks have a 16 bit magic number, and the
 216           header that contains the magic number has other information in it as
  217           well. Hence the additional metadata headers change the overall format
 218           of the metadata.
 219
 220 A typical buffer read verifier is structured as follows::
 221
 222     #define XFS_FOO_CRC_OFF             offsetof(struct xfs_ondisk_hdr, crc)
 223
 224     static void
 225     xfs_foo_read_verify(
 226             struct xfs_buf      *bp)
 227     {
  228             struct xfs_mount    *mp = bp->b_mount;
 229
 230             if ((xfs_sb_version_hascrc(&mp->m_sb) &&
 231                 !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
 232                                             XFS_FOO_CRC_OFF)) ||
 233                 !xfs_foo_verify(bp)) {
 234                     XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
 235                     xfs_buf_ioerror(bp, EFSCORRUPTED);
 236             }
 237     }
 238
 239 The code ensures that the CRC is only checked if the filesystem has CRCs enabled
  240 by checking the superblock for the feature bit, and then if the CRC verifies OK
 241 (or is not needed) it verifies the actual contents of the block.
 242
 243 The verifier function will take a couple of different forms, depending on
 244 whether the magic number can be used to determine the format of the block. In
 245 the case it can't, the code is structured as follows::
 246
 247     static bool
 248     xfs_foo_verify(
 249             struct xfs_buf              *bp)
 250     {
 251             struct xfs_mount    *mp = bp->b_mount;
 252             struct xfs_ondisk_hdr       *hdr = bp->b_addr;
 253
 254             if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
 255                     return false;
 256
  257             if (xfs_sb_version_hascrc(&mp->m_sb)) {
 258                     if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
 259                             return false;
 260                     if (bp->b_bn != be64_to_cpu(hdr->blkno))
 261                             return false;
 262                     if (hdr->owner == 0)
 263                             return false;
 264             }
 265
 266             /* object specific verification checks here */
 267
 268             return true;
 269     }
 270
 271 If there are different magic numbers for the different formats, the verifier
 272 will look like::
 273
 274     static bool
 275     xfs_foo_verify(
 276             struct xfs_buf              *bp)
 277     {
 278             struct xfs_mount    *mp = bp->b_mount;
 279             struct xfs_ondisk_hdr       *hdr = bp->b_addr;
 280
 281             if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
 282                     if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
 283                             return false;
 284                     if (bp->b_bn != be64_to_cpu(hdr->blkno))
 285                             return false;
 286                     if (hdr->owner == 0)
 287                             return false;
 288             } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
 289                     return false;
 290
 291             /* object specific verification checks here */
 292
 293             return true;
 294     }
 295
  296 Write verifiers are very similar to the read verifiers; they just do things in
  297 the opposite order. A typical write verifier::
 298
 299     static void
 300     xfs_foo_write_verify(
 301             struct xfs_buf      *bp)
 302     {
 303             struct xfs_mount    *mp = bp->b_mount;
 304             struct xfs_buf_log_item     *bip = bp->b_fspriv;
 305
 306             if (!xfs_foo_verify(bp)) {
 307                     XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
 308                     xfs_buf_ioerror(bp, EFSCORRUPTED);
 309                     return;
 310             }
 311
 312             if (!xfs_sb_version_hascrc(&mp->m_sb))
 313                     return;
 314
 315
 316             if (bip) {
 317                     struct xfs_ondisk_hdr       *hdr = bp->b_addr;
 318                     hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
 319             }
 320             xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
 321     }
 322
 323 This will verify the internal structure of the metadata before we go any
 324 further, detecting corruptions that have occurred as the metadata has been
 325 modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
 326 update the LSN field (when it was last modified) and calculate the CRC on the
 327 metadata. Once this is done, we can issue the IO.
 328
 329 Inodes and Dquots
 330 =================
 331
 332 Inodes and dquots are special snowflakes. They have per-object CRC and
 333 self-identifiers, but they are packed so that there are multiple objects per
 334 buffer. Hence we do not use per-buffer verifiers to do the work of per-object
 335 verification and CRC calculations. The per-buffer verifiers simply perform basic
 336 identification of the buffer - that they contain inodes or dquots, and that
 337 there are magic numbers in all the expected spots. All further CRC and
 338 verification checks are done when each inode is read from or written back to the
 339 buffer.
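The per-buffer identification step for packed objects can be sketched as a loop over fixed-size slots, checking only the magic number in each. The function name, the slot size passed in by the caller and the use of a 16 bit big-endian magic at the start of each object are assumptions for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_OBJ_MAGIC 0x494eu    /* "IN"; used here purely as an example */

/* Read a big-endian 16 bit value. */
static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)(((unsigned int)p[0] << 8) | p[1]);
}

/*
 * Per-buffer check: the buffer holds nobjs objects of objsize bytes each,
 * and every one must start with the expected magic number. Returns the
 * index of the first object that does not, or -1 if all of them match.
 * CRC and field validation are left to the per-object code.
 */
static int first_bad_object(const uint8_t *buf, size_t objsize, size_t nobjs)
{
    assert(buf != NULL && objsize >= 2);
    for (size_t i = 0; i < nobjs; i++) {
        if (get_be16(buf + i * objsize) != SKETCH_OBJ_MAGIC)
            return (int)i;
    }
    return -1;
}
```

Keeping the buffer-level check this shallow means that one damaged object does not have to be fully parsed just to decide whether the buffer contains the right kind of metadata.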
 340
  341 The structure of the verifiers and the identifier checks is very similar to the
 342 buffer code described above. The only difference is where they are called. For
 343 example, inode read verification is done in xfs_inode_from_disk() when the inode
 344 is first read out of the buffer and the struct xfs_inode is instantiated. The
  345 inode is already extensively verified during writeback in xfs_iflush_int(), so the
 346 only addition here is to add the LSN and CRC to the inode as it is copied back
 347 into the buffer.
 348
 349 XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
 350 the unlinked list modifications check or update CRCs, neither during unlink nor
 351 log recovery. So, it's gone unnoticed until now. This won't matter immediately -
 352 repair will probably complain about it - but it needs to be fixed.