Documentation/networking/fib_trie.rst

   1 .. SPDX-License-Identifier: GPL-2.0
   2
   3 ============================
   4 LC-trie implementation notes
   5 ============================
   6
   7 Node types
   8 ----------
   9 leaf
  10         An end node with data. This has a copy of the relevant key, along
  11         with 'hlist' with routing table entries sorted by prefix length.
  12         See struct leaf and struct leaf_info.
  13
  14 trie node or tnode
  15         An internal node, holding an array of child (leaf or tnode) pointers,
  16         indexed through a subset of the key. See Level Compression.
  17
  18 A few concepts explained
  19 ------------------------
  20 Bits (tnode)
  21         The number of bits in the key segment used for indexing into the
  22         child array - the "child index". See Level Compression.
  23
  24 Pos (tnode)
  25         The position (in the key) of the key segment used for indexing into
  26         the child array. See Path Compression.
  27
  28 Path Compression / skipped bits
  29         Any given tnode is linked to from the child array of its parent, using
  30         a segment of the key specified by the parent's "pos" and "bits"
  31         In certain cases, this tnode's own "pos" will not be immediately
  32         adjacent to the parent (pos+bits), but there will be some bits
  33         in the key skipped over because they represent a single path with no
  34         deviations. These "skipped bits" constitute Path Compression.
  35         Note that the search algorithm will simply skip over these bits when
  36         searching, making it necessary to save the keys in the leaves to
  37         verify that they actually do match the key we are searching for.
  38
  39 Level Compression / child arrays
  40         the trie is kept level balanced moving, under certain conditions, the
  41         children of a full child (see "full_children") up one level, so that
  42         instead of a pure binary tree, each internal node ("tnode") may
  43         contain an arbitrarily large array of links to several children.
  44         Conversely, a tnode with a mostly empty child array (see empty_children)
  45         may be "halved", having some of its children moved downwards one level,
  46         in order to avoid ever-increasing child arrays.
  47
  48 empty_children
  49         the number of positions in the child array of a given tnode that are
  50         NULL.
  51
  52 full_children
  53         the number of children of a given tnode that aren't path compressed.
  54         (in other words, they aren't NULL or leaves and their "pos" is equal
  55         to this tnode's "pos"+"bits").
  56
  57         (The word "full" here is used more in the sense of "complete" than
  58         as the opposite of "empty", which might be a tad confusing.)
  59
  60 Comments
  61 ---------
  62
  63 We have tried to keep the structure of the code as close to fib_hash as
  64 possible to allow verification and help up reviewing.
  65
  66 fib_find_node()
  67         A good start for understanding this code. This function implements a
  68         straightforward trie lookup.
  69
  70 fib_insert_node()
  71         Inserts a new leaf node in the trie. This is bit more complicated than
  72         fib_find_node(). Inserting a new node means we might have to run the
  73         level compression algorithm on part of the trie.
  74
  75 trie_leaf_remove()
  76         Looks up a key, deletes it and runs the level compression algorithm.
  77
  78 trie_rebalance()
  79         The key function for the dynamic trie after any change in the trie
  80         it is run to optimize and reorganize. It will walk the trie upwards
  81         towards the root from a given tnode, doing a resize() at each step
  82         to implement level compression.
  83
  84 resize()
  85         Analyzes a tnode and optimizes the child array size by either inflating
  86         or shrinking it repeatedly until it fulfills the criteria for optimal
  87         level compression. This part follows the original paper pretty closely
  88         and there may be some room for experimentation here.
  89
  90 inflate()
  91         Doubles the size of the child array within a tnode. Used by resize().
  92
  93 halve()
  94         Halves the size of the child array within a tnode - the inverse of
  95         inflate(). Used by resize();
  96
  97 fn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
  98         The route manipulation functions. Should conform pretty closely to the
  99         corresponding functions in fib_hash.
 100
 101 fn_trie_flush()
 102         This walks the full trie (using nextleaf()) and searches for empty
 103         leaves which have to be removed.
 104
 105 fn_trie_dump()
 106         Dumps the routing table ordered by prefix length. This is somewhat
 107         slower than the corresponding fib_hash function, as we have to walk the
 108         entire trie for each prefix length. In comparison, fib_hash is organized
 109         as one "zone"/hash per prefix length.
 110
 111 Locking
 112 -------
 113
 114 fib_lock is used for an RW-lock in the same way that this is done in fib_hash.
 115 However, the functions are somewhat separated for other possible locking
 116 scenarios. It might conceivably be possible to run trie_rebalance via RCU
 117 to avoid read_lock in the fn_trie_lookup() function.
 118
 119 Main lookup mechanism
 120 ---------------------
 121 fn_trie_lookup() is the main lookup function.
 122
 123 The lookup is in its simplest form just like fib_find_node(). We descend the
 124 trie, key segment by key segment, until we find a leaf. check_leaf() does
 125 the fib_semantic_match in the leaf's sorted prefix hlist.
 126
 127 If we find a match, we are done.
 128
 129 If we don't find a match, we enter prefix matching mode. The prefix length,
 130 starting out at the same as the key length, is reduced one step at a time,
 131 and we backtrack upwards through the trie trying to find a longest matching
 132 prefix. The goal is always to reach a leaf and get a positive result from the
 133 fib_semantic_match mechanism.
 134
 135 Inside each tnode, the search for longest matching prefix consists of searching
 136 through the child array, chopping off (zeroing) the least significant "1" of
 137 the child index until we find a match or the child index consists of nothing but
 138 zeros.
 139
 140 At this point we backtrack (t->stats.backtrack++) up the trie, continuing to
 141 chop off part of the key in order to find the longest matching prefix.
 142
 143 At this point we will repeatedly descend subtries to look for a match, and there
 144 are some optimizations available that can provide us with "shortcuts" to avoid
 145 descending into dead ends. Look for "HL_OPTIMIZE" sections in the code.
 146
 147 To alleviate any doubts about the correctness of the route selection process,
 148 a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
 149 gives userland access to fib_lookup().