.. SPDX-License-Identifier: GPL-2.0

=================================
Network Filesystem Helper Library
=================================

.. Contents:

 - Overview.
 - Per-inode context.
   - Inode context helper functions.
 - Buffered read helpers.
   - Read helper functions.
   - Read helper structures.
   - Read helper operations.
   - Read helper procedure.
   - Read helper cache API.


Overview
========

The network filesystem helper library is a set of functions designed to aid a
network filesystem in implementing VM/VFS operations.  For the moment, that
just includes turning various VM buffered read operations into requests to read
from the server.  The helper library, however, can also interpose other
services, such as local caching or local data encryption.

Note that the library module doesn't link against local caching directly, so
access must be provided by the netfs.


Per-Inode Context
=================

The network filesystem helper library needs a place to store a bit of state for
its use on each netfs inode it is helping to manage.  To this end, a context
structure is defined::

        struct netfs_inode {
                struct inode inode;
                const struct netfs_request_ops *ops;
                struct fscache_cookie *cache;
        };

A network filesystem that wants to use netfslib must place one of these in its
inode wrapper struct instead of the VFS ``struct inode``.  This can be done in
a way similar to the following::

        struct my_inode {
                struct netfs_inode netfs; /* Netfslib context and vfs inode */
                ...
        };

This allows netfslib to find its state by using ``container_of()`` from the
inode pointer, thereby allowing the netfslib helper functions to be pointed to
directly by the VFS/VM operation tables.

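The ``container_of()`` arrangement described above can be illustrated with a
minimal userspace sketch.  This is not kernel code: ``struct inode`` is reduced
to a stub, the ``cache`` member is left out, and ``my_state``,
``inode_to_netfs()`` and ``inode_to_my()`` are hypothetical names invented for
the example:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the layout matters here. */
struct inode {
        unsigned long i_ino;
};
struct netfs_request_ops;

struct netfs_inode {
        struct inode inode;                     /* The VFS inode */
        const struct netfs_request_ops *ops;    /* Ops supplied by the netfs */
};

/* Hypothetical filesystem inode wrapper embedding the netfslib context. */
struct my_inode {
        struct netfs_inode netfs;       /* Netfslib context and vfs inode */
        int my_state;                   /* Hypothetical private field */
};

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the netfslib context from a bare VFS inode pointer. */
static struct netfs_inode *inode_to_netfs(struct inode *inode)
{
        assert(inode != NULL);
        return container_of(inode, struct netfs_inode, inode);
}

/* Recover the filesystem's own wrapper from the same VFS inode pointer. */
static struct my_inode *inode_to_my(struct inode *inode)
{
        return container_of(inode_to_netfs(inode), struct my_inode, netfs);
}
```

Because the VFS inode is embedded inside the context, which is in turn embedded
inside the wrapper, both conversions are pure pointer arithmetic and need no
lookup table.
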
The structure contains the following fields:

 * ``inode``

   The VFS inode structure.

 * ``ops``

   The set of operations provided by the network filesystem to netfslib.

 * ``cache``

   Local caching cookie, or NULL if no caching is enabled.  This field does not
   exist if fscache is disabled.


Inode Context Helper Functions
------------------------------

To help deal with the per-inode context, a number of helper functions are
provided.  Firstly, a function to perform basic initialisation on a context and
set the operations table pointer::

        void netfs_inode_init(struct netfs_inode *ctx,
                              const struct netfs_request_ops *ops);

then a function to cast from the VFS inode structure to the netfs context::

        struct netfs_inode *netfs_inode(struct inode *inode);

and finally, a function to get the cache cookie pointer from the context
attached to an inode (or NULL if fscache is disabled)::

        struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);


Buffered Read Helpers
=====================

The library provides a set of read helpers that handle the ->read_folio(),
->readahead() and much of the ->write_begin() VM operations and translate them
into a common call framework.

The following services are provided:

 * Handle folios that span multiple pages.

 * Insulate the netfs from VM interface changes.

 * Allow the netfs to arbitrarily split reads up into pieces, even ones that
   don't match folio sizes or folio alignments and that may cross folios.

 * Allow the netfs to expand a readahead request in both directions to meet its
   needs.

 * Allow the netfs to partially fulfil a read, which will then be resubmitted.

 * Handle local caching, allowing cached data and server-read data to be
   interleaved for a single request.

 * Handle clearing of bufferage that isn't on the server.

 * Handle retrying of reads that failed, switching reads from the cache to the
   server as necessary.

 * In the future, this is a place where other services can be performed, such
   as local encryption of data to be stored remotely or in the cache.

From the network filesystem, the helpers require a table of operations.  This
includes a mandatory method to issue a read operation along with a number of
optional methods.


Read Helper Functions
---------------------

Three read helpers are provided::

        void netfs_readahead(struct readahead_control *ractl);
        int netfs_read_folio(struct file *file,
                             struct folio *folio);
        int netfs_write_begin(struct netfs_inode *ctx,
                              struct file *file,
                              struct address_space *mapping,
                              loff_t pos,
                              unsigned int len,
                              struct folio **_folio,
                              void **_fsdata);

Each corresponds to a VM address space operation.  These operations use the
state in the per-inode context.

For ->readahead() and ->read_folio(), the network filesystem just points
directly at the corresponding read helper; whereas for ->write_begin(), it may
be a little more complicated as the network filesystem might want to flush
conflicting writes or track dirty data and needs to put the acquired folio if
an error occurs after calling the helper.

The helpers manage the read request, calling back into the network filesystem
through the supplied table of operations.  Waits will be performed as
necessary before returning for helpers that are meant to be synchronous.

If an error occurs, ->free_request() will be called to clean up the allocated
netfs_io_request struct.  If some parts of the request are in progress when an
error occurs, the request will get partially completed if sufficient data is
read.

Additionally, there is::

        void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
                                     ssize_t transferred_or_error,
                                     bool was_async);

which should be called to complete a read subrequest.  This is given the number
of bytes transferred or a negative error code, plus a flag indicating whether
the operation was asynchronous (i.e. whether the follow-on processing can be
done in the current context, given this may involve sleeping).


Read Helper Structures
----------------------

The read helpers make use of a couple of structures to maintain the state of
the read.  The first is a structure that manages a read request as a whole::

        struct netfs_io_request {
                struct inode            *inode;
                struct address_space    *mapping;
                struct netfs_cache_resources cache_resources;
                void                    *netfs_priv;
                loff_t                  start;
                size_t                  len;
                loff_t                  i_size;
                const struct netfs_request_ops *netfs_ops;
                unsigned int            debug_id;
                ...
        };

The above fields are the ones the netfs can use.  They are:

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from.  The mapping
   may or may not point to inode->i_data.

 * ``cache_resources``

   Resources for the local cache to use, if present.

 * ``netfs_priv``

   The network filesystem's private data.  The value for this can be passed in
   to the helper functions or set during the request.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length.  These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``netfs_ops``

   A pointer to the operation table.  The value for this is passed into the
   helper functions.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.


The second structure is used to manage individual slices of the overall read
request::

        struct netfs_io_subrequest {
                struct netfs_io_request *rreq;
                loff_t                  start;
                size_t                  len;
                size_t                  transferred;
                unsigned long           flags;
                unsigned short          debug_index;
                ...
        };

Each subrequest is expected to access a single source, though the helpers will
handle falling back from one source type to another.  The members are:

 * ``rreq``

   A pointer to the read request.

 * ``start``
 * ``len``

   The file position of the start of this slice of the read request and the
   length.

 * ``transferred``

   The amount of data transferred so far of the length of this slice.  The
   network filesystem or cache should start the operation this far into the
   slice.  If a short read occurs, the helpers will call again, having updated
   this to reflect the amount read so far.

 * ``flags``

   Flags pertaining to the read.  There are two of interest to the filesystem
   or cache:

   * ``NETFS_SREQ_CLEAR_TAIL``

     This can be set to indicate that the remainder of the slice, from
     transferred to len, should be cleared.

   * ``NETFS_SREQ_SEEK_DATA_READ``

     This is a hint to the cache that it might want to try skipping ahead to
     the next data (i.e. using SEEK_DATA).

 * ``debug_index``

   A number allocated to this slice that can be displayed in trace lines for
   reference.


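The interaction between ``transferred``, short reads and
``NETFS_SREQ_CLEAR_TAIL`` described above can be modelled in userspace roughly
as follows.  This is only a sketch under stated assumptions, not the library's
implementation: ``struct subreq`` is a cut-down stand-in for the real
structure, ``SREQ_CLEAR_TAIL`` stands in for the real flag and is assumed to
have been set in advance, and ``source_read()`` and ``complete_slice()`` are
hypothetical helpers:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SREQ_CLEAR_TAIL 0x1UL   /* Stand-in for NETFS_SREQ_CLEAR_TAIL */

/* Cut-down stand-in for struct netfs_io_subrequest. */
struct subreq {
        long long       start;          /* File position of the slice */
        size_t          len;            /* Length of the slice */
        size_t          transferred;    /* Bytes obtained so far */
        unsigned long   flags;
};

/*
 * Hypothetical source holding src_size bytes that hands back at most max_io
 * bytes per call.  The read begins at start + transferred, so a reissued
 * slice carries on from where the previous attempt stopped.
 */
static size_t source_read(const struct subreq *s, const char *src,
                          size_t src_size, size_t max_io, char *buf)
{
        size_t pos = (size_t)s->start + s->transferred;
        size_t n = pos < src_size ? src_size - pos : 0;

        assert(s->transferred <= s->len);
        if (n > s->len - s->transferred)
                n = s->len - s->transferred;
        if (n > max_io)
                n = max_io;
        if (n)
                memcpy(buf + s->transferred, src + pos, n);
        return n;
}

/*
 * Reissue a slice for as long as each attempt makes progress.  If an attempt
 * transfers nothing, clear the tail when the flag is set, else fail.
 * Returns 0 once the slice is complete or -1 on failure.
 */
static int complete_slice(struct subreq *s, const char *src,
                          size_t src_size, size_t max_io, char *buf)
{
        while (s->transferred < s->len) {
                size_t n = source_read(s, src, src_size, max_io, buf);

                if (n == 0) {
                        if (!(s->flags & SREQ_CLEAR_TAIL))
                                return -1;
                        memset(buf + s->transferred, 0,
                               s->len - s->transferred);
                        s->transferred = s->len;
                        break;
                }
                s->transferred += n;
        }
        return 0;
}
```

In this model, an 8-byte slice starting at offset 2 of a 7-byte source with
the flag set ends up holding the five available bytes followed by three zeroed
bytes.
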
Read Helper Operations
----------------------

The network filesystem must provide the read helpers with a table of operations
through which it can issue requests and negotiate::

        struct netfs_request_ops {
                void (*init_request)(struct netfs_io_request *rreq, struct file *file);
                void (*free_request)(struct netfs_io_request *rreq);
                int (*begin_cache_operation)(struct netfs_io_request *rreq);
                void (*expand_readahead)(struct netfs_io_request *rreq);
                bool (*clamp_length)(struct netfs_io_subrequest *subreq);
                void (*issue_read)(struct netfs_io_subrequest *subreq);
                bool (*is_still_valid)(struct netfs_io_request *rreq);
                int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
                                         struct folio *folio, void **_fsdata);
                void (*done)(struct netfs_io_request *rreq);
        };

The operations are as follows:

 * ``init_request()``

   [Optional] This is called to initialise the request structure.  It is given
   the file for reference.

 * ``free_request()``

   [Optional] This is called as the request is being deallocated so that the
   filesystem can clean up any state it has attached there.

 * ``begin_cache_operation()``

   [Optional] This is called to ask the network filesystem to call into the
   cache (if present) to initialise the caching state for this read.  The netfs
   library module cannot access the cache directly, so the method should call
   something like fscache_begin_read_operation() to do this.

   The cache gets to store its state in ->cache_resources and must set a table
   of operations of its own there (though of a different type).

   This should return 0 on success and an error code otherwise.  If an error is
   reported, the operation may proceed anyway, just without local caching (only
   out of memory and interruption errors cause failure here).

 * ``expand_readahead()``

   [Optional] This is called to allow the filesystem to expand the size of a
   readahead read request.  The filesystem gets to expand the request in both
   directions, though it's not permitted to reduce it as the numbers may
   represent an allocation already made.  If local caching is enabled, it gets
   to expand the request first.

   Expansion is communicated by changing ->start and ->len in the request
   structure.  Note that if any change is made, ->len must be increased by at
   least as much as ->start is reduced.

 * ``clamp_length()``

   [Optional] This is called to allow the filesystem to reduce the size of a
   subrequest.  The filesystem can use this, for example, to chop up a request
   that has to be split across multiple servers or to put multiple reads in
   flight.

   As the method is declared to return ``bool``, this should return true on
   success and false on error.

 * ``issue_read()``

   [Required] The helpers use this to dispatch a subrequest to the server for
   reading.  In the subrequest, ->start, ->len and ->transferred indicate what
   data should be read from the server.

   There is no return value; the netfs_subreq_terminated() function should be
   called to indicate whether or not the operation succeeded and how much data
   it transferred.  The filesystem also should not deal with setting folios
   uptodate, unlocking them or dropping their refs - the helpers need to deal
   with this as they have to coordinate with copying to the local cache.

   Note that the helpers have the folios locked, but not pinned.  It is
   possible to use the ITER_XARRAY iov iterator to refer to the range of the
   inode that is being operated upon without the need to allocate large bvec
   tables.

 * ``is_still_valid()``

   [Optional] This is called to find out if the data just read from the local
   cache is still valid.  It should return true if it is still valid and false
   if not.  If it's not still valid, it will be reread from the server.

 * ``check_write_begin()``

   [Optional] This is called from the netfs_write_begin() helper once it has
   allocated/grabbed the folio to be modified to allow the filesystem to flush
   conflicting state before allowing it to be modified.

   It should return 0 if everything is now fine, -EAGAIN if the folio should be
   regrabbed and any other error code to abort the operation.

 * ``done()``

   [Optional] This is called after the folios in the request have all been
   unlocked (and marked uptodate if applicable).


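The rule for ``expand_readahead()`` given above, that ->len must grow by at
least as much as ->start shrinks, can be sketched as plain arithmetic.  The
following userspace model is illustrative only: ``struct rreq_range`` stands in
for the two fields of the request, and ``expand_to_unit()`` and its
power-of-two ``unit`` parameter are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the ->start and ->len fields of struct netfs_io_request. */
struct rreq_range {
        long long       start;
        size_t          len;
};

/*
 * Expand a range outwards to a multiple of unit, which is assumed to be a
 * power of two.  The start is rounded down and the end is rounded up, so the
 * original range is always still covered: the length grows by at least the
 * amount the start moved back and the request is never reduced.
 */
static void expand_to_unit(struct rreq_range *r, size_t unit)
{
        long long old_start = r->start;
        long long old_end = r->start + (long long)r->len;
        long long mask = (long long)unit - 1;
        long long new_start = old_start & ~mask;
        long long new_end = (old_end + mask) & ~mask;

        r->start = new_start;
        r->len = (size_t)(new_end - new_start);

        /* The two properties the text above asks for. */
        assert(r->start <= old_start);
        assert(r->start + (long long)r->len >= old_end);
}
```

For instance, a request for 3000 bytes at position 5000 expanded to 4096-byte
units becomes 4096 bytes at position 4096, whereas an already aligned request
is left unchanged.
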
Read Helper Procedure
---------------------

The read helpers work by the following general procedure:

 * Set up the request.

 * For readahead, allow the local cache and then the network filesystem to
   propose expansions to the read request.  This is then proposed to the VM.
   If the VM cannot fully perform the expansion, a partially expanded read will
   be performed, though this may not get written to the cache in its entirety.

 * Loop around slicing chunks off of the request to form subrequests:

   * If a local cache is present, it gets to do the slicing, otherwise the
     helpers just try to generate maximal slices.

   * The network filesystem gets to clamp the size of each slice if it is to be
     the source.  This allows rsize and chunking to be implemented.

   * The helpers issue a read from the cache or a read from the server or just
     clear the slice as appropriate.

   * The next slice begins at the end of the last one.

   * As slices finish being read, they terminate.

 * When all the subrequests have terminated, the subrequests are assessed and
   any that are short or have failed are reissued:

   * Failed cache requests are issued against the server instead.

   * Failed server requests just fail.

   * Short reads against either source will be reissued against that source
     provided they have transferred some more data:

     * The cache may need to skip holes that it can't do DIO from.

     * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
       end of the slice instead of reissuing.

 * Once the data is read, the folios that have been fully read/cleared:

   * Will be marked uptodate.

   * If a cache is present, will be marked with PG_fscache.

   * Will be unlocked.

 * Any folios that need writing to the cache will then have DIO writes issued.

 * Synchronous operations will wait for reading to be complete.

 * Writes to the cache will proceed asynchronously and the folios will have the
   PG_fscache mark removed when that completes.

 * The request structures will be cleaned up when everything has completed.


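The slicing loop in the procedure above, with the network filesystem clamping
each slice, can be sketched in userspace as shown below.  This is a simplified
model rather than the helpers' actual code: it ignores the cache and source
selection entirely, and ``struct slice``, ``MAX_SLICES``, ``clamp_to_rsize()``
and ``slice_request()`` are all hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLICES 16   /* Arbitrary table size for the sketch */

/* Position and length of one subrequest in this model. */
struct slice {
        long long       start;
        size_t          len;
};

/* Hypothetical clamp: no slice may exceed the filesystem's rsize. */
static size_t clamp_to_rsize(size_t len, size_t rsize)
{
        return len > rsize ? rsize : len;
}

/*
 * Slice up the range [start, start + len).  Each slice begins where the last
 * one ended.  Returns the number of slices produced, or -1 if rsize is zero
 * or the table is too small.
 */
static int slice_request(long long start, size_t len, size_t rsize,
                         struct slice *out)
{
        int n = 0;

        if (rsize == 0)
                return -1;

        while (len > 0) {
                size_t part;

                if (n == MAX_SLICES)
                        return -1;

                part = clamp_to_rsize(len, rsize);
                assert(part > 0 && part <= len);  /* Guarantees progress */

                out[n].start = start;
                out[n].len = part;
                n++;

                start += part;
                len -= part;
        }
        return n;
}
```

With an rsize of 4096, a 10000-byte request at position 0 is cut into two
4096-byte slices and a final 1808-byte slice.
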
Read Helper Cache API
---------------------

When implementing a local cache to be used by the read helpers, two things are
required: some way for the network filesystem to initialise the caching for a
read request and a table of operations for the helpers to call.

The network filesystem's ->begin_cache_operation() method is called to set up a
cache and this must call into the cache to do the work.  If using fscache, for
example, the method would call::

        int fscache_begin_read_operation(struct netfs_io_request *rreq,
                                         struct fscache_cookie *cookie);

passing in the request pointer and the cookie corresponding to the file.

The netfs_io_request object contains a place for the cache to hang its
state::

        struct netfs_cache_resources {
                const struct netfs_cache_ops    *ops;
                void                            *cache_priv;
                void                            *cache_priv2;
        };

This contains an operations table pointer and two private pointers.  The
operation table looks like the following::

        struct netfs_cache_ops {
                void (*end_operation)(struct netfs_cache_resources *cres);

                void (*expand_readahead)(struct netfs_cache_resources *cres,
                                         loff_t *_start, size_t *_len, loff_t i_size);

                enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
                                                     loff_t i_size);

                int (*read)(struct netfs_cache_resources *cres,
                            loff_t start_pos,
                            struct iov_iter *iter,
                            bool seek_data,
                            netfs_io_terminated_t term_func,
                            void *term_func_priv);

                int (*prepare_write)(struct netfs_cache_resources *cres,
                                     loff_t *_start, size_t *_len, loff_t i_size,
                                     bool no_space_allocated_yet);

                int (*write)(struct netfs_cache_resources *cres,
                             loff_t start_pos,
                             struct iov_iter *iter,
                             netfs_io_terminated_t term_func,
                             void *term_func_priv);

                int (*query_occupancy)(struct netfs_cache_resources *cres,
                                       loff_t start, size_t len, size_t granularity,
                                       loff_t *_data_start, size_t *_data_len);
        };

With a termination handler function pointer::

        typedef void (*netfs_io_terminated_t)(void *priv,
                                              ssize_t transferred_or_error,
                                              bool was_async);

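To illustrate how a termination handler of this type is used, here is a small
userspace sketch in which a read reports its outcome only through the handler.
It is a model built on assumptions rather than a real cache backend:
``struct mem_cache``, ``mem_cache_read()`` and ``record_result()`` are
hypothetical, a plain buffer replaces the iterator, and passing ``false`` for
``was_async`` is assumed to be the right value for a completion that happens
synchronously in the caller's context:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>          /* ssize_t */

typedef void (*netfs_io_terminated_t)(void *priv,
                                      ssize_t transferred_or_error,
                                      bool was_async);

/* Hypothetical in-memory "cache" object. */
struct mem_cache {
        const char      *data;
        size_t          size;
};

/*
 * Read up to len bytes from start_pos.  The result is not returned; it is
 * passed to term_func as a byte count or a negative error code.
 */
static void mem_cache_read(const struct mem_cache *c, long long start_pos,
                           char *buf, size_t len,
                           netfs_io_terminated_t term_func,
                           void *term_func_priv)
{
        ssize_t ret;

        assert(term_func != NULL);

        if (start_pos < 0 || (size_t)start_pos >= c->size) {
                ret = -ENODATA;
        } else {
                size_t n = c->size - (size_t)start_pos;

                if (n > len)
                        n = len;
                memcpy(buf, c->data + start_pos, n);
                ret = (ssize_t)n;
        }

        /* Completed in the caller's context, so not asynchronous. */
        term_func(term_func_priv, ret, false);
}

/* Example handler: store the result where the private pointer points. */
static void record_result(void *priv, ssize_t transferred_or_error,
                          bool was_async)
{
        (void)was_async;
        *(ssize_t *)priv = transferred_or_error;
}
```

A caller would pass ``record_result`` together with a pointer to an
``ssize_t`` and inspect that variable afterwards; an asynchronous backend
would instead invoke the handler from its completion path.
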
The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a netfs_readahead() operation to allow
   the cache to expand a request in either direction.  This allows the cache to
   size the request appropriately for the cache granularity.

   The function is passed pointers to the start and length in its parameters,
   plus the size of the file for reference, and adjusts the start and length
   appropriately.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request.  ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.

   It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should be
   downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache.  The start file offset is given
   along with an iterator to read to, which gives the length also.  It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function.  The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``prepare_write()``

   [Required] Called to prepare a write to the cache to take place.  This
   involves checking to see whether the cache has sufficient space to honour
   the write.  ``*_start`` and ``*_len`` indicate the region to be written; the
   region can be shrunk or it can be expanded to a page boundary either way as
   necessary to align for direct I/O.  i_size holds the size of the object and
   is provided for reference.  no_space_allocated_yet is set to true if the
   caller is certain that no data has been written to that region - for example
   if it tried to do a read from there already.

 * ``write()``

   [Required] Called to write to the cache.  The start file offset is given
   along with an iterator to write from, which gives the length also.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function.  The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``query_occupancy()``

   [Required] Called to find out where the next piece of data is within a
   particular region of the cache.  The start and length of the region to be
   queried are passed in, along with the granularity to which the answer needs
   to be aligned.  The function passes back the start and length of the data,
   if any, available within that region.  Note that there may be a hole at the
   front.

   It returns 0 if some data was found, -ENODATA if there was no usable data
   within the region or -ENOBUFS if there is no caching on this file.

Note that these methods are passed a pointer to the cache resource structure,
not the read request structure as they could be used in other situations where
there isn't a read request structure as well, such as writing dirty data to the
cache.


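The contract described above for ``query_occupancy()`` can be sketched against
a simple in-memory extent table.  This is a userspace model under several
assumptions, not how any real cache backend works: ``struct extent`` and the
sorted, non-overlapping table are hypothetical, offsets are assumed to be
non-negative, alignment is interpreted as trimming the answer inwards to the
granularity, and the -ENOBUFS case is not modelled:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical record of one contiguous run of cached data. */
struct extent {
        long long       start;
        size_t          len;
};

/*
 * Find the first granularity-aligned run of data inside the region
 * [start, start + len).  Returns 0 and fills in the output pointers if data
 * was found, or -ENODATA if the region holds no usable data.
 */
static int query_occupancy(const struct extent *ext, size_t nr,
                           long long start, size_t len, size_t granularity,
                           long long *_data_start, size_t *_data_len)
{
        long long end = start + (long long)len;
        long long g = (long long)granularity;
        size_t i;

        assert(granularity > 0);

        for (i = 0; i < nr; i++) {
                long long ds = ext[i].start;
                long long de = ext[i].start + (long long)ext[i].len;

                /* Trim the extent to the queried region. */
                if (ds < start)
                        ds = start;
                if (de > end)
                        de = end;

                /* Shrink inwards so both ends are aligned. */
                ds = (ds + g - 1) / g * g;
                de = de / g * g;

                if (ds < de) {
                        *_data_start = ds;
                        *_data_len = (size_t)(de - ds);
                        return 0;
                }
        }
        return -ENODATA;
}
```

Querying a region that begins before the first usable extent returns a data
start beyond the region start, which corresponds to the hole at the front
mentioned above.
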
API Function Reference
======================

.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c
.. kernel-doc:: fs/netfs/io.c