linux-2.6-microblaze.git
4 years agobtrfs: move variables for clustered allocation into find_free_extent_ctl
Naohiro Aota [Tue, 25 Feb 2020 03:56:18 +0000 (12:56 +0900)]
btrfs: move variables for clustered allocation into find_free_extent_ctl

Move "last_ptr" and "use_cluster" into struct find_free_extent_ctl, so
that hook functions for clustered allocator can use these variables.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: move hint_byte into find_free_extent_ctl
Naohiro Aota [Tue, 25 Feb 2020 03:56:17 +0000 (12:56 +0900)]
btrfs: move hint_byte into find_free_extent_ctl

This commit moves hint_byte into find_free_extent_ctl, so that we can
modify the hint_byte in the other functions. This will help us split
find_free_extent further. This commit also renames the function argument
"hint_byte" to "hint_byte_orig" to avoid misuse.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: introduce extent allocation policy
Naohiro Aota [Tue, 25 Feb 2020 03:56:16 +0000 (12:56 +0900)]
btrfs: introduce extent allocation policy

This commit introduces extent allocation policy for btrfs. This policy
controls how btrfs allocate an extents from block groups.  There is no
functional change introduced with this commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: parameterize dev_extent_min for chunk allocation
Naohiro Aota [Tue, 25 Feb 2020 03:56:15 +0000 (12:56 +0900)]
btrfs: parameterize dev_extent_min for chunk allocation

Currently, we ignore a device whose available space is less than
"BTRFS_STRIPE_LEN * dev_stripes". This is a lower limit for current
allocation policy (to maximize the number of stripes). This commit
parameterizes dev_extent_min, so that other policies can set their own
lower limitat to ignore a device with insufficient space.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: factor out create_chunk()
Naohiro Aota [Tue, 25 Feb 2020 03:56:14 +0000 (12:56 +0900)]
btrfs: factor out create_chunk()

Factor out create_chunk() from __btrfs_alloc_chunk(). This function
finally creates a chunk. There is no functional changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: factor out decide_stripe_size()
Naohiro Aota [Tue, 25 Feb 2020 03:56:13 +0000 (12:56 +0900)]
btrfs: factor out decide_stripe_size()

Factor out decide_stripe_size() from __btrfs_alloc_chunk(). This
function calculates the actual stripe size to allocate.
decide_stripe_size() handles the common case to round down the 'ndevs'
to 'devs_increment' and check the upper and lower limitation of 'ndevs'.
decide_stripe_size_regular() decides the size of a stripe and the size
of a chunk. The policy is to maximize the number of stripes.

This commit has no functional changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: factor out gather_device_info()
Naohiro Aota [Tue, 25 Feb 2020 03:56:12 +0000 (12:56 +0900)]
btrfs: factor out gather_device_info()

Factor out gather_device_info() from __btrfs_alloc_chunk(). This
function iterates over devices list and gather information about
devices. This commit also introduces "max_avail" and
"dev_extent_min" to fold the same calculation to one variable.
This commit has no functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: factor out init_alloc_chunk_ctl
Naohiro Aota [Tue, 25 Feb 2020 03:56:11 +0000 (12:56 +0900)]
btrfs: factor out init_alloc_chunk_ctl

Factor out init_alloc_chunk_ctl() from __btrfs_alloc_chunk(). This
function initialises parameters of "struct alloc_chunk_ctl" for
allocation.  init_alloc_chunk_ctl() handles a common part of the
initialisation to load the RAID parameters from btrfs_raid_array.
init_alloc_chunk_ctl_policy_regular() decides some parameters for its
allocation.

The last "else" case in the original code is moved to
__btrfs_alloc_chunk() to handle the error case in the common code.
Replace the BUG_ON with ASSERT() and error return at the same time.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: introduce alloc_chunk_ctl
Naohiro Aota [Tue, 25 Feb 2020 03:56:10 +0000 (12:56 +0900)]
btrfs: introduce alloc_chunk_ctl

Introduce "struct alloc_chunk_ctl" to wrap needed parameters for the
chunk allocation.  This will be used to split __btrfs_alloc_chunk() into
smaller functions.

This commit folds a number of local variables in __btrfs_alloc_chunk()
into one "struct alloc_chunk_ctl ctl". There is no functional change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: refactor find_free_dev_extent_start()
Naohiro Aota [Tue, 25 Feb 2020 03:56:09 +0000 (12:56 +0900)]
btrfs: refactor find_free_dev_extent_start()

Factor out two functions from find_free_dev_extent_start().
dev_extent_search_start() decides the starting position of the search.
dev_extent_hole_check() checks if a hole found is suitable for device
extent allocation.

These functions also have the switch-cases to change the allocation
behavior depending on the policy.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: introduce chunk allocation policy
Naohiro Aota [Tue, 25 Feb 2020 03:56:08 +0000 (12:56 +0900)]
btrfs: introduce chunk allocation policy

Introduce chunk allocation policy for btrfs. This policy controls how
chunks and device extents are allocated from devices.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: handle invalid profile in chunk allocation
Naohiro Aota [Tue, 25 Feb 2020 03:56:07 +0000 (12:56 +0900)]
btrfs: handle invalid profile in chunk allocation

Do not BUG_ON() when an invalid profile is passed to __btrfs_alloc_chunk().
Instead return -EINVAL with ASSERT() to catch a bug in the development
stage.

Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: change full_search to bool in find_free_extent_update_loop
Naohiro Aota [Tue, 25 Feb 2020 03:56:06 +0000 (12:56 +0900)]
btrfs: change full_search to bool in find_free_extent_update_loop

While the "full_search" variable defined in find_free_extent() is bool,
but the full_search argument of find_free_extent_update_loop() is
defined as int. Let's trivially fix the argument type.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: qgroup: Remove the unnecesaary spin lock for qgroup_rescan_running
Qu Wenruo [Fri, 7 Feb 2020 05:38:21 +0000 (13:38 +0800)]
btrfs: qgroup: Remove the unnecesaary spin lock for qgroup_rescan_running

After the previous patch, qgroup_rescan_running is protected by
btrfs_fs_info::qgroup_rescan_lock, thus no need for the extra spinlock.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: qgroup: ensure qgroup_rescan_running is only set when the worker is at least...
Qu Wenruo [Fri, 7 Feb 2020 05:38:20 +0000 (13:38 +0800)]
btrfs: qgroup: ensure qgroup_rescan_running is only set when the worker is at least queued

[BUG]
There are some reports about btrfs wait forever to unmount itself, with
the following call trace:

  INFO: task umount:4631 blocked for more than 491 seconds.
        Tainted: G               X  5.3.8-2-default #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  umount          D    0  4631   3337 0x00000000
  Call Trace:
  ([<00000000174adf7a>] __schedule+0x342/0x748)
   [<00000000174ae3ca>] schedule+0x4a/0xd8
   [<00000000174b1f08>] schedule_timeout+0x218/0x420
   [<00000000174af10c>] wait_for_common+0x104/0x1d8
   [<000003ff804d6994>] btrfs_qgroup_wait_for_completion+0x84/0xb0 [btrfs]
   [<000003ff8044a616>] close_ctree+0x4e/0x380 [btrfs]
   [<0000000016fa3136>] generic_shutdown_super+0x8e/0x158
   [<0000000016fa34d6>] kill_anon_super+0x26/0x40
   [<000003ff8041ba88>] btrfs_kill_super+0x28/0xc8 [btrfs]
   [<0000000016fa39f8>] deactivate_locked_super+0x68/0x98
   [<0000000016fcb198>] cleanup_mnt+0xc0/0x140
   [<0000000016d6a846>] task_work_run+0xc6/0x110
   [<0000000016d04f76>] do_notify_resume+0xae/0xb8
   [<00000000174b30ae>] system_call+0xe2/0x2c8

[CAUSE]
The problem happens when we have called qgroup_rescan_init(), but
not queued the worker. It can be caused mostly by error handling.

Qgroup ioctl thread | Unmount thread
----------------------------------------+-----------------------------------
|
btrfs_qgroup_rescan() |
|- qgroup_rescan_init() |
|  |- qgroup_rescan_running = true; |
| |
|- trans = btrfs_join_transaction() |
|  Some error happened |
| |
|- btrfs_qgroup_rescan() returns error |
   But qgroup_rescan_running == true; |
| close_ctree()
| |- btrfs_qgroup_wait_for_completion()
|    |- running == true;
|    |- wait_for_completion();

btrfs_qgroup_rescan_worker is never queued, thus no one is going to wake
up close_ctree() and we get a deadlock.

All involved qgroup_rescan_init() callers are:

- btrfs_qgroup_rescan()
  The example above. It's possible to trigger the deadlock when error
  happened.

- btrfs_quota_enable()
  Not possible. Just after qgroup_rescan_init() we queue the work.

- btrfs_read_qgroup_config()
  It's possible to trigger the deadlock. It only init the work, the
  work queueing happens in btrfs_qgroup_rescan_resume().
  Thus if error happened in between, deadlock is possible.

We shouldn't set fs_info->qgroup_rescan_running just in
qgroup_rescan_init(), as at that stage we haven't yet queued qgroup
rescan worker to run.

[FIX]
Set qgroup_rescan_running before queueing the work, so that we ensure
the rescan work is queued when we wait for it.

Fixes: 8d9eddad1946 ("Btrfs: fix qgroup rescan worker initialization")
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ Change subject and cause analyse, use a smaller fix ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agouuid: Remove no more needed macro
Andy Shevchenko [Mon, 24 Feb 2020 15:37:52 +0000 (17:37 +0200)]
uuid: Remove no more needed macro

uuid_le_gen() is no used anymore, remove it for good.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: switch to use new generic UUID API
Andy Shevchenko [Mon, 24 Feb 2020 15:37:51 +0000 (17:37 +0200)]
btrfs: switch to use new generic UUID API

There are new types and helpers that are supposed to be used in new code.

As a preparation to get rid of legacy types and API functions do
the conversion here.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agouuid: Provide a GUID generator for raw buffer
Andy Shevchenko [Mon, 24 Feb 2020 15:37:50 +0000 (17:37 +0200)]
uuid: Provide a GUID generator for raw buffer

In some cases we would like to generate a GUID and export it.  Though it
would require either casting to internal kernel types or an intermediate
buffer. Instead we may achieve this by supplying a pointer to raw buffer
and make a complimentary API to existing one for UUIDs.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agouuid: Add inline helpers to import / export UUIDs
Andy Shevchenko [Mon, 24 Feb 2020 15:37:49 +0000 (17:37 +0200)]
uuid: Add inline helpers to import / export UUIDs

Sometimes we may need to import UUID from or export to the raw buffer,
which is provided outside of kernel and can't be declared as UUID type.
With current API this operation will require an explicit casting
to one of UUID types and length, that is always a constant derived as sizeof
the certain UUID type.

Provide a helpful set of inline helpers to minimize developer's effort
in the cases when raw buffers are involved.

Suggested-by: David Sterba <dsterba@suse.cz>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Don't submit any btree write bio if the fs has errors
Qu Wenruo [Wed, 12 Feb 2020 06:12:44 +0000 (14:12 +0800)]
btrfs: Don't submit any btree write bio if the fs has errors

[BUG]
There is a fuzzed image which could cause KASAN report at unmount time.

  BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
  Read of size 8 at addr ffff888067cf6848 by task umount/1922

  CPU: 0 PID: 1922 Comm: umount Tainted: G        W         5.0.21 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
  Call Trace:
   dump_stack+0x5b/0x8b
   print_address_description+0x70/0x280
   kasan_report+0x13a/0x19b
   btrfs_queue_work+0x2c1/0x390
   btrfs_wq_submit_bio+0x1cd/0x240
   btree_submit_bio_hook+0x18c/0x2a0
   submit_one_bio+0x1be/0x320
   flush_write_bio.isra.41+0x2c/0x70
   btree_write_cache_pages+0x3bb/0x7f0
   do_writepages+0x5c/0x130
   __writeback_single_inode+0xa3/0x9a0
   writeback_single_inode+0x23d/0x390
   write_inode_now+0x1b5/0x280
   iput+0x2ef/0x600
   close_ctree+0x341/0x750
   generic_shutdown_super+0x126/0x370
   kill_anon_super+0x31/0x50
   btrfs_kill_super+0x36/0x2b0
   deactivate_locked_super+0x80/0xc0
   deactivate_super+0x13c/0x150
   cleanup_mnt+0x9a/0x130
   task_work_run+0x11a/0x1b0
   exit_to_usermode_loop+0x107/0x130
   do_syscall_64+0x1e5/0x280
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

[CAUSE]
The fuzzed image has a completely screwd up extent tree:

  leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
  refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
          item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
                  extent refs 1 gen 9 flags 1
                  ref#0: extent data backref root 5 objectid 259 offset 0 count 1
          item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
                  extent refs 1 gen 9 flags 1
                  ref#0: extent data backref root 5 objectid 271 offset 0 count 1
          item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
                  extent refs 1 gen 9 flags 1
                  ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
          item 3 key (29360128 169 0) itemoff 3803 itemsize 33
                  extent refs 1 gen 9 flags 2
                  ref#0: tree block backref root 5
          item 4 key (29368320 169 1) itemoff 3770 itemsize 33
                  extent refs 1 gen 9 flags 2
                  ref#0: tree block backref root 5
          item 5 key (29372416 169 0) itemoff 3737 itemsize 33
                  extent refs 1 gen 9 flags 2
                  ref#0: tree block backref root 5

Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.

In short, the bug is caused by:

- Existing tree block gets allocated to log tree
  This got its generation bumped.

- Log tree balance cleaned dirty bit of offending tree block
  It will not be written back to disk, thus no WRITTEN flag.

- Original owner of the tree block gets COWed
  Since the tree block has higher transid, no WRITTEN flag, it's reused,
  and not traced by transaction::dirty_pages.

- Transaction aborted
  Tree blocks get cleaned according to transaction::dirty_pages. But the
  offending tree block is not recorded at all.

- Filesystem unmount
  All pages are assumed to be are clean, destroying all workqueue, then
  call iput(btree_inode).
  But offending tree block is still dirty, which triggers writeback, and
  causes use-after-free bug.

The detailed sequence looks like this:

- Initial status
  eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
      not traced by any dirty extent_iot_tree.

- New tree block is allocated
  Since there is no backref for 29421568, it's re-allocated as new tree
  block.
  Keep in mind that tree block 29421568 is still referred by extent
  tree.

- Tree block 29421568 is filled for log tree
  eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
      traced by btrfs_root::dirty_log_pages

- Some log tree operations
  Since the fs is using node size 4096, the log tree can easily go a
  level higher.

- Log tree needs balance
  Tree block 29421568 gets all its content pushed to right, thus now
  it is empty, and we don't need it.
  btrfs_clean_tree_block() from __push_leaf_right() get called.

  eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
      traced by btrfs_root::dirty_log_pages

- Log tree write back
  btree_write_cache_pages() goes through dirty pages ranges, but since
  page of tree block 29421568 gets cleaned already, it's not written
  back to disk. Thus it doesn't have WRITTEN bit set.
  But ranges in dirty_log_pages are cleared.

  eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
      not traced by any dirty extent_iot_tree.

- Extent tree update when committing transaction
  Since tree block 29421568 has transid equal to running trans, and has
  no WRITTEN bit, should_cow_block() will use it directly without adding
  it to btrfs_transaction::dirty_pages.

  eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
      not traced by any dirty extent_iot_tree.

  At this stage, we're doomed. We have a dirty eb not tracked by any
  extent io tree.

- Transaction gets aborted due to corrupted extent tree
  Btrfs cleans up dirty pages according to transaction::dirty_pages and
  btrfs_root::dirty_log_pages.
  But since tree block 29421568 is not tracked by neither of them, it's
  still dirty.

  eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
      not traced by any dirty extent_iot_tree.

- Filesystem unmount
  Since all cleanup is assumed to be done, all workqueus are destroyed.
  Then iput(btree_inode) is called, expecting no dirty pages.
  But tree 29421568 is still dirty, thus triggering writeback.
  Since all workqueues are already freed, we cause use-after-free.

This shows us that, log tree blocks + bad extent tree can cause wild
dirty pages.

[FIX]
To fix the problem, don't submit any btree write bio if the filesytem
has any error.  This is the last safe net, just in case other cleanup
haven't caught catch it.

Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: ioctl: resize: only show message if size is changed
Marcos Paulo de Souza [Tue, 11 Feb 2020 13:55:26 +0000 (10:55 -0300)]
btrfs: ioctl: resize: only show message if size is changed

There is no point to inform the user about size change if there's none.
Update the message to conform to a commonly used format where the path
and devid are printed and also print old and new sizes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Marcos Paulo de Souza <marcos@mpdesouza.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance message ]
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: slightly simplify global block reserve calculations
Anand Jain [Tue, 4 Feb 2020 11:05:58 +0000 (19:05 +0800)]
btrfs: slightly simplify global block reserve calculations

In btrfs_update_global_block_rsv the lines:

  num_bytes = block_rsv->size - block_rsv->reserved;
  block_rsv->reserved += num_bytes;

imply:

  block_rsv->reserved = block_rsv->size;

Assign block_rsv->size to block_rsv->reserved directly and reorder lines
so they match the other branch.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: merge unlocking to common exit block in btrfs_commit_transaction
David Sterba [Thu, 28 Nov 2019 15:03:00 +0000 (16:03 +0100)]
btrfs: merge unlocking to common exit block in btrfs_commit_transaction

The tree_log_mutex and reloc_mutex locks are properly nested so we can
simplify error handling and add labels for them. This reduces line count
of the function.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: reduce pointer intdirections in btree_readpage_end_io_hook
David Sterba [Thu, 28 Nov 2019 15:15:04 +0000 (16:15 +0100)]
btrfs: reduce pointer intdirections in btree_readpage_end_io_hook

All we need to read is checksum size from fs_info superblock, and
fs_info is provided by extent buffer so we can get rid of the wild
pointer indirections from page/inode/root.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: adjust delayed refs message level
David Sterba [Thu, 28 Nov 2019 14:34:28 +0000 (15:34 +0100)]
btrfs: adjust delayed refs message level

The message seems to be for debugging and has little value for users.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: replace u_long type cast with unsigned long
David Sterba [Thu, 28 Nov 2019 14:31:46 +0000 (15:31 +0100)]
btrfs: replace u_long type cast with unsigned long

We don't use the u_XX types anywhere, though they're defined.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: raid56: simplify sort_parity_stripes
David Sterba [Thu, 28 Nov 2019 14:31:17 +0000 (15:31 +0100)]
btrfs: raid56: simplify sort_parity_stripes

Remove trivial comprator and open coded swap of two values.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: adjust message level for unrecognized mount option
David Sterba [Wed, 27 Nov 2019 15:02:05 +0000 (16:02 +0100)]
btrfs: adjust message level for unrecognized mount option

An unrecognized option is a failure that should get user/administrator
attention, the info level is often below what gets logged, so make it
error.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: simplify parameters of btrfs_set_disk_extent_flags
David Sterba [Wed, 20 Mar 2019 10:54:13 +0000 (11:54 +0100)]
btrfs: simplify parameters of btrfs_set_disk_extent_flags

All callers pass extent buffer start and length so the extent buffer
itself should work fine.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: open code trivial helper btrfs_header_chunk_tree_uuid
David Sterba [Wed, 20 Mar 2019 12:17:13 +0000 (13:17 +0100)]
btrfs: open code trivial helper btrfs_header_chunk_tree_uuid

The helper btrfs_header_chunk_tree_uuid follows naming convention of
other struct accessors but does something compeletly different. As the
offsetof calculation is clear in the context of extent buffer operations
we can remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: open code trivial helper btrfs_header_fsid
David Sterba [Wed, 20 Mar 2019 12:15:57 +0000 (13:15 +0100)]
btrfs: open code trivial helper btrfs_header_fsid

The helper btrfs_header_fsid follows naming convention of other struct
accessors but does something compeletly different. As the offsetof
calculation is clear in the context of extent buffer operations we can
remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: move mapping of block for discard to its caller
David Sterba [Thu, 2 Aug 2018 22:36:29 +0000 (00:36 +0200)]
btrfs: move mapping of block for discard to its caller

There's a simple forwarded call based on the operation that would better
fit the caller btrfs_map_block that's until now a trivial wrapper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use struct_size to calculate size of raid hash table
David Sterba [Fri, 29 Mar 2019 01:07:02 +0000 (02:07 +0100)]
btrfs: use struct_size to calculate size of raid hash table

The struct_size macro does the same calculation and is safe regarding
overflows. Though we're not expecting them to happen, use the helper for
clarity.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: convert snapshot/nocow exlcusion to drew lock
Nikolay Borisov [Thu, 30 Jan 2020 12:59:45 +0000 (14:59 +0200)]
btrfs: convert snapshot/nocow exlcusion to drew lock

This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drew
lock to ensure this invariant still holds.

'Readers' are snapshot creators from create_snapshot and 'writers' are
nocow writers from buffered write path or btrfs_setsize. This locking
scheme allows for multiple snapshots to happen while any nocow writers
are blocked, since writes to page cache in the nocow path will make
snapshots inconsistent.

So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots and also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.

The main gain from using the drew lock is it's now a lot easier to
reason about the guarantees of the locking scheme and whether there is
some silent breakage lurking.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Implement DREW lock
Nikolay Borisov [Thu, 30 Jan 2020 12:59:44 +0000 (14:59 +0200)]
btrfs: Implement DREW lock

A (D)ouble (R)eader (W)riter (E)xclustion lock is a locking primitive
that allows to have multiple readers or multiple writers but not
multiple readers and writers holding it concurrently.

The code is factored out from the existing open-coded locking scheme
used to exclude pending snapshots from nocow writers and vice-versa.
Current implementation actually favors Readers (that is snapshot
creaters) to writers (nocow writers of the filesystem).

The API provides lock/unlock/trylock for reads and writes.

Formal specification for TLA+ provided by Valentin Schneider is at
https://lore.kernel.org/linux-btrfs/2dcaf81c-f0d3-409e-cb29-733d8b3b4cc9@arm.com/

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: simplify error handling in __btrfs_write_out_cache()
Johannes Thumshirn [Tue, 11 Feb 2020 15:10:22 +0000 (00:10 +0900)]
btrfs: simplify error handling in __btrfs_write_out_cache()

The error cleanup gotos in __btrfs_write_out_cache() needlessly jump
back making the code less readable then needed.  Flatten them out so no
back-jump is necessary and the read flow is uninterrupted.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use standard debug config option to enable free-space-cache debug prints
Johannes Thumshirn [Tue, 11 Feb 2020 15:10:21 +0000 (00:10 +0900)]
btrfs: use standard debug config option to enable free-space-cache debug prints

free-space-cache.c has it's own set of DEBUG ifdefs which need to be
turned on instead of the global CONFIG_BTRFS_DEBUG to print debug
messages about failed block-group writes.

Switch this over to CONFIG_BTRFS_DEBUG so we always see these messages
when running a debug kernel.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: make the uptodate argument of io_ctl_add_pages() boolean
Johannes Thumshirn [Tue, 11 Feb 2020 15:10:20 +0000 (00:10 +0900)]
btrfs: make the uptodate argument of io_ctl_add_pages() boolean

Make the uptodate argument of io_ctl_add_pages() boolean.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use inode from io_ctl in io_ctl_prepare_pages
Johannes Thumshirn [Tue, 11 Feb 2020 15:10:19 +0000 (00:10 +0900)]
btrfs: use inode from io_ctl in io_ctl_prepare_pages

io_ctl_prepare_pages() gets a 'struct btrfs_io_ctl' as well as a 'struct
inode', but btrfs_io_ctl::inode points to the same struct inode as this is
assgined in io_ctl_init().

Use the inode form io_ctl to reduce the arguments of io_ctl_prepare_pages.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl
Marcos Paulo de Souza [Fri, 7 Feb 2020 13:05:46 +0000 (10:05 -0300)]
btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl

This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:

/
@subvol1/
@subvol2/
@subvol_default/

If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, thus no way to delete them. Current subvolume delete ioctl
takes a file handle point as argument, and if @subvol_default is
mounted, we can't reach @subvol1 and @subvol2 from the same mount point.

This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
the extended structure with flags to allow to delete subvolume using
subvolid.

Now, we can use this new ioctl specifying the subvolume id and refer to
the same mount point. It doesn't matter which subvolume was mounted,
since we can reach to the desired one using the subvolume id, and then
delete it.

The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.

The volume args v2 structure is extended to use the existing union for
subvolume id specification, that's valid in case the
BTRFS_SUBVOL_SPEC_BY_ID is set.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: export helpers for subvolume name/id resolution
Marcos Paulo de Souza [Fri, 21 Feb 2020 13:56:12 +0000 (14:56 +0100)]
btrfs: export helpers for subvolume name/id resolution

The functions will be used outside of export.c and super.c to allow
resolving subvolume name from a given id, eg. for subvolume deletion by
id ioctl.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ split from the next patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use ioctl args support mask for device delete
David Sterba [Fri, 21 Feb 2020 12:30:14 +0000 (13:30 +0100)]
btrfs: use ioctl args support mask for device delete

When the device remove v2 ioctl was added, the full support mask was
added to sanity check the flags. However this would allow to let the
subvolume related flags to be accepted. This is not supposed to happen.

Use the correct support mask, which means that now any of
BTRFS_SUBVOL_CREATE_ASYNC, BTRFS_SUBVOL_RDONLY or
BTRFS_SUBVOL_QGROUP_INHERIT will be rejected as ENOTSUPP. Though this is
a user-visible change, specifying subvolume flags for device deletion
does not make sense and there are hopefully no applications doing that.

Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use ioctl args support mask for subvolume create/delete
David Sterba [Fri, 21 Feb 2020 12:24:37 +0000 (13:24 +0100)]
btrfs: use ioctl args support mask for subvolume create/delete

Using the defined mask instead of flag enumeration in the ioctl handler
is preferred. No functional changes.

Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: define support masks for ioctl volume args v2
David Sterba [Fri, 21 Feb 2020 12:16:33 +0000 (13:16 +0100)]
btrfs: define support masks for ioctl volume args v2

The ioctl data for devices or subvolumes can be passed via
btrfs_ioctl_vol_args or btrfs_ioctl_vol_args_v2. The latter is more
versatile and needs some caution as some of the flags make sense only
for some ioctls.

As we're going to extend the flags, define support masks for each ioctl
class separately.

Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Add missing lock annotation for release_extent_buffer()
Jules Irenge [Sun, 23 Feb 2020 23:16:42 +0000 (23:16 +0000)]
btrfs: Add missing lock annotation for release_extent_buffer()

Sparse reports a warning at release_extent_buffer()
warning: context imbalance in release_extent_buffer() - unexpected unlock

The root cause is the missing annotation at release_extent_buffer()
Add the missing __releases(&eb->refs_lock) annotation

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: set update the uuid generation as soon as possible
Josef Bacik [Fri, 14 Feb 2020 20:22:06 +0000 (15:22 -0500)]
btrfs: set update the uuid generation as soon as possible

In my EIO stress testing I noticed I was getting forced to rescan the
uuid tree pretty often, which was weird.  This is because my error
injection stuff would sometimes inject an error after log replay but
before we loaded the UUID tree.  If log replay committed the transaction
it wouldn't have updated the uuid tree generation, but the tree was
valid and didn't change, so there's no reason to not update the
generation here.

Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately
after reading all the fs roots if the uuid tree generation matches the
fs generation.  Then any transaction commits that happen during mount
won't screw up our uuid tree state, forcing us to do needless uuid
rescans.

Fixes: 70f801754728 ("Btrfs: check UUID tree during mount if required")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: bail out of uuid tree scanning if we're closing
Josef Bacik [Fri, 14 Feb 2020 20:05:01 +0000 (15:05 -0500)]
btrfs: bail out of uuid tree scanning if we're closing

In doing my fsstress+EIO stress testing I started running into issues
where umount would get stuck forever because the uuid checker was
chewing through the thousands of subvolumes I had created.

We shouldn't block umount on this, simply bail if we're unmounting the
fs.  We need to make sure we don't mark the UUID tree as ok, so we only
set that bit if we made it through the whole rescan operation, but
otherwise this is completely safe.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: make btrfs_check_uuid_tree private to disk-io.c
Nikolay Borisov [Tue, 18 Feb 2020 14:56:08 +0000 (16:56 +0200)]
btrfs: make btrfs_check_uuid_tree private to disk-io.c

It's used only during filesystem mount as such it can be made private to
disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
as btrfs_check_uuid_tree is its sole caller.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: call btrfs_check_uuid_tree_entry directly in btrfs_uuid_tree_iterate
Nikolay Borisov [Tue, 18 Feb 2020 14:56:07 +0000 (16:56 +0200)]
btrfs: call btrfs_check_uuid_tree_entry directly in btrfs_uuid_tree_iterate

btrfs_uuid_tree_iterate is called from only once place and its 2nd
argument is always btrfs_check_uuid_tree_entry. Simplify
btrfs_uuid_tree_iterate's signature by removing its 2nd argument and
directly calling btrfs_check_uuid_tree_entry. Also move the latter into
uuid-tree.h. No functional changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: raid56: simplify tracking of Q stripe presence
David Sterba [Wed, 19 Feb 2020 14:17:20 +0000 (15:17 +0100)]
btrfs: raid56: simplify tracking of Q stripe presence

There are temporary variables tracking the index of P and Q stripes, but
none of them is really used as such, merely for determining if the Q
stripe is present. This leads to compiler warnings with
-Wunused-but-set-variable and has been reported several times.

fs/btrfs/raid56.c: In function â€˜finish_rmw’:
fs/btrfs/raid56.c:1199:6: warning: variable â€˜p_stripe’ set but not used [-Wunused-but-set-variable]
 1199 |  int p_stripe = -1;
      |      ^~~~~~~~
fs/btrfs/raid56.c: In function â€˜finish_parity_scrub’:
fs/btrfs/raid56.c:2356:6: warning: variable â€˜p_stripe’ set but not used [-Wunused-but-set-variable]
 2356 |  int p_stripe = -1;
      |      ^~~~~~~~

Replace the two variables with one that has a clear meaning and also get
rid of the warnings. The logic that verifies that there are only 2
valid cases is unchanged.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: backref, use correct count to resolve normal data refs
ethanwu [Fri, 7 Feb 2020 09:38:18 +0000 (17:38 +0800)]
btrfs: backref, use correct count to resolve normal data refs

With the following patches:

- btrfs: backref, only collect file extent items matching backref offset
- btrfs: backref, not adding refs from shared block when resolving normal backref
- btrfs: backref, only search backref entries from leaves of the same root

we only collect the normal data refs we want, so the imprecise upper
bound total_refs of that EXTENT_ITEM could now be changed to the count
of the normal backref entry we want to search.

Background and how the patches fit together:

Btrfs has two types of data backref.
For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
exact block number. Therefore, we need to call resolve_indirect_refs.
It uses btrfs_search_slot to locate the leaf block. Then
we need to walk through the leaves to search for the EXTENT_DATA items
that have disk bytenr matching the extent item (add_all_parents).

When resolving indirect refs, we could take entries that don't
belong to the backref entry we are searching for right now.
For that reason when searching backref entry, we always use total
refs of that EXTENT_ITEM rather than individual count.

For example:
item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
  extent refs 24 gen 7302 flags DATA
  shared data backref parent 394985472 count 10 #1
  extent data backref root 257 objectid 260 offset 1048576 count 3 #2
  extent data backref root 256 objectid 260 offset 65536 count 6 #3
  extent data backref root 257 objectid 260 offset 65536 count 5 #4

For example, when searching backref entry #4, we'll use total_refs
24, a very loose loop ending condition, instead of total_refs = 5.

But using total_refs = 24 is not accurate. Sometimes, we'll never find
all the refs from specific root.  As a result, the loop keeps on going
until we reach the end of that inode.

The first 3 patches, handle 3 different types refs we might encounter.
These refs do not belong to the normal backref we are searching, and
hence need to be skipped.

This patch changes the total_refs to correct number so that we could
end loop as soon as we find all the refs we want.

btrfs send uses backref to find possible clone sources, the following
is a simple test to compare the results with and without this patch:

 $ btrfs subvolume create /sub1
 $ for i in `seq 1 163840`; do
     dd if=/dev/zero of=/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct
   done
 $ btrfs subvolume snapshot /sub1 /sub2
 $ for i in `seq 1 163840`; do
     dd if=/dev/zero of=/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct
   done
 $ btrfs subvolume snapshot -r /sub1 /snap1
 $ time btrfs send /snap1 | btrfs receive /volume2

Without this patch:

real 69m48.124s
user 0m50.199s
sys  70m15.600s

With this patch:

real    1m59.683s
user    0m35.421s
sys     2m42.684s

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
[ add patchset cover letter with background and numbers ]
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: backref, only search backref entries from leaves of the same root
ethanwu [Fri, 7 Feb 2020 09:38:17 +0000 (17:38 +0800)]
btrfs: backref, only search backref entries from leaves of the same root

We could have some nodes/leaves in subvolume whose owner are not the
that subvolume. In this way, when we resolve normal backrefs of that
subvolume, we should avoid collecting those references from these blocks.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: backref, don't add refs from shared block when resolving normal backref
ethanwu [Fri, 7 Feb 2020 09:38:16 +0000 (17:38 +0800)]
btrfs: backref, don't add refs from shared block when resolving normal backref

All references from the block of SHARED_DATA_REF belong to that shared
block backref.

For example:

  item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
      extent refs 24 gen 7302 flags DATA
      extent data backref root 257 objectid 260 offset 65536 count 5
      extent data backref root 258 objectid 265 offset 0 count 9
      shared data backref parent 394985472 count 10

Block 394985472 might be leaf from root 257, and the item obejctid and
(file_pos - file_extent_item::offset) in that leaf just happens to be
260 and 65536 which is equal to the first extent data backref entry.

Before this patch, when we resolve backref:

  root 257 objectid 260 offset 65536

we will add those refs in block 394985472 and wrongly treat those as the
refs we want.

Fix this by checking if the leaf we are processing is shared data
backref, if so, just skip this leaf.

Shared data refs added into preftrees.direct have all entry value = 0
(root_id = 0, key = NULL, level = 0) except parent entry.

Other refs from indirect tree will have key value and root id != 0, and
these values won't be changed when their parent is resolved and added to
preftrees.direct. Therefore, we could reuse the preftrees.direct and
search ref with all values = 0 except parent is set to avoid getting
those resolved refs block.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: backref, only collect file extent items matching backref offset
ethanwu [Fri, 7 Feb 2020 09:38:15 +0000 (17:38 +0800)]
btrfs: backref, only collect file extent items matching backref offset

When resolving one backref of type EXTENT_DATA_REF, we collect all
references that simply reference the EXTENT_ITEM even though their
(file_pos - file_extent_item::offset) are not the same as the
btrfs_extent_data_ref::offset we are searching for.

This patch adds additional check so that we only collect references whose
(file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: remove buffer_heads form super block mirror integrity checking
Johannes Thumshirn [Thu, 13 Feb 2020 15:24:36 +0000 (00:24 +0900)]
btrfs: remove buffer_heads form super block mirror integrity checking

The integrity checking code for the super block mirrors is the last
remaining user of buffer_heads, change it to using plain bios as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: remove buffer_heads from btrfsic_process_written_block()
Johannes Thumshirn [Thu, 13 Feb 2020 15:24:35 +0000 (00:24 +0900)]
btrfs: remove buffer_heads from btrfsic_process_written_block()

Now that the last caller of btrfsic_process_written_block() with
buffer_heads is gone, remove the buffer_head processing path from it as
well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: remove btrfsic_submit_bh()
Johannes Thumshirn [Thu, 13 Feb 2020 15:24:34 +0000 (00:24 +0900)]
btrfs: remove btrfsic_submit_bh()

Now that the last use of btrfsic_submit_bh() is gone as the super block
is now written using bios, remove the function as well.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use bios instead of buffer_heads from super block writeout
Johannes Thumshirn [Thu, 13 Feb 2020 15:24:33 +0000 (00:24 +0900)]
btrfs: use bios instead of buffer_heads from super block writeout

Similar to the superblock read path, change the write path to using bios
and pages instead of buffer_heads. This allows us to skip over the
buffer_head code, for writing the superblock to disk.

This is based on a patch originally authored by Nikolay Borisov.

Co-developed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use the page cache for super block reading
Johannes Thumshirn [Thu, 13 Feb 2020 15:24:32 +0000 (00:24 +0900)]
btrfs: use the page cache for super block reading

Super-block reading in BTRFS is done using buffer_heads. Buffer_heads
have some drawbacks, like not being able to propagate errors from the
lower layers.

Directly use the page cache for reading the super blocks from disk or
invalidating an on-disk super block. We have to use the page cache so to
avoid races between mkfs and udev. See also 6f60cbd3ae44 ("btrfs: access
superblock via pagecache in scan_one_device").

This patch unwraps the buffer head API and does not change the way the
super block is actually read.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: reduce scope of btrfs_scratch_superblocks()
Johannes Thumshirn [Thu, 13 Feb 2020 15:24:31 +0000 (00:24 +0900)]
btrfs: reduce scope of btrfs_scratch_superblocks()

btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so
remove it from the header file and mark it as static.  Also move it
above it's callers so we don't need a forward declaration.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: don't kmap() pages from block devices
Johannes Thumshirn [Thu, 13 Feb 2020 15:24:30 +0000 (00:24 +0900)]
btrfs: don't kmap() pages from block devices

Block device mappings are never in highmem so kmap() / kunmap() calls for
pages from block devices are unneeded. Use page_address() instead of
kmap() to get to the virtual addreses.

While we're at it, read_cache_page_gfp() doesn't return NULL on error,
only an ERR_PTR, so use IS_ERR() to check for errors.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Export btrfs_release_disk_super
Nikolay Borisov [Thu, 13 Feb 2020 15:24:29 +0000 (00:24 +0900)]
btrfs: Export btrfs_release_disk_super

Preparatory patch for removal of buffer_head usage in btrfs.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agoBtrfs: avoid unnecessary splits when setting bits on an extent io tree
Filipe Manana [Thu, 13 Feb 2020 10:20:02 +0000 (10:20 +0000)]
Btrfs: avoid unnecessary splits when setting bits on an extent io tree

When attempting to set bits on a range of an exent io tree that already
has those bits set we can end up splitting an extent state record, use
the preallocated extent state record, insert it into the red black tree,
do another search on the red black tree, merge the preallocated extent
state record with the previous extent state record, remove that previous
record from the red black tree and then free it. This is all unnecessary
work that consumes time.

This happens specifically at the following case at __set_extent_bit():

  $ cat -n fs/btrfs/extent_io.c
   957  static int __must_check
   958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
  (...)
  1044          /*
  1045           *     | ---- desired range ---- |
  1046           * | state |
  1047           *   or
  1048           * | ------------- state -------------- |
  1049           *
  (...)
  1060          if (state->start < start) {
  1061                  if (state->state & exclusive_bits) {
  1062                          *failed_start = start;
  1063                          err = -EEXIST;
  1064                          goto out;
  1065                  }
  1066
  1067                  prealloc = alloc_extent_state_atomic(prealloc);
  1068                  BUG_ON(!prealloc);
  1069                  err = split_state(tree, state, prealloc, start);
  1070                  if (err)
  1071                          extent_io_tree_panic(tree, err);
  1072
  1073                  prealloc = NULL;

So if our extent state represents a range from 0 to 1MiB for example, and
we want to set bits in the range 128KiB to 256KiB for example, and that
extent state record already has all those bits set, we end up splitting
that record, so we end up with extent state records in the tree which
represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is
temporary because a subsequent iteration in that function will end up
merging the records.

The splitting requires using the preallocated extent state record, so
a future iteration that needs to do another split will need to allocate
another extent state record in an atomic context, something not ideal
that we try to avoid as much as possible. The splitting also requires
an insertion in the red black tree, and a subsequent merge will require
a deletion from the red black tree and freeing an extent state record.

This change just skips the splitting of an extent state record when it
already has all the bits the we need to set.

Setting a bit that is already set for a range is very common in the
inode's 'file_extent_tree' extent io tree for example, where we keep
setting the EXTENT_DIRTY bit every time we replace an extent.

This change also fixes a bug that happens after the recent patchset from
Josef that avoids having implicit holes after a power failure when not
using the NO_HOLES feature, more specifically the patch with the subject:

  "btrfs: introduce the inode->file_extent_tree"

This patch introduced an extent io tree per inode to keep track of
completed ordered extents and figure out at any time what is the safe
value for the inode's disk_i_size. This assumes that for contiguous
ranges in a file we always end up with a single extent state record in
the io tree, but that is not the case, as there is a short time window
where we can have two extent state records representing contiguous
ranges. When this happens we end setting up an incorrect value for the
inode's disk_i_size, resulting in data loss after a clean unmount
of the filesystem. The following example explains how this can happen.

Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
the inode's file_extent_tree we have a single extent state record that
represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
following steps happen:

1) A buffered write against file range [512KiB, 768KiB) is made. At this
   point delalloc was not flushed yet;

2) Deduplication from some other inode into this inode's range
   [128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
   to be called, from btrfs_insert_clone_extent(), to mark the range
   [128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;

3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
   end up at __set_extent_bit(). In the first iteration of that function's
   loop we end up in the following branch:

   $ cat -n fs/btrfs/extent_io.c
    957  static int __must_check
    958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
   (...)
   1044          /*
   1045           *     | ---- desired range ---- |
   1046           * | state |
   1047           *   or
   1048           * | ------------- state -------------- |
   1049           *
   (...)
   1060          if (state->start < start) {
   1061                  if (state->state & exclusive_bits) {
   1062                          *failed_start = start;
   1063                          err = -EEXIST;
   1064                          goto out;
   1065                  }
   1066
   1067                  prealloc = alloc_extent_state_atomic(prealloc);
   1068                  BUG_ON(!prealloc);
   1069                  err = split_state(tree, state, prealloc, start);
   1070                  if (err)
   1071                          extent_io_tree_panic(tree, err);
   1072
   1073                  prealloc = NULL;
   (...)
   1089                  goto search_again;

   This splits the state record into two, one for range [0, 128KiB) and
   another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
   bit set. Then we jump to the 'search_again' label, where we unlock the
   the spinlock protecting the extent io tree before jumping to the
   'again' label to perform the next iteration;

4) In the meanwhile, delalloc is flushed, the ordered extent for the range
   [512KiB, 768KiB) is created and when it completes, at
   btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
   with a value of 0 for its 'new_size' argument;

5) Before the deduplication task currently at __set_extent_bit() moves to
   the next iteration, the task finishing the ordered extent calls
   find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
   and gets 'start' set to 0 and 'end' set to 128KiB - because at this
   moment the io tree has two extent state records, one representing the
   range [0, 128KiB) and another representing the range [128KiB, 1MiB),
   both with EXTENT_DIRTY set. Then we set 'isize' to:

   isize = min(isize, end + 1)
         = min(1MiB, 128KiB - 1 + 1)
         = 128KiB

   Then we set the inode's disk_i_size to 128KiB (isize).

   After a clean unmount of the filesystem and mounting it again, we have
   the file with a size of 128KiB, and effectively lost all the data it
   had before in the range from 128KiB to 1MiB.

This change fixes that issue too, as we never end up splitting extent
state records when they already have all the bits we want set.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: handle logged extent failure properly
Josef Bacik [Thu, 13 Feb 2020 15:47:30 +0000 (10:47 -0500)]
btrfs: handle logged extent failure properly

If we're allocating a logged extent we attempt to insert an extent
record for the file extent directly.  We increase
space_info->bytes_reserved, because the extent entry addition will call
btrfs_update_block_group(), which will convert the ->bytes_reserved to
->bytes_used.  However if we fail at any point while inserting the
extent entry we will bail and leave space on ->bytes_reserved, which
will trigger a WARN_ON() on umount.  Fix this by pinning the space if we
fail to insert, which is what happens in every other failure case that
involves adding the extent entry.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: relocation: Remove is_cowonly_root()
Qu Wenruo [Wed, 12 Feb 2020 07:43:31 +0000 (15:43 +0800)]
btrfs: relocation: Remove is_cowonly_root()

This function is only used in read_fs_root(), which is just a wrapper of
btrfs_get_fs_root().

For all the mentioned essential roots except log root tree,
btrfs_get_fs_root() has its own quick path to grab them from fs_info
directly, thus no need for key.offset modification.

For subvolume trees, btrfs_get_fs_root() with key.offset == -1 is
completely fine.

For log trees and log root tree, it's impossible to hit them, as for
relocation all backrefs are fetched from commit root, which never
records log tree blocks.

Log tree blocks either get freed in regular transaction commit, or
replayed at mount time. At runtime we should never hit an backref for
log tree in extent tree.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: switch to per-transaction pinned extents
Nikolay Borisov [Mon, 20 Jan 2020 14:09:18 +0000 (16:09 +0200)]
btrfs: switch to per-transaction pinned extents

This commit flips the switch to start tracking/processing pinned extents
on a per-transaction basis. It mostly replaces all references from
btrfs_fs_info::(pinned_extents|freed_extents[]) to
btrfs_transaction::pinned_extents.

Two notable modifications that warrant explicit mention are changing
clean_pinned_extents to get a reference to the previously running
transaction. The other one is removal of call to
btrfs_destroy_pinned_extent since transactions are going to be cleaned
in btrfs_cleanup_one_transaction.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Factor out pinned extent clean up in btrfs_delete_unused_bgs
Nikolay Borisov [Mon, 20 Jan 2020 14:09:17 +0000 (16:09 +0200)]
btrfs: Factor out pinned extent clean up in btrfs_delete_unused_bgs

Next patch is going to refactor how pinned extents are tracked which
will necessitate changing this code. To ease that work and contain the
changes factor the code now in preparation, this will also help review.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Mark pinned log extents as excluded
Nikolay Borisov [Mon, 20 Jan 2020 14:09:16 +0000 (16:09 +0200)]
btrfs: Mark pinned log extents as excluded

In preparation to making pinned extents per-transaction ensure that log
such extents are always excluded from caching. To achieve this in
addition to marking them via btrfs_pin_extent_for_log_replay they also
need to be marked with btrfs_add_excluded_extent to prevent log tree
extent buffer being loaded by the free space caching thread. That's
required since log tree blocks are not recorded in the extent tree, hence
they always look free.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Pass transaction handle to write_pinned_extent_entries
Nikolay Borisov [Mon, 20 Jan 2020 14:09:15 +0000 (16:09 +0200)]
btrfs: Pass transaction handle to write_pinned_extent_entries

Preparation for refactoring pinned extents tracking.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Make pin_down_extent take transaction handle
Nikolay Borisov [Mon, 20 Jan 2020 14:09:14 +0000 (16:09 +0200)]
btrfs: Make pin_down_extent take transaction handle

All callers have a reference to a transaction handle so pass it to
pin_down_extent. This is the final step before switching pinned extent
tracking to a per-transaction basis.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Make btrfs_pin_extent_for_log_replay take transaction handle
Nikolay Borisov [Mon, 20 Jan 2020 14:09:13 +0000 (16:09 +0200)]
btrfs: Make btrfs_pin_extent_for_log_replay take transaction handle

Preparation for refactoring pinned extents tracking.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Make btrfs_pin_reserved_extent take transaction handle
Nikolay Borisov [Mon, 20 Jan 2020 14:09:12 +0000 (16:09 +0200)]
btrfs: Make btrfs_pin_reserved_extent take transaction handle

btrfs_pin_reserved_extent is now only called with a valid transaction so
exploit the fact to take a transaction. This is preparation for tracking
pinned extents on a per-transaction basis.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Call btrfs_pin_reserved_extent only during active transaction
Nikolay Borisov [Mon, 20 Jan 2020 14:09:11 +0000 (16:09 +0200)]
btrfs: Call btrfs_pin_reserved_extent only during active transaction

Calling btrfs_pin_reserved_extent makes sense only with a valid
transaction since pinned extents are processed from transaction commit
in btrfs_finish_extent_commit. In case of error it's sufficient to
adjust the reserved counter to account for log tree extents allocated in
the last transaction.

This commit moves btrfs_pin_reserved_extent to be called only with valid
transaction handle and otherwise uses the newly introduced
unaccount_log_buffer to adjust "reserved". If this is not done if a
failure occurs before transaction is committed WARN_ON are going to be
triggered on unmount. This was especially pronounced with generic/475
test.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Introduce unaccount_log_buffer
Nikolay Borisov [Mon, 20 Jan 2020 14:09:10 +0000 (16:09 +0200)]
btrfs: Introduce unaccount_log_buffer

This function correctly adjusts the reserved bytes occupied by a log
tree extent buffer. It will be used instead of calling
btrfs_pin_reserved_extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Make btrfs_pin_extent take trans handle
Nikolay Borisov [Mon, 20 Jan 2020 14:09:09 +0000 (16:09 +0200)]
btrfs: Make btrfs_pin_extent take trans handle

Preparation for switching pinned extent tracking to a per-transaction
basis.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: Perform pinned cleanup directly in btrfs_destroy_delayed_refs
Nikolay Borisov [Mon, 20 Jan 2020 14:09:08 +0000 (16:09 +0200)]
btrfs: Perform pinned cleanup directly in btrfs_destroy_delayed_refs

Having btrfs_destroy_delayed_refs call btrfs_pin_extent is problematic
for making pinned extents tracking per-transaction since
btrfs_trans_handle cannot be passed to btrfs_pin_extent in this context.
Additionally delayed refs heads pinned in btrfs_destroy_delayed_refs
are going to be handled very closely, in btrfs_destroy_pinned_extent.

To enable btrfs_pin_extent to take btrfs_trans_handle simply open code
it in btrfs_destroy_delayed_refs and call btrfs_error_unpin_extent_range
on the range. This enables us to do less work in
btrfs_destroy_pinned_extent and leaves btrfs_pin_extent being called in
contexts which have a valid btrfs_trans_handle.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sysfs, unify handler name of devinfo/missing
Anand Jain [Thu, 13 Feb 2020 08:40:53 +0000 (16:40 +0800)]
btrfs: sysfs, unify handler name of devinfo/missing

The devinfo attribute handlers were added in 668e48af7a94 ("btrfs:
sysfs, add devid/dev_state kobject and device attributes") and the name
should contain _devinfo_, there's one that does not conform, so unify it
with the rest.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sysfs, rename device_link add/remove functions
Anand Jain [Wed, 12 Feb 2020 09:28:13 +0000 (17:28 +0800)]
btrfs: sysfs, rename device_link add/remove functions

Since commit 668e48af7a94 ("btrfs: sysfs, add devid/dev_state kobject and
device attributes"), the functions btrfs_sysfs_add_device_link() and
btrfs_sysfs_rm_device_link() do more than just adding and removing the
device link as its name indicated. Rename them to be more specific
that's about the directory with the attirbutes

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sysfs, use btrfs_sysfs_remove_fsid to celanup errors in add_fsid
Anand Jain [Wed, 12 Feb 2020 09:28:12 +0000 (17:28 +0800)]
btrfs: sysfs, use btrfs_sysfs_remove_fsid to celanup errors in add_fsid

We have one simple function btrfs_sysfs_remove_fsid() to undo
btrfs_sysfs_add_fsid(), which also does proper checks before releasing
objects.

One difference, if btrfs_sysfs_remove_fsid is used that now we also call
kobject_del() which was missing before. This was tested (with kobject
debug turned on) and no change in behaviour was found.

This is a cleanup patch.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sink argument tree to __do_readpage
David Sterba [Wed, 5 Feb 2020 18:09:42 +0000 (19:09 +0100)]
btrfs: sink argument tree to __do_readpage

The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sink arugment tree to contiguous_readpages
David Sterba [Wed, 5 Feb 2020 18:09:40 +0000 (19:09 +0100)]
btrfs: sink arugment tree to contiguous_readpages

The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sink argument tree to __extent_read_full_page
David Sterba [Wed, 5 Feb 2020 18:09:37 +0000 (19:09 +0100)]
btrfs: sink argument tree to __extent_read_full_page

The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: sink argument tree to extent_read_full_page
David Sterba [Wed, 5 Feb 2020 18:09:35 +0000 (19:09 +0100)]
btrfs: sink argument tree to extent_read_full_page

The tree pointer can be safely read from the page's inode, use it and
drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: drop argument tree from btrfs_lock_and_flush_ordered_range
David Sterba [Wed, 5 Feb 2020 18:09:33 +0000 (19:09 +0100)]
btrfs: drop argument tree from btrfs_lock_and_flush_ordered_range

The tree pointer can be safely read from the inode so we can drop the
redundant argument from btrfs_lock_and_flush_ordered_range.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: add assertions for tree == inode->io_tree to extent IO helpers
David Sterba [Wed, 5 Feb 2020 18:09:30 +0000 (19:09 +0100)]
btrfs: add assertions for tree == inode->io_tree to extent IO helpers

Add assertions to all helpers that get tree as argument and verify that
it's the same that can be obtained from the inode or from its pages. In
followup patches the redundant arguments and assertions will be removed
one by one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: drop argument tree from submit_extent_page
David Sterba [Wed, 5 Feb 2020 18:09:28 +0000 (19:09 +0100)]
btrfs: drop argument tree from submit_extent_page

Now that we're sure the tree from argument is same as the one we can get
from the page's inode io_tree, drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: remove extent_page_data::tree
David Sterba [Wed, 5 Feb 2020 18:09:26 +0000 (19:09 +0100)]
btrfs: remove extent_page_data::tree

All functions that set up extent_page_data::tree set it to the inode
io_tree. That's passed down the callstack that accesses either the same
inode or its pages. In the end submit_extent_page can pull the tree out
of the page and we don't have to store it in the structure.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: add wrapper for transaction abort predicate
David Sterba [Wed, 5 Feb 2020 16:34:34 +0000 (17:34 +0100)]
btrfs: add wrapper for transaction abort predicate

The status of aborted transaction can change between calls and it needs
to be accessed by READ_ONCE. Add a helper that also wraps the unlikely
hint.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: move root node locking helpers to locking.c
David Sterba [Wed, 5 Feb 2020 16:26:51 +0000 (17:26 +0100)]
btrfs: move root node locking helpers to locking.c

The helpers are related to locking so move them there, update comments.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root
Josef Bacik [Fri, 24 Jan 2020 14:33:01 +0000 (09:33 -0500)]
btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root

We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: add a leak check for roots
Josef Bacik [Fri, 24 Jan 2020 14:33:00 +0000 (09:33 -0500)]
btrfs: add a leak check for roots

Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't freed up at free_fs_info time.

Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for
developers to verify they aren't breaking things.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: make the init of static elements in fs_info separate
Josef Bacik [Fri, 24 Jan 2020 14:32:59 +0000 (09:32 -0500)]
btrfs: make the init of static elements in fs_info separate

In adding things like eb leak checking and root leak checking there were
a lot of weird corner cases that come from the fact that

  1) We do not init the fs_info until we get to open_ctree time in the
     normal case and

  2) The test infrastructure half-init's the fs_info for things that it
     needs.

This makes it really annoying to make changes because you have to add
init in two different places, have special cases for testing fs_info's
that may not have certain things initialized, and cases for fs_info's
that didn't make it to open_ctree and thus are not fully set up.

Fix this by extracting out the non-allocating init of the fs info into
it's own public function and use that to make sure we're all getting
consistent views of an allocated fs_info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: move fs_info init work into it's own helper function
Josef Bacik [Fri, 24 Jan 2020 14:32:58 +0000 (09:32 -0500)]
btrfs: move fs_info init work into it's own helper function

open_ctree mixes initialization of fs stuff and fs_info stuff, which
makes it confusing when doing things like adding the root leak
detection.  Make a separate function that inits all the static
structures inside of the fs_info needed for the fs to operate, and then
call that before we start setting up the fs_info to be mounted.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: free more things in btrfs_free_fs_info
Josef Bacik [Fri, 24 Jan 2020 14:32:57 +0000 (09:32 -0500)]
btrfs: free more things in btrfs_free_fs_info

Things like the percpu_counters, the mapping_tree, and the csum hash can
all be freed at btrfs_free_fs_info time, since the helpers all check if
the structure has been initialized already.  This significantly cleans
up the error cases in open_ctree.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: push btrfs_grab_fs_root into btrfs_get_fs_root
Josef Bacik [Fri, 24 Jan 2020 14:32:56 +0000 (09:32 -0500)]
btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root

Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: use btrfs_put_fs_root to free roots always
Josef Bacik [Fri, 24 Jan 2020 14:32:55 +0000 (09:32 -0500)]
btrfs: use btrfs_put_fs_root to free roots always

If we are going to track leaked roots we need to free them all the same
way, so don't kfree() roots directly, use btrfs_put_fs_root.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in open_ctree
Josef Bacik [Fri, 24 Jan 2020 14:32:54 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in open_ctree

We lookup the fs_root and put it in our fs_info directly, we should hold
a ref on this root for the lifetime of the fs_info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: export and rename free_fs_info
Josef Bacik [Fri, 24 Jan 2020 14:32:53 +0000 (09:32 -0500)]
btrfs: export and rename free_fs_info

We're going to start freeing roots and doing other complicated things in
free_fs_info, so we need to move it to disk-io.c and export it in order
to use things lik btrfs_put_fs_root().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in btrfs_check_uuid_tree_entry
Josef Bacik [Fri, 24 Jan 2020 14:32:52 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in btrfs_check_uuid_tree_entry

We lookup the uuid of arbitrary subvolumes, hold a ref on the root while
we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
4 years agobtrfs: hold a ref on the root in btrfs_recover_log_trees
Josef Bacik [Fri, 24 Jan 2020 14:32:51 +0000 (09:32 -0500)]
btrfs: hold a ref on the root in btrfs_recover_log_trees

We replay the log into arbitrary fs roots, hold a ref on the root while
we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>