David Sterba [Wed, 4 Apr 2018 00:00:17 +0000 (02:00 +0200)]
btrfs: replace btrfs_set_lock_blocking_rw with appropriate helpers
We can use the right helper where the lock type is a fixed parameter.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 3 Apr 2018 23:52:31 +0000 (01:52 +0200)]
btrfs: split btrfs_clear_lock_blocking_rw to read and write helpers
There are many callers that hardcode the desired lock type so we can
avoid the switch and call them directly. Split the current function to
two. There are no remaining users of btrfs_clear_lock_blocking_rw so
it's removed. The call sites will be converted in followup patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 3 Apr 2018 23:43:05 +0000 (01:43 +0200)]
btrfs: split btrfs_set_lock_blocking_rw to read and write helpers
There are many callers that hardcode the desired lock type so we can
avoid the switch and call them directly. Split the current function to
two but leave a helper that still takes the variable lock type to make
current code compile. The call sites will be converted in followup
patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 23 Jan 2019 07:15:18 +0000 (15:15 +0800)]
btrfs: qgroup: Cleanup old subtree swap code
Since it's replaced by new delayed subtree swap code, remove the
original code.
The cleanup is small since most of its core function is still used by
delayed subtree swap trace.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 23 Jan 2019 07:15:17 +0000 (15:15 +0800)]
btrfs: qgroup: Use delayed subtree rescan for balance
Before this patch, qgroup code traces the whole subtree of subvolume and
reloc trees unconditionally.
This makes qgroup numbers consistent, but it could cause tons of
unnecessary extent tracing, which causes a lot of overhead.
However for subtree swap of balance, just swap both subtrees because
they contain the same contents and tree structure, so qgroup numbers
won't change.
It's the race window between subtree swap and transaction commit could
cause qgroup number change.
This patch will delay the qgroup subtree scan until COW happens for the
subtree root.
So if there is no other operations for the fs, balance won't cause extra
qgroup overhead. (best case scenario)
Depending on the workload, most of the subtree scan can still be
avoided.
Only for worst case scenario, it will fall back to old subtree swap
overhead. (scan all swapped subtrees)
[[Benchmark]]
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
And after file system population, there is no other activity, so it
should be the best case scenario.
| v4.20-rc1 | w/ patchset | diff
-----------------------------------------------------------------------
relocated extents | 22615 | 22457 | -0.1%
qgroup dirty extents | 163457 | 121606 | -25.6%
time (sys) | 22.884s | 18.842s | -17.6%
time (real) | 27.724s | 22.884s | -17.5%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 23 Jan 2019 07:15:16 +0000 (15:15 +0800)]
btrfs: qgroup: Introduce per-root swapped blocks infrastructure
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
reloc tree subvolume tree X
Root Root
/ \ / \
NA OB OA OB
/ | | \ / | | \
NC ND OE OF OC OD OE OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
reloc tree subvolume tree X
Root Root
/ \ / \
OA OB NA OB
/ | | \ / | | \
OC OD OE OF NC ND OE OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.
This will introduce 128 bytes overhead for each btrfs_root even qgroup
is not enabled. This is to reduce memory allocations and potential
failures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 23 Jan 2019 07:15:15 +0000 (15:15 +0800)]
btrfs: qgroup: Refactor btrfs_qgroup_trace_subtree_swap
Refactor btrfs_qgroup_trace_subtree_swap() into
qgroup_trace_subtree_swap(), which only needs two extent buffer and some
other bool to control the behavior.
This provides the basis for later delayed subtree scan work.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 23 Jan 2019 07:15:14 +0000 (15:15 +0800)]
btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots
Relocation code will drop btrfs_root::reloc_root as soon as
merge_reloc_root() finishes.
However later qgroup code will need to access btrfs_root::reloc_root
after merge_reloc_root() for delayed subtree rescan.
So alter the timming of resetting btrfs_root:::reloc_root, make it
happens after transaction commit.
With this patch, we will introduce a new btrfs_root::state,
BTRFS_ROOT_DEAD_RELOC_TREE, to info part of btrfs_root::reloc_tree user
that although btrfs_root::reloc_tree is still non-NULL, but still it's
not used any more.
The lifespan of btrfs_root::reloc tree will become:
Old behavior | New
------------------------------------------------------------------------
btrfs_init_reloc_root() --- | btrfs_init_reloc_root() ---
set reloc_root | | set reloc_root |
| | |
| | |
merge_reloc_root() | | merge_reloc_root() |
|- btrfs_update_reloc_root() --- | |- btrfs_update_reloc_root() -+-
clear btrfs_root::reloc_root | set ROOT_DEAD_RELOC_TREE |
| record root into dirty |
| roots rbtree |
| |
| reloc_block_group() Or |
| btrfs_recover_relocation() |
| | After transaction commit |
| |- clean_dirty_subvols() ---
| clear btrfs_root::reloc_root
During ROOT_DEAD_RELOC_TREE set lifespan, the only user of
btrfs_root::reloc_tree should be qgroup.
Since reloc root needs a longer life-span, this patch will also delay
btrfs_drop_snapshot() call.
Now btrfs_drop_snapshot() is called in clean_dirty_subvols().
This patch will increase the size of btrfs_root by 16 bytes.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 21 Nov 2018 19:05:42 +0000 (14:05 -0500)]
btrfs: call btrfs_create_pending_block_groups unconditionally
The first thing we do is loop through the list, this
if (!list_empty())
btrfs_create_pending_block_groups();
thing is just wasted space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 21 Nov 2018 19:05:40 +0000 (14:05 -0500)]
btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
Instead of open coding this stuff use the helper instead.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 21 Nov 2018 19:05:39 +0000 (14:05 -0500)]
btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
We have this open coded in btrfs_destroy_delayed_refs, use the helper
instead.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 3 Jan 2019 08:17:40 +0000 (16:17 +0800)]
btrfs: scrub: print messages when started or finished
The kernel log messages help debugging and audit, add them for scrub
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 17 Jan 2019 16:15:18 +0000 (17:15 +0100)]
btrfs: simplify workqueue name when allocating
The workqueue name is constructed from a format string but the prefix
does not need to be set by %s.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Sat, 19 Jan 2019 06:48:55 +0000 (14:48 +0800)]
btrfs: merge btrfs_find_device and find_device
Both btrfs_find_device() and find_device() does the same thing except
that the latter does not take the seed device onto account in the device
scanning context. We can merge them.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 4 Jan 2019 05:31:53 +0000 (13:31 +0800)]
btrfs: refactor btrfs_free_stale_devices() to get return value
Preparatory patch to add ioctl that allows to forget a device (ie.
reverse of scan).
Refactors btrfs_free_stale_devices() to obtain return status. As this
function can fail if it can't find the given path (returns -ENOENT) or
trying to delete a mounted device (returns -EBUSY).
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 17 Jan 2019 15:32:31 +0000 (23:32 +0800)]
btrfs: refactor btrfs_find_device() take fs_devices as argument
btrfs_find_device() accepts fs_info as an argument and retrieves
fs_devices from fs_info.
Instead use fs_devices, so that this function can be used in non-mount
(during device scanning) context as well.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 17 Jan 2019 15:32:29 +0000 (23:32 +0800)]
btrfs: cleanup btrfs_find_device_by_devspec()
btrfs_find_device_by_devspec() finds the device by @devid or by
@device_path. This patch makes code flow easy to read by open coding the
else part and renames devpath to device_path.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 17 Jan 2019 15:32:28 +0000 (23:32 +0800)]
btrfs: merge btrfs_find_device_missing_or_by_path() into parent
btrfs_find_device_missing_or_by_path() is relatively small function, and
its only parent btrfs_find_device_by_devspec() is small as well. Besides
there are a number of find_device functions. Merge
btrfs_find_device_missing_or_by_path() into its parent.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 17 Dec 2018 08:36:02 +0000 (10:36 +0200)]
btrfs: Remove not_found_em label from btrfs_get_extent
In order to avoid duplicating init code for em there is an additional
label, not_found_em, which is used to only set ->block_start. The only
case when it will be used is if the extent we are adding overlaps with
an existing extent. Make that case more obvious by:
1. Adding a comment hinting at what's going on
2. Assigning EXTENT_MAP_HOLE and directly going to insert.
No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 17 Dec 2018 09:49:00 +0000 (11:49 +0200)]
btrfs: Consolidate retval checking of core btree functions
Core btree functions in btrfs generally return 0 when an item is found,
1 in case the sought item cannot be found and <0 when an error happens.
Consolidate the checks for those conditions in one 'if () {} else if ()
{}' construct rather than 2 separate 'if () {}' statements. This
emphasizes that the handling code pertains to a single function. No
functional changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 17 Dec 2018 08:35:59 +0000 (10:35 +0200)]
btrfs: Rename found_type to extent_type in btrfs_get_extent
found_type really holds the type of extent and is guaranteed to to have
a value between [0, 2]. The only time it can contain anything different
is if btrfs_lookup_file_extent returned a positive value and the
previous item is different than an extent. Avoid this situation by
simply checking found_key.type rather than assigning the item type to
found_type intermittently. Also make the variable an u8 to reduce stack
usage. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 12 Dec 2018 18:05:56 +0000 (18:05 +0000)]
Btrfs: move duplicated nodatasum check into common reflink/dedupe helper
Move the check that verifies if both inodes have checksums disabled or
both have them enabled, from the clone and deduplication functions into
the new common helper btrfs_remap_file_range_prep().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 8 Jan 2019 14:53:46 +0000 (16:53 +0200)]
btrfs: Remove impossible condition from mergable_maps
We can never have extents marked as EXTENT_MAP_DELALLOC since this
value is only ever used by btrfs_get_extent_fiemap. In this case the
extent map is created by btrfs_get_extent_fiemap and is never really
published, this flag is used to return the corresponding userspace one.
Considering this, it's pointless having a check for EXTENT_MAP_DELALLOC
in mergable_maps. Just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 8 Jan 2019 11:42:01 +0000 (11:42 +0000)]
Btrfs: do not overwrite error return value in the balance ioctl
If the call to btrfs_balance() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_balance()
returned success or was canceled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 8 Jan 2019 11:42:09 +0000 (11:42 +0000)]
Btrfs: do not overwrite error return value in the device replace ioctl
If the call to btrfs_dev_replace_by_ioctl() failed we would overwrite the
error returned to user space with -EFAULT if the call to copy_to_user()
failed as well. Fix that by calling copy_to_user() only if no error
happened before or a device replace operation was canceled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 8 Jan 2019 11:43:18 +0000 (11:43 +0000)]
Btrfs: remove redundant check for swapfiles when reflinking
Checking if either of the inodes corresponds to a swapfile is already
performed by generic_remap_file_range_prep(), so we do not need to do
it in the btrfs clone and deduplication functions.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 3 Jan 2019 08:50:05 +0000 (10:50 +0200)]
btrfs: Refactor shrink_delalloc
Add a couple of comments regarding the logic flow in shrink_delalloc.
Then, cease using max_reclaim as a temporary variable when calculating
nr_pages. Finally give max_reclaim a more becoming name, which
uneqivocally shows at what this variable really holds. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 3 Jan 2019 08:50:03 +0000 (10:50 +0200)]
btrfs: Document logic regarding inode in async_cow_submit
Add a comment explaining when ->inode could be NULL and why we always
perform the ->async_delalloc_pages modification.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 3 Jan 2019 08:50:02 +0000 (10:50 +0200)]
btrfs: Remove WARN_ON in btrfs_alloc_delalloc_work
It can never trigger since before calling alloc_delalloc_work we have
called igrab in start_delalloc_inodes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 3 Jan 2019 08:50:01 +0000 (10:50 +0200)]
btrfs: Use ihold instead of igrab in cow_file_range_async
ihold is supposed to be used when the caller already has a reference to
the inode. In the case of cow_file_range_async this invariants holds,
since the 3 call chains leading to this function all take a reference:
btrfs_writepage <--- does igrab
extent_write_full_page
__extent_writepage
writepage_delalloc
btrfs_run_delalloc_range
cow_file_range_async
extent_write_cache_pages <--- does igrab
__extent_writepage (same callchain as above)
and
submit_compressed_extents <-- already called from async CoW submit path,
which would have done ihold.
extent_write_locked_range
__extent_writepage
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 3 Jan 2019 08:50:00 +0000 (10:50 +0200)]
btrfs: Remove isize local variable in compress_file_range
It's used only once so just inline the call to i_size_read. The
semantics regarding the inode size are not changed, the pages in the
range are locked and i_size cannot change between the time it was set
and used.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 3 Jan 2019 08:49:59 +0000 (10:49 +0200)]
btrfs: Remove inode argument from async_cow_submit
We already pass the async_cow struct that holds a reference to the
inode. Exploit this fact and remove the extra inode argument. No
functional changes.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
YueHaibing [Sat, 15 Dec 2018 06:31:07 +0000 (06:31 +0000)]
btrfs: remove set but not used variable 'num_pages'
Fixes gcc '-Wunused-but-set-variable' warning:
fs/btrfs/ioctl.c: In function 'btrfs_extent_same':
fs/btrfs/ioctl.c:3260:6: warning:
variable 'num_pages' set but not used [-Wunused-but-set-variable]
It not used any more since commit
9ee8234e6220 ("Btrfs: use
generic_remap_file_range_prep() for cloning and deduplication")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 12 Dec 2018 07:42:34 +0000 (09:42 +0200)]
btrfs: Remove redundant assignment in btrfs_get_extent_fiemap
hole_len is only used if the hole falls within the requested range. Make
that explicitly clear by only assigning in the corresponding branch.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 12 Dec 2018 07:42:33 +0000 (09:42 +0200)]
btrfs: Refactor btrfs_get_extent_fiemap
Make btrfs_get_extent_fiemap a bit more friendly. First step is to
rename the closely related, yet arbitrary named
range_start/found_end/found variables. They define the delalloc range
that is found in case a real extent wasn't found. Subsequently remove
an unnecessary check for hole_em since it's guaranteed to be set i.e the
check is always true. Top it off by giving all comments a refresh.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformatted a few more comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 12 Dec 2018 07:42:32 +0000 (09:42 +0200)]
btrfs: Remove unused arguments from btrfs_get_extent_fiemap
This function is a simple wrapper over btrfs_get_extent that returns
either:
a) A real extent in the passed range or
b) Adjusted extent based on whether delalloc bytes are found backing up
a hole.
To support these semantics it doesn't need the page/pg_offset/create
arguments which are passed to btrfs_get_extent in case an extent is to
be created. So simplify the function by removing the unused arguments.
No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 13 Dec 2018 21:16:56 +0000 (21:16 +0000)]
Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl
We are holding a transaction handle when setting an acl, therefore we can
not allocate the xattr value buffer using GFP_KERNEL, as we could deadlock
if reclaim is triggered by the allocation, therefore setup a nofs context.
Fixes:
39a27ec1004e8 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 13 Dec 2018 21:16:45 +0000 (21:16 +0000)]
Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()
We are holding a transaction handle when creating a tree, therefore we can
not allocate the root using GFP_KERNEL, as we could deadlock if reclaim is
triggered by the allocation, therefore setup a nofs context.
Fixes:
74e4d82757f74 ("btrfs: let callers of btrfs_alloc_root pass gfp flags")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 14 Dec 2018 19:45:22 +0000 (19:45 +0000)]
Btrfs: do not overwrite error return value in the get device stats ioctl
If the call to btrfs_get_dev_stats() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_get_dev_stats()
returned success.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 14 Dec 2018 19:45:13 +0000 (19:45 +0000)]
Btrfs: do not overwrite error return value in scrub progress ioctl
If the call to btrfs_scrub_progress() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_scrub_progress()
returned success.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 14 Dec 2018 19:50:17 +0000 (19:50 +0000)]
Btrfs: do not overwrite scrub error with fault error in scrub ioctl
If scrub returned an error and then the copy_to_user() call did not
succeed, we would overwrite the error returned by scrub with -EFAULT.
Fix that by calling copy_to_user() only if btrfs_scrub_dev() returned
success.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 19 Dec 2018 07:50:20 +0000 (09:50 +0200)]
btrfs: Make first argument of btrfs_run_delalloc_range directly an inode
Since this function is no longer a callback there is no need to have
its first argument obfuscated with a void *. Change it directly to a
pointer to an inode. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Julia Lawall [Sun, 23 Dec 2018 08:57:06 +0000 (09:57 +0100)]
Btrfs: drop useless LIST_HEAD in merge_reloc_root
Drop LIST_HEAD where the variable it declares is never used.
The uses were removed in
3fd0a5585eb9 ("Btrfs: Metadata ENOSPC
handling for balance"), but not the declaration.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Mon, 25 Feb 2019 00:46:45 +0000 (16:46 -0800)]
Linux 5.0-rc8
Linus Torvalds [Sun, 24 Feb 2019 17:47:07 +0000 (09:47 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Bug fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: MMU: record maximum physical address width in kvm_mmu_extended_role
kvm: x86: Return LA57 feature based on hardware capability
x86/kvm/mmu: fix switch between root and guest MMUs
s390: vsie: Use effective CRYCBD.31 to check CRYCBD validity
Linus Torvalds [Sun, 24 Feb 2019 17:28:26 +0000 (09:28 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"Hopefully the last pull request for this release. Fingers crossed:
1) Only refcount ESP stats on full sockets, from Martin Willi.
2) Missing barriers in AF_UNIX, from Al Viro.
3) RCU protection fixes in ipv6 route code, from Paolo Abeni.
4) Avoid false positives in untrusted GSO validation, from Willem de
Bruijn.
5) Forwarded mesh packets in mac80211 need more tailroom allocated,
from Felix Fietkau.
6) Use operstate consistently for linkup in team driver, from George
Wilkie.
7) ThunderX bug fixes from Vadim Lomovtsev. Mostly races between VF
and PF code paths.
8) Purge ipv6 exceptions during netdevice removal, from Paolo Abeni.
9) nfp eBPF code gen fixes from Jiong Wang.
10) bnxt_en firmware timeout fix from Michael Chan.
11) Use after free in udp/udpv6 error handlers, from Paolo Abeni.
12) Fix a race in x25_bind triggerable by syzbot, from Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
net: phy: realtek: Dummy IRQ calls for RTL8366RB
tcp: repaired skbs must init their tso_segs
net/x25: fix a race in x25_bind()
net: dsa: Remove documentation for port_fdb_prepare
Revert "bridge: do not add port to router list when receives query with source 0.0.0.0"
selftests: fib_tests: sleep after changing carrier. again.
net: set static variable an initial value in atl2_probe()
net: phy: marvell10g: Fix Multi-G advertisement to only advertise 10G
bpf, doc: add bpf list as secondary entry to maintainers file
udp: fix possible user after free in error handler
udpv6: fix possible user after free in error handler
fou6: fix proto error handler argument type
udpv6: add the required annotation to mib type
mdio_bus: Fix use-after-free on device_register fails
net: Set rtm_table to RT_TABLE_COMPAT for ipv6 for tables > 255
bnxt_en: Wait longer for the firmware message response to complete.
bnxt_en: Fix typo in firmware message timeout logic.
nfp: bpf: fix ALU32 high bits clearance bug
nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
Documentation: networking: switchdev: Update port parent ID section
...
Linus Walleij [Sun, 24 Feb 2019 00:11:15 +0000 (01:11 +0100)]
net: phy: realtek: Dummy IRQ calls for RTL8366RB
This fixes a regression introduced by
commit
0d2e778e38e0ddffab4bb2b0e9ed2ad5165c4bf7
"net: phy: replace PHY_HAS_INTERRUPT with a check for
config_intr and ack_interrupt".
This assumes that a PHY cannot trigger interrupt unless
it has .config_intr() or .ack_interrupt() implemented.
A later patch makes the code assume both need to be
implemented for interrupts to be present.
But this PHY (which is inside a DSA) will happily
fire interrupts without either callback.
Implement dummy callbacks for .config_intr() and
.ack_interrupt() in the phy header to fix this.
Tested on the RTL8366RB on D-Link DIR-685.
Fixes:
0d2e778e38e0 ("net: phy: replace PHY_HAS_INTERRUPT with a check for config_intr and ack_interrupt")
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 23 Feb 2019 23:51:51 +0000 (15:51 -0800)]
tcp: repaired skbs must init their tso_segs
syzbot reported a WARN_ON(!tcp_skb_pcount(skb))
in tcp_send_loss_probe() [1]
This was caused by TCP_REPAIR sent skbs that inadvertenly
were missing a call to tcp_init_tso_segs()
[1]
WARNING: CPU: 1 PID: 0 at net/ipv4/tcp_output.c:2534 tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-rc7+ #77
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
panic+0x2cb/0x65c kernel/panic.c:214
__warn.cold+0x20/0x45 kernel/panic.c:571
report_bug+0x263/0x2b0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
fixup_bug arch/x86/kernel/traps.c:173 [inline]
do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:290
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
RIP: 0010:tcp_send_loss_probe+0x771/0x8a0 net/ipv4/tcp_output.c:2534
Code: 88 fc ff ff 4c 89 ef e8 ed 75 c8 fb e9 c8 fc ff ff e8 43 76 c8 fb e9 63 fd ff ff e8 d9 75 c8 fb e9 94 f9 ff ff e8 bf 03 91 fb <0f> 0b e9 7d fa ff ff e8 b3 03 91 fb 0f b6 1d 37 43 7a 03 31 ff 89
RSP: 0018:
ffff8880ae907c60 EFLAGS:
00010206
RAX:
ffff8880a989c340 RBX:
0000000000000000 RCX:
ffffffff85dedbdb
RDX:
0000000000000100 RSI:
ffffffff85dee0b1 RDI:
0000000000000005
RBP:
ffff8880ae907c90 R08:
ffff8880a989c340 R09:
ffffed10147d1ae1
R10:
ffffed10147d1ae0 R11:
ffff8880a3e8d703 R12:
ffff888091b90040
R13:
ffff8880a3e8d540 R14:
0000000000008000 R15:
ffff888091b90860
tcp_write_timer_handler+0x5c0/0x8a0 net/ipv4/tcp_timer.c:583
tcp_write_timer+0x10e/0x1d0 net/ipv4/tcp_timer.c:607
call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
expire_timers kernel/time/timer.c:1362 [inline]
__run_timers kernel/time/timer.c:1681 [inline]
__run_timers kernel/time/timer.c:1649 [inline]
run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
__do_softirq+0x266/0x95a kernel/softirq.c:292
invoke_softirq kernel/softirq.c:373 [inline]
irq_exit+0x180/0x1d0 kernel/softirq.c:413
exiting_irq arch/x86/include/asm/apic.h:536 [inline]
smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807
</IRQ>
RIP: 0010:native_safe_halt+0x2/0x10 arch/x86/include/asm/irqflags.h:58
Code: ff ff ff 48 89 c7 48 89 45 d8 e8 59 0c a1 fa 48 8b 45 d8 e9 ce fe ff ff 48 89 df e8 48 0c a1 fa eb 82 90 90 90 90 90 90 fb f4 <c3> 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 90 90 90 90 90 90
RSP: 0018:
ffff8880a98afd78 EFLAGS:
00000286 ORIG_RAX:
ffffffffffffff13
RAX:
1ffffffff1125061 RBX:
ffff8880a989c340 RCX:
0000000000000000
RDX:
dffffc0000000000 RSI:
0000000000000001 RDI:
ffff8880a989cbbc
RBP:
ffff8880a98afda8 R08:
ffff8880a989c340 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000000 R12:
0000000000000001
R13:
ffffffff889282f8 R14:
0000000000000001 R15:
0000000000000000
arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:555
default_idle_call+0x36/0x90 kernel/sched/idle.c:93
cpuidle_idle_call kernel/sched/idle.c:153 [inline]
do_idle+0x386/0x570 kernel/sched/idle.c:262
cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:353
start_secondary+0x404/0x5c0 arch/x86/kernel/smpboot.c:271
secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
Kernel Offset: disabled
Rebooting in 86400 seconds..
Fixes:
79861919b889 ("tcp: fix TCP_REPAIR xmit queue setup")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 23 Feb 2019 21:24:59 +0000 (13:24 -0800)]
net/x25: fix a race in x25_bind()
syzbot was able to trigger another soft lockup [1]
I first thought it was the O(N^2) issue I mentioned in my
prior fix (
f657d22ee1f "net/x25: do not hold the cpu
too long in x25_new_lci()"), but I eventually found
that x25_bind() was not checking SOCK_ZAPPED state under
socket lock protection.
This means that multiple threads can end up calling
x25_insert_socket() for the same socket, and corrupt x25_list
[1]
watchdog: BUG: soft lockup - CPU#0 stuck for 123s! [syz-executor.2:10492]
Modules linked in:
irq event stamp: 27515
hardirqs last enabled at (27514): [<
ffffffff81006673>] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (27515): [<
ffffffff8100668f>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last enabled at (32): [<
ffffffff8632ee73>] x25_get_neigh+0xa3/0xd0 net/x25/x25_link.c:336
softirqs last disabled at (34): [<
ffffffff86324bc3>] x25_find_socket+0x23/0x140 net/x25/af_x25.c:341
CPU: 0 PID: 10492 Comm: syz-executor.2 Not tainted 5.0.0-rc7+ #88
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__sanitizer_cov_trace_pc+0x4/0x50 kernel/kcov.c:97
Code: f4 ff ff ff e8 11 9f ea ff 48 c7 05 12 fb e5 08 00 00 00 00 e9 c8 e9 ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 <48> 8b 75 08 65 48 8b 04 25 40 ee 01 00 65 8b 15 38 0c 92 7e 81 e2
RSP: 0018:
ffff88806e94fc48 EFLAGS:
00000286 ORIG_RAX:
ffffffffffffff13
RAX:
1ffff1100d84dac5 RBX:
0000000000000001 RCX:
ffffc90006197000
RDX:
0000000000040000 RSI:
ffffffff86324bf3 RDI:
ffff88806c26d628
RBP:
ffff88806e94fc48 R08:
ffff88806c1c6500 R09:
fffffbfff1282561
R10:
fffffbfff1282560 R11:
ffffffff89412b03 R12:
ffff88806c26d628
R13:
ffff888090455200 R14:
dffffc0000000000 R15:
0000000000000000
FS:
00007f3a107e4700(0000) GS:
ffff8880ae800000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007f3a107e3db8 CR3:
00000000a5544000 CR4:
00000000001406f0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
__x25_find_socket net/x25/af_x25.c:327 [inline]
x25_find_socket+0x7d/0x140 net/x25/af_x25.c:342
x25_new_lci net/x25/af_x25.c:355 [inline]
x25_connect+0x380/0xde0 net/x25/af_x25.c:784
__sys_connect+0x266/0x330 net/socket.c:1662
__do_sys_connect net/socket.c:1673 [inline]
__se_sys_connect net/socket.c:1670 [inline]
__x64_sys_connect+0x73/0xb0 net/socket.c:1670
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e29
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:
00007f3a107e3c78 EFLAGS:
00000246 ORIG_RAX:
000000000000002a
RAX:
ffffffffffffffda RBX:
0000000000000003 RCX:
0000000000457e29
RDX:
0000000000000012 RSI:
0000000020000200 RDI:
0000000000000005
RBP:
000000000073c040 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000246 R12:
00007f3a107e46d4
R13:
00000000004be362 R14:
00000000004ceb98 R15:
00000000ffffffff
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 10493 Comm: syz-executor.3 Not tainted 5.0.0-rc7+ #88
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:193 [inline]
RIP: 0010:queued_write_lock_slowpath+0x143/0x290 kernel/locking/qrwlock.c:86
Code: 4c 8d 2c 01 41 83 c7 03 41 0f b6 45 00 41 38 c7 7c 08 84 c0 0f 85 0c 01 00 00 8b 03 3d 00 01 00 00 74 1a f3 90 41 0f b6 55 00 <41> 38 d7 7c eb 84 d2 74 e7 48 89 df e8 cc aa 4e 00 eb dd be 04 00
RSP: 0018:
ffff888085c47bd8 EFLAGS:
00000206
RAX:
0000000000000300 RBX:
ffffffff89412b00 RCX:
1ffffffff1282560
RDX:
0000000000000000 RSI:
0000000000000004 RDI:
ffffffff89412b00
RBP:
ffff888085c47c70 R08:
1ffffffff1282560 R09:
fffffbfff1282561
R10:
fffffbfff1282560 R11:
ffffffff89412b03 R12:
00000000000000ff
R13:
fffffbfff1282560 R14:
1ffff11010b88f7d R15:
0000000000000003
FS:
00007fdd04086700(0000) GS:
ffff8880ae900000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007fdd04064db8 CR3:
0000000090be0000 CR4:
00000000001406e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
queued_write_lock include/asm-generic/qrwlock.h:104 [inline]
do_raw_write_lock+0x1d6/0x290 kernel/locking/spinlock_debug.c:203
__raw_write_lock_bh include/linux/rwlock_api_smp.h:204 [inline]
_raw_write_lock_bh+0x3b/0x50 kernel/locking/spinlock.c:312
x25_insert_socket+0x21/0xe0 net/x25/af_x25.c:267
x25_bind+0x273/0x340 net/x25/af_x25.c:703
__sys_bind+0x23f/0x290 net/socket.c:1481
__do_sys_bind net/socket.c:1492 [inline]
__se_sys_bind net/socket.c:1490 [inline]
__x64_sys_bind+0x73/0xb0 net/socket.c:1490
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e29
Fixes:
90c27297a9bf ("X.25 remove bkl in bind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: andrew hendry <andrew.hendry@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hauke Mehrtens [Fri, 22 Feb 2019 19:07:45 +0000 (20:07 +0100)]
net: dsa: Remove documentation for port_fdb_prepare
This callback was removed some time ago, also remove the documentation.
Fixes:
1b6dd556c304 ("net: dsa: Remove prepare phase for FDB")
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Fri, 22 Feb 2019 13:22:32 +0000 (21:22 +0800)]
Revert "bridge: do not add port to router list when receives query with source 0.0.0.0"
This reverts commit
5a2de63fd1a5 ("bridge: do not add port to router list
when receives query with source 0.0.0.0") and commit
0fe5119e267f ("net:
bridge: remove ipv6 zero address check in mcast queries")
The reason is RFC 4541 is not a standard but suggestive. Currently we
will elect 0.0.0.0 as Querier if there is no ip address configured on
bridge. If we do not add the port which recives query with source
0.0.0.0 to router list, the IGMP reports will not be about to forward
to Querier, IGMP data will also not be able to forward to dest.
As Nikolay suggested, revert this change first and add a boolopt api
to disable none-zero election in future if needed.
Reported-by: Linus Lüssing <linus.luessing@c0d3.blue>
Reported-by: Sebastian Gottschall <s.gottschall@newmedia-net.de>
Fixes:
5a2de63fd1a5 ("bridge: do not add port to router list when receives query with source 0.0.0.0")
Fixes:
0fe5119e267f ("net: bridge: remove ipv6 zero address check in mcast queries")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thadeu Lima de Souza Cascardo [Fri, 22 Feb 2019 10:27:41 +0000 (07:27 -0300)]
selftests: fib_tests: sleep after changing carrier. again.
Just like commit
e2ba732a1681 ("selftests: fib_tests: sleep after
changing carrier"), wait one second to allow linkwatch to propagate the
carrier change to the stack.
There are two sets of carrier tests. The first slept after the carrier
was set to off, and when the second set ran, it was likely that the
linkwatch would be able to run again without much delay, reducing the
likelihood of a race. However, if you run 'fib_tests.sh -t carrier' on a
loop, you will quickly notice the failures.
Sleeping on the second set of tests make the failures go away.
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mao Wenan [Fri, 22 Feb 2019 06:57:23 +0000 (14:57 +0800)]
net: set static variable an initial value in atl2_probe()
cards_found is a static variable, but when it enters atl2_probe(),
cards_found is set to zero, the value is not consistent with last probe,
so next behavior is not our expect.
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Thu, 21 Feb 2019 16:54:11 +0000 (17:54 +0100)]
net: phy: marvell10g: Fix Multi-G advertisement to only advertise 10G
Some Marvell Alaska PHYs support 2.5G, 5G and 10G BaseT links. Their
default behaviour is to advertise all of these modes, but at the moment,
only 10GBaseT is supported. To prevent link partners from establishing
link at that speed, clear these modes upon configuring aneg parameters.
Fixes:
20b2af32ff3f ("net: phy: add Marvell Alaska X 88X3310 10Gigabit PHY support")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reported-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 23 Feb 2019 19:13:50 +0000 (11:13 -0800)]
Merge tag 'powerpc-5.0-6' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"One fix for an oops when using SRIOV, introduced by the recent changes
to support compound IOMMU groups.
Thanks to Alexey Kardashevskiy"
* tag 'powerpc-5.0-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/powernv/sriov: Register IOMMU groups for VFs
Linus Torvalds [Sat, 23 Feb 2019 17:48:01 +0000 (09:48 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four small fixes: three in drivers and one in the core.
The core fix is also minor in scope since the bug it fixes is only
known to affect systems using SCSI reservations. Of the driver bugs,
the libsas one is the most major because it can lead to multiple disks
on the same expander not being exposed"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: core: reset host byte in DID_NEXUS_FAILURE case
scsi: libsas: Fix rphy phy_identifier for PHYs with end devices attached
scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation
scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
David S. Miller [Sat, 23 Feb 2019 04:45:38 +0000 (20:45 -0800)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2019-02-23
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a bug in BPF's LPM deletion logic to match correct prefix
length, from Alban.
2) Fix AF_XDP teardown by not destroying umem prematurely as it
is still needed till all outstanding skbs are freed, from Björn.
3) Fix unkillable BPF_PROG_TEST_RUN under preempt kernel by checking
signal_pending() outside need_resched() condition which is never
triggered there, from Stanislav.
4) Fix two nfp JIT bugs, one in code emission for K-based xor, and
another one to explicitly clear upper bits in alu32, from Jiong.
5) Add bpf list address to maintainers file, from Daniel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 23 Feb 2019 01:48:50 +0000 (17:48 -0800)]
Merge branch 'fixes-v5.0-rc7' of git://git./linux/kernel/git/jmorris/linux-security
Pull keys fixes from James Morris:
"Two fixes from Eric Biggers"
* 'fixes-v5.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
KEYS: always initialize keyring_index_key::desc_len
KEYS: user: Align the payload buffer
Linus Torvalds [Sat, 23 Feb 2019 01:46:30 +0000 (17:46 -0800)]
Merge tag 'pm-5.0' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a regression in the PM-runtime framework introduced by the
recent switch-over of it to using hrtimers and a use-after-free
introduced by one of the recent changes in the scmi-cpufreq driver.
Specifics:
- Use hrtimer_try_to_cancel() instead of hrtimer_cancel() in the
PM-runtime framework to avoid a possible timer-related deadlock
introduced recently (Vincent Guittot).
- Reorder the scmi-cpufreq driver code to avoid accessing memory that
has just been freed (Yangtao Li)"
* tag 'pm-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM-runtime: Fix deadlock when canceling hrtimer
cpufreq: scmi: Fix use-after-free in scmi_cpufreq_exit()
Linus Torvalds [Sat, 23 Feb 2019 00:48:37 +0000 (16:48 -0800)]
Merge tag 'armsoc-fixes' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Only a handful of device tree fixes, all simple enough:
NVIDIA Tegra:
- Fix a regression for booting on chromebooks
TI OMAP:
- Two fixes PHY mode on am335x reference boards
Marvell mvebu:
- A regression fix for Armada XP NAND flash controllers
- An incorrect reset signal on the clearfog board"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
ARM: tegra: Restore DT ABI on Tegra124 Chromebooks
ARM: dts: am335x-evm: Fix PHY mode for ethernet
ARM: dts: am335x-evmsk: Fix PHY mode for ethernet
arm64: dts: clearfog-gt-8k: fix SGMII PHY reset signal
ARM: dts: armada-xp: fix Armada XP boards NAND description
Linus Torvalds [Sat, 23 Feb 2019 00:31:26 +0000 (16:31 -0800)]
Merge tag 'arc-5.0-final' of git://git./linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
"Fixes for ARC for 5.0, bunch of those are stable fodder anyways so
sooner the better.
- Fix memcpy to prevent prefetchw beyond end of buffer [Eugeniy]
- Enable unaligned access early to prevent exceptions given newer gcc
code gen [Eugeniy]
- Tighten up uboot arg checking to prevent false negatives and also
allow both jtag and bootloading to coexist w/o config option as
needed by kernelCi folks [Eugeniy]
- Set slab alignment to 8 for ARC to avoid the atomic64_t unalign
[Alexey]
- Disable regfile auto save on interrupts on HSDK platform due to a
silicon issue [Vineet]
- Avoid HS38x boot printing crash by not reading HS48x only reg
[Vineet]"
* tag 'arc-5.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARCv2: don't assume core 0x54 has dual issue
ARC: define ARCH_SLAB_MINALIGN = 8
ARC: enable uboot support unconditionally
ARC: U-boot: check arguments paranoidly
ARCv2: support manual regfile save on interrupts
ARC: uacces: remove lp_start, lp_end from clobber list
ARC: fix actionpoints configuration detection
ARCv2: lib: memcpy: fix doing prefetchw outside of buffer
ARCv2: Enable unaligned access in early ASM code
Daniel Borkmann [Fri, 22 Feb 2019 23:03:44 +0000 (00:03 +0100)]
bpf, doc: add bpf list as secondary entry to maintainers file
We recently created a bpf@vger.kernel.org list (https://lore.kernel.org/bpf/)
for BPF related discussions, originally in context of BPF track at LSF/MM
for topic discussions. It's *optional* but *desirable* to keep it in Cc for
BPF related kernel/loader/llvm/tooling threads, meaning also infrastructure
like llvm that sits on top of kernel but is crucial to BPF. In any case,
netdev with it's bpf delegate is *as-is* today primary list for patches, so
nothing changes in the workflow. Main purpose is to have some more awareness
for the bpf@vger.kernel.org list that folks can Cc for BPF specific topics.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Linus Torvalds [Sat, 23 Feb 2019 00:12:01 +0000 (16:12 -0800)]
Merge branch 'parisc-5.0-1' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"Fix ptrace syscall number modification which has been broken since
kernel v4.5 and provide alternative email addresses for the remaining
users of the retired parisc-linux.org email domain"
* 'parisc-5.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
CREDITS/MAINTAINERS: Retire parisc-linux.org email domain
parisc: Fix ptrace syscall number modification
Linus Torvalds [Sat, 23 Feb 2019 00:09:55 +0000 (16:09 -0800)]
Merge tag 'kbuild-fixes-v5.0-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild fixes from Masahiro Yamada:
- fix scripts/kallsyms.c to correctly check too long symbol names
- fix sh build error for the combination of CONFIG_OF_EARLY_FLATTREE=y
and CONFIG_USE_BUILTIN_DTB=n
* tag 'kbuild-fixes-v5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
sh: fix build error for invisible CONFIG_BUILTIN_DTB_SOURCE
kallsyms: Handle too long symbols in kallsyms.c
David S. Miller [Sat, 23 Feb 2019 00:05:12 +0000 (16:05 -0800)]
Merge branch 'udp-a-few-fixes'
Paolo Abeni says:
====================
udp: a few fixes
This series includes some UDP-related fixlet. All this stuff has been
pointed out by the sparse tool. The first two patches are just annotation
related, while the last 2 cover some very unlikely races.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Thu, 21 Feb 2019 16:44:00 +0000 (17:44 +0100)]
udp: fix possible user after free in error handler
Similar to the previous commit, this addresses the same issue for
ipv4: use a single fetch operation and use the correct rcu
annotation.
Fixes:
e7cc082455cb ("udp: Support for error handlers of tunnels with arbitrary destination port")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Thu, 21 Feb 2019 16:43:59 +0000 (17:43 +0100)]
udpv6: fix possible user after free in error handler
Before derefencing the encap pointer, commit
e7cc082455cb ("udp: Support
for error handlers of tunnels with arbitrary destination port") checks
for a NULL value, but the two fetch operation can race with removal.
Fix the above using a single access.
Also fix a couple of type annotations, to make sparse happy.
Fixes:
e7cc082455cb ("udp: Support for error handlers of tunnels with arbitrary destination port")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Thu, 21 Feb 2019 16:43:58 +0000 (17:43 +0100)]
fou6: fix proto error handler argument type
Last argument of gue6_err_proto_handler() has a wrong type annotation,
fix it and make sparse happy again.
Fixes:
b8a51b38e4d4 ("fou, fou6: ICMP error handlers for FoU and GUE")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Thu, 21 Feb 2019 16:43:57 +0000 (17:43 +0100)]
udpv6: add the required annotation to mib type
In commit
029a37434880 ("udp6: cleanup stats accounting in recvmsg()")
I forgot to add the percpu annotation for the mib pointer. Add it, and
make sparse happy.
Fixes:
029a37434880 ("udp6: cleanup stats accounting in recvmsg()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Thu, 21 Feb 2019 14:42:01 +0000 (22:42 +0800)]
mdio_bus: Fix use-after-free on device_register fails
KASAN has found use-after-free in fixed_mdio_bus_init,
commit
0c692d07842a ("drivers/net/phy/mdio_bus.c: call
put_device on device_register() failure") call put_device()
while device_register() fails,give up the last reference
to the device and allow mdiobus_release to be executed
,kfreeing the bus. However in most drives, mdiobus_free
be called to free the bus while mdiobus_register fails.
use-after-free occurs when access bus again, this patch
revert it to let mdiobus_free free the bus.
KASAN report details as below:
BUG: KASAN: use-after-free in mdiobus_free+0x85/0x90 drivers/net/phy/mdio_bus.c:482
Read of size 4 at addr
ffff8881dc824d78 by task syz-executor.0/3524
CPU: 1 PID: 3524 Comm: syz-executor.0 Not tainted 5.0.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xfa/0x1ce lib/dump_stack.c:113
print_address_description+0x65/0x270 mm/kasan/report.c:187
kasan_report+0x149/0x18d mm/kasan/report.c:317
mdiobus_free+0x85/0x90 drivers/net/phy/mdio_bus.c:482
fixed_mdio_bus_init+0x283/0x1000 [fixed_phy]
? 0xffffffffc0e40000
? 0xffffffffc0e40000
? 0xffffffffc0e40000
do_one_initcall+0xfa/0x5ca init/main.c:887
do_init_module+0x204/0x5f6 kernel/module.c:3460
load_module+0x66b2/0x8570 kernel/module.c:3808
__do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462e99
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:
00007f6215c19c58 EFLAGS:
00000246 ORIG_RAX:
0000000000000139
RAX:
ffffffffffffffda RBX:
000000000073bf00 RCX:
0000000000462e99
RDX:
0000000000000000 RSI:
0000000020000080 RDI:
0000000000000003
RBP:
00007f6215c19c70 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000246 R12:
00007f6215c1a6bc
R13:
00000000004bcefb R14:
00000000006f7030 R15:
0000000000000004
Allocated by task 3524:
set_track mm/kasan/common.c:85 [inline]
__kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:496
kmalloc include/linux/slab.h:545 [inline]
kzalloc include/linux/slab.h:740 [inline]
mdiobus_alloc_size+0x54/0x1b0 drivers/net/phy/mdio_bus.c:143
fixed_mdio_bus_init+0x163/0x1000 [fixed_phy]
do_one_initcall+0xfa/0x5ca init/main.c:887
do_init_module+0x204/0x5f6 kernel/module.c:3460
load_module+0x66b2/0x8570 kernel/module.c:3808
__do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 3524:
set_track mm/kasan/common.c:85 [inline]
__kasan_slab_free+0x130/0x180 mm/kasan/common.c:458
slab_free_hook mm/slub.c:1409 [inline]
slab_free_freelist_hook mm/slub.c:1436 [inline]
slab_free mm/slub.c:2986 [inline]
kfree+0xe1/0x270 mm/slub.c:3938
device_release+0x78/0x200 drivers/base/core.c:919
kobject_cleanup lib/kobject.c:662 [inline]
kobject_release lib/kobject.c:691 [inline]
kref_put include/linux/kref.h:67 [inline]
kobject_put+0x146/0x240 lib/kobject.c:708
put_device+0x1c/0x30 drivers/base/core.c:2060
__mdiobus_register+0x483/0x560 drivers/net/phy/mdio_bus.c:382
fixed_mdio_bus_init+0x26b/0x1000 [fixed_phy]
do_one_initcall+0xfa/0x5ca init/main.c:887
do_init_module+0x204/0x5f6 kernel/module.c:3460
load_module+0x66b2/0x8570 kernel/module.c:3808
__do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at
ffff8881dc824c80
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 248 bytes inside of
2048-byte region [
ffff8881dc824c80,
ffff8881dc825480)
The buggy address belongs to the page:
page:
ffffea0007720800 count:1 mapcount:0 mapping:
ffff8881f6c02800 index:0x0 compound_mapcount: 0
flags: 0x2fffc0000010200(slab|head)
raw:
02fffc0000010200 0000000000000000 0000000500000001 ffff8881f6c02800
raw:
0000000000000000 00000000800f000f 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8881dc824c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8881dc824c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
ffff8881dc824d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8881dc824d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8881dc824e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes:
0c692d07842a ("drivers/net/phy/mdio_bus.c: call put_device on device_register() failure")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kalash Nainwal [Thu, 21 Feb 2019 00:23:04 +0000 (16:23 -0800)]
net: Set rtm_table to RT_TABLE_COMPAT for ipv6 for tables > 255
Set rtm_table to RT_TABLE_COMPAT for ipv6 for tables > 255 to
keep legacy software happy. This is similar to what was done for
ipv4 in commit
709772e6e065 ("net: Fix routing tables with
id > 255 for legacy software").
Signed-off-by: Kalash Nainwal <kalash@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 22 Feb 2019 23:16:56 +0000 (15:16 -0800)]
Merge branch 'bnxt_en-firmware-message-delay-fixes'
Michael Chan says:
====================
bnxt_en: firmware message delay fixes.
We were seeing some intermittent firmware message timeouts in our lab and
these 2 small patches fix them. Please apply to stable as well. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 21 Feb 2019 00:07:32 +0000 (19:07 -0500)]
bnxt_en: Wait longer for the firmware message response to complete.
The code waits up to 20 usec for the firmware response to complete
once we've seen the valid response header in the buffer. It turns
out that in some scenarios, this wait time is not long enough.
Extend it to 150 usec and use usleep_range() instead of udelay().
Fixes:
9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 21 Feb 2019 00:07:31 +0000 (19:07 -0500)]
bnxt_en: Fix typo in firmware message timeout logic.
The logic that polls for the firmware message response uses a shorter
sleep interval for the first few passes. But there was a typo so it
was using the wrong counter (larger counter) for these short sleep
passes. The result is a slightly shorter timeout period for these
firmware messages than intended. Fix it by using the proper counter.
Fixes:
9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 22 Feb 2019 23:07:48 +0000 (00:07 +0100)]
Merge branch 'bpf-nfp-codegen-fixes'
Jiong Wang says:
====================
Code-gen for BPF_ALU | BPF_XOR | BPF_K is wrong when imm is -1,
also high 32-bit of 64-bit register should always be cleared.
This set fixed both bugs.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jiong Wang [Fri, 22 Feb 2019 22:36:04 +0000 (22:36 +0000)]
nfp: bpf: fix ALU32 high bits clearance bug
NFP BPF JIT compiler is doing a couple of small optimizations when jitting
ALU imm instructions, some of these optimizations could save code-gen, for
example:
A & -1 = A
A | 0 = A
A ^ 0 = A
However, for ALU32, high 32-bit of the 64-bit register should still be
cleared according to ISA semantics.
Fixes:
cd7df56ed3e6 ("nfp: add BPF to NFP code translator")
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jiong Wang [Fri, 22 Feb 2019 22:36:03 +0000 (22:36 +0000)]
nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
The intended optimization should be A ^ 0 = A, not A ^ -1 = A.
Fixes:
cd7df56ed3e6 ("nfp: add BPF to NFP code translator")
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David S. Miller [Fri, 22 Feb 2019 20:51:21 +0000 (12:51 -0800)]
Merge tag 'mac80211-for-davem-2019-02-22' of git://git./linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Three more fixes:
* mac80211 mesh code wasn't allocating SKB tailroom properly
in some cases
* tx_sk_pacing_shift should be 7 for better performance
* mac80211_hwsim wasn't propagating genlmsg_reply() errors
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Wed, 20 Feb 2019 22:58:50 +0000 (14:58 -0800)]
Documentation: networking: switchdev: Update port parent ID section
Update the section about switchdev drivers having to implement a
switchdev_port_attr_get() function to return
SWITCHDEV_ATTR_ID_PORT_PARENT_ID since that is no longer valid after
commit
bccb30254a4a ("net: Get rid of
SWITCHDEV_ATTR_ID_PORT_PARENT_ID").
Fixes:
bccb30254a4a ("net: Get rid of SWITCHDEV_ATTR_ID_PORT_PARENT_ID")
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jann Horn [Wed, 20 Feb 2019 21:34:54 +0000 (22:34 +0100)]
net: socket: add check for negative optlen in compat setsockopt
__sys_setsockopt() already checks for `optlen < 0`. Add an equivalent check
to the compat path for robustness. This has to be `> INT_MAX` instead of
`< 0` because the signedness of `optlen` is different here.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 20 Feb 2019 17:18:12 +0000 (18:18 +0100)]
ipv6: route: purge exception on removal
When a netdevice is unregistered, we flush the relevant exception
via rt6_sync_down_dev() -> fib6_ifdown() -> fib6_del() -> fib6_del_route().
Finally, we end-up calling rt6_remove_exception(), where we release
the relevant dst, while we keep the references to the related fib6_info and
dev. Such references should be released later when the dst will be
destroyed.
There are a number of caches that can keep the exception around for an
unlimited amount of time - namely dst_cache, possibly even socket cache.
As a result device registration may hang, as demonstrated by this script:
ip netns add cl
ip netns add rt
ip netns add srv
ip netns exec rt sysctl -w net.ipv6.conf.all.forwarding=1
ip link add name cl_veth type veth peer name cl_rt_veth
ip link set dev cl_veth netns cl
ip -n cl link set dev cl_veth up
ip -n cl addr add dev cl_veth 2001::2/64
ip -n cl route add default via 2001::1
ip -n cl link add tunv6 type ip6tnl mode ip6ip6 local 2001::2 remote 2002::1 hoplimit 64 dev cl_veth
ip -n cl link set tunv6 up
ip -n cl addr add 2013::2/64 dev tunv6
ip link set dev cl_rt_veth netns rt
ip -n rt link set dev cl_rt_veth up
ip -n rt addr add dev cl_rt_veth 2001::1/64
ip link add name rt_srv_veth type veth peer name srv_veth
ip link set dev srv_veth netns srv
ip -n srv link set dev srv_veth up
ip -n srv addr add dev srv_veth 2002::1/64
ip -n srv route add default via 2002::2
ip -n srv link add tunv6 type ip6tnl mode ip6ip6 local 2002::1 remote 2001::2 hoplimit 64 dev srv_veth
ip -n srv link set tunv6 up
ip -n srv addr add 2013::1/64 dev tunv6
ip link set dev rt_srv_veth netns rt
ip -n rt link set dev rt_srv_veth up
ip -n rt addr add dev rt_srv_veth 2002::2/64
ip netns exec srv netserver & sleep 0.1
ip netns exec cl ping6 -c 4 2013::1
ip netns exec cl netperf -H 2013::1 -t TCP_STREAM -l 3 & sleep 1
ip -n rt link set dev rt_srv_veth mtu 1400
wait %2
ip -n cl link del cl_veth
This commit addresses the issue purging all the references held by the
exception at time, as we currently do for e.g. ipv6 pcpu dst entries.
v1 -> v2:
- re-order the code to avoid accessing dst and net after dst_dev_put()
Fixes:
93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 22 Feb 2019 19:43:45 +0000 (11:43 -0800)]
Merge branch 'nic-thunderx-fix-communication-races-between-VF-PF'
Vadim Lomovtsev says:
====================
nic: thunderx: fix communication races between VF & PF
The ThunderX CN88XX NIC Virtual Function driver uses mailbox interface
to communicate to physical function driver. Each of VF has it's own pair
of mailbox registers to read from and write to. The mailbox registers
has no protection from possible races, so it has to be implemented
at software side.
After long term testing by loop of 'ip link set <ifname> up/down'
command it was found that there are two possible scenarios when
race condition appears:
1. VF receives link change message from PF and VF send RX mode
configuration message to PF in the same time from separate thread.
2. PF receives RX mode configuration from VF and in the same time,
in separate thread PF detects link status change and sends appropriate
message to particular VF.
Both cases leads to mailbox data to be rewritten, NIC VF messaging control
data to be updated incorrectly and communication sequence gets broken.
This patch series is to address race condition with VF & PF communication.
Changes:
v1 -> v2
- 0000: correct typo in cover letter subject: 'betwen' -> 'between';
- move link state polling request task from pf to vf
instead of cheking status of mailbox irq;
v2 -> v3
- 0003: change return type of nicvf_send_cfg_done() function
from int to void;
- 0007: update subject and remove unused variable 'netdev'
from nicvf_link_status_check_task() function;
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Lomovtsev [Wed, 20 Feb 2019 11:02:45 +0000 (11:02 +0000)]
net: thunderx: remove link change polling code and info from nicpf
Since link change polling routine was moved to nicvf side,
we don't need anymore polling function at nicpf side along
with link status info for all enabled Vfs as at VF side
this info is already tracked.
This commit is to remove unnecessary code & fields from
nicpf structure.
Signed-off-by: Vadim Lomovtsev <vlomovtsev@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Lomovtsev [Wed, 20 Feb 2019 11:02:45 +0000 (11:02 +0000)]
net: thunderx: move link state polling function to VF
Move the link change polling task to VF side in order to
prevent races between VF and PF while sending link change
message(s). This commit is to implement link change request
to be initiated by VF.
Signed-off-by: Vadim Lomovtsev <vlomovtsev@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Lomovtsev [Wed, 20 Feb 2019 11:02:44 +0000 (11:02 +0000)]
net: thunderx: add mutex to protect mailbox from concurrent calls for same VF
In some cases it could happen that nicvf_send_msg_to_pf() could be called
concurrently for the same NIC VF, and thus re-writing mailbox contents and
breaking messaging sequence with PF by re-writing NICVF data.
This commit is to implement mutex for NICVF to protect mailbox registers
and NICVF messaging control data from concurrent access.
Signed-off-by: Vadim Lomovtsev <vlomovtsev@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Lomovtsev [Wed, 20 Feb 2019 11:02:44 +0000 (11:02 +0000)]
net: thunderx: rework xcast message structure to make it fit into 64 bit
To communicate to PF each of ThunderX NIC VF uses mailbox which is
pair of 64 bit registers available to both VFn and PF.
This commit is to change the xcast message structure in order to
fit it into 64 bit.
Signed-off-by: Vadim Lomovtsev <vlomovtsev@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Lomovtsev [Wed, 20 Feb 2019 11:02:44 +0000 (11:02 +0000)]
net: thunderx: add nicvf_send_msg_to_pf result check for set_rx_mode_task
The rx_set_mode invokes number of messages to be send to PF for receive
mode configuration. In case if there any issues we need to stop sending
messages and release allocated memory.
This commit is to implement check of nicvf_msg_send_to_pf() result.
Signed-off-by: Vadim Lomovtsev <vlomovtsev@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Lomovtsev [Wed, 20 Feb 2019 11:02:43 +0000 (11:02 +0000)]
net: thunderx: make CFG_DONE message to run through generic send-ack sequence
At the end of NIC VF initialization VF sends CFG_DONE message to PF without
using nicvf_msg_send_to_pf routine. This potentially could re-write data in
mailbox. This commit is to implement common way of sending CFG_DONE message
by the same way with other configuration messages by using
nicvf_send_msg_to_pf() routine.
Signed-off-by: Vadim Lomovtsev <vlomovtsev@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Lomovtsev [Wed, 20 Feb 2019 11:02:43 +0000 (11:02 +0000)]
net: thunderx: replace global nicvf_rx_mode_wq work queue for all VFs to private for each of them.
Having one work queue for receive mode configuration ndo_set_rx_mode()
call for all VFs results in making each of them wait till the
set_rx_mode() call completes for another VF if any of close, set
receive mode and change flags calls being already invoked. Potentially
this could cause device state change before appropriate call of receive
mode configuration completes, so the call itself became meaningless,
corrupt data or break configuration sequence.
We don't need any delays in NIC VF configuration sequence so having delayed
work call with 0 delay has no sense.
This commit is to implement one work queue for each NIC VF for set_rx_mode
task and to let them work independently and replacing delayed_work
with work_struct.
Signed-off-by: Vadim Lomovtsev <vlomovtsev@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Lomovtsev [Wed, 20 Feb 2019 11:02:43 +0000 (11:02 +0000)]
net: thunderx: correct typo in macro name
Correct STREERING to STEERING at macro name for BGX steering register.
Signed-off-by: Vadim Lomovtsev <vlomovtsev@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Wed, 20 Feb 2019 08:23:03 +0000 (09:23 +0100)]
net: ip6_gre: fix possible NULL pointer dereference in ip6erspan_set_version
Fix a possible NULL pointer dereference in ip6erspan_set_version checking
nlattr data pointer
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 7549 Comm: syz-executor432 Not tainted 5.0.0-rc6-next-
20190218
#37
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:ip6erspan_set_version+0x5c/0x350 net/ipv6/ip6_gre.c:1726
Code: 07 38 d0 7f 08 84 c0 0f 85 9f 02 00 00 49 8d bc 24 b0 00 00 00 c6 43
54 01 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f
85 9a 02 00 00 4d 8b ac 24 b0 00 00 00 4d 85 ed 0f
RSP: 0018:
ffff888089ed7168 EFLAGS:
00010202
RAX:
dffffc0000000000 RBX:
ffff8880869d6e58 RCX:
0000000000000000
RDX:
0000000000000016 RSI:
ffffffff862736b4 RDI:
00000000000000b0
RBP:
ffff888089ed7180 R08:
1ffff11010d3adcb R09:
ffff8880869d6e58
R10:
ffffed1010d3add5 R11:
ffff8880869d6eaf R12:
0000000000000000
R13:
ffffffff8931f8c0 R14:
ffffffff862825d0 R15:
ffff8880869d6e58
FS:
0000000000b3d880(0000) GS:
ffff8880ae900000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000020000184 CR3:
0000000092cc5000 CR4:
00000000001406e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
ip6erspan_newlink+0x66/0x7b0 net/ipv6/ip6_gre.c:2210
__rtnl_newlink+0x107b/0x16c0 net/core/rtnetlink.c:3176
rtnl_newlink+0x69/0xa0 net/core/rtnetlink.c:3234
rtnetlink_rcv_msg+0x465/0xb00 net/core/rtnetlink.c:5192
netlink_rcv_skb+0x17a/0x460 net/netlink/af_netlink.c:2485
rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5210
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0x536/0x720 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1925
sock_sendmsg_nosec net/socket.c:621 [inline]
sock_sendmsg+0xdd/0x130 net/socket.c:631
___sys_sendmsg+0x806/0x930 net/socket.c:2136
__sys_sendmsg+0x105/0x1d0 net/socket.c:2174
__do_sys_sendmsg net/socket.c:2183 [inline]
__se_sys_sendmsg net/socket.c:2181 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2181
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x440159
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:
00007fffa69156e8 EFLAGS:
00000246 ORIG_RAX:
000000000000002e
RAX:
ffffffffffffffda RBX:
00000000004002c8 RCX:
0000000000440159
RDX:
0000000000000000 RSI:
0000000020001340 RDI:
0000000000000003
RBP:
00000000006ca018 R08:
0000000000000001 R09:
00000000004002c8
R10:
0000000000000011 R11:
0000000000000246 R12:
00000000004019e0
R13:
0000000000401a70 R14:
0000000000000000 R15:
0000000000000000
Modules linked in:
---[ end trace
09f8a7d13b4faaa1 ]---
RIP: 0010:ip6erspan_set_version+0x5c/0x350 net/ipv6/ip6_gre.c:1726
Code: 07 38 d0 7f 08 84 c0 0f 85 9f 02 00 00 49 8d bc 24 b0 00 00 00 c6 43
54 01 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f
85 9a 02 00 00 4d 8b ac 24 b0 00 00 00 4d 85 ed 0f
RSP: 0018:
ffff888089ed7168 EFLAGS:
00010202
RAX:
dffffc0000000000 RBX:
ffff8880869d6e58 RCX:
0000000000000000
RDX:
0000000000000016 RSI:
ffffffff862736b4 RDI:
00000000000000b0
RBP:
ffff888089ed7180 R08:
1ffff11010d3adcb R09:
ffff8880869d6e58
R10:
ffffed1010d3add5 R11:
ffff8880869d6eaf R12:
0000000000000000
R13:
ffffffff8931f8c0 R14:
ffffffff862825d0 R15:
ffff8880869d6e58
FS:
0000000000b3d880(0000) GS:
ffff8880ae900000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000020000184 CR3:
0000000092cc5000 CR4:
00000000001406e0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Fixes:
4974d5f678ab ("net: ip6_gre: initialize erspan_ver just for erspan tunnels")
Reported-and-tested-by: syzbot+30191cf1057abd3064af@syzkaller.appspotmail.com
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
George Wilkie [Wed, 20 Feb 2019 08:19:11 +0000 (08:19 +0000)]
team: use operstate consistently for linkup
When a port is added to a team, its initial state is derived
from netif_carrier_ok rather than netif_oper_up.
If it is carrier up but operationally down at the time of being
added, the port state.linkup will be set prematurely.
port state.linkup should be set consistently using
netif_oper_up rather than netif_carrier_ok.
Fixes:
f1d22a1e0595 ("team: account for oper state")
Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Chen [Wed, 20 Feb 2019 05:47:19 +0000 (13:47 +0800)]
r8152: Fix an error on RTL8153-BD MAC Address Passthrough support
RTL8153-BD is used in Dell DA300 type-C dongle.
Added RTL8153-BD support to activate MAC address pass through on DA300.
Apply correction on previously submitted patch in net.git tree.
Signed-off-by: David Chen <david.chen7@dell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Tue, 19 Feb 2019 23:15:30 +0000 (00:15 +0100)]
ipvlan: disallow userns cap_net_admin to change global mode/flags
When running Docker with userns isolation e.g. --userns-remap="default"
and spawning up some containers with CAP_NET_ADMIN under this realm, I
noticed that link changes on ipvlan slave device inside that container
can affect all devices from this ipvlan group which are in other net
namespaces where the container should have no permission to make changes
to, such as the init netns, for example.
This effectively allows to undo ipvlan private mode and switch globally to
bridge mode where slaves can communicate directly without going through
hostns, or it allows to switch between global operation mode (l2/l3/l3s)
for everyone bound to the given ipvlan master device. libnetwork plugin
here is creating an ipvlan master and ipvlan slave in hostns and a slave
each that is moved into the container's netns upon creation event.
* In hostns:
# ip -d a
[...]
8: cilium_host@bond0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.0.1/32 scope link cilium_host
valid_lft forever preferred_lft forever
[...]
* Spawn container & change ipvlan mode setting inside of it:
# docker run -dt --cap-add=NET_ADMIN --network cilium-net --name client -l app=test cilium/netperf
9fff485d69dcb5ce37c9e33ca20a11ccafc236d690105aadbfb77e4f4170879c
# docker exec -ti client ip -d a
[...]
10: cilium0@if4: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
valid_lft forever preferred_lft forever
# docker exec -ti client ip link change link cilium0 name cilium0 type ipvlan mode l2
# docker exec -ti client ip -d a
[...]
10: cilium0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
valid_lft forever preferred_lft forever
* In hostns (mode switched to l2):
# ip -d a
[...]
8: cilium_host@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.0.1/32 scope link cilium_host
valid_lft forever preferred_lft forever
[...]
Same l3 -> l2 switch would also happen by creating another slave inside
the container's network namespace when specifying the existing cilium0
link to derive the actual (bond0) master:
# docker exec -ti client ip link add link cilium0 name cilium1 type ipvlan mode l2
# docker exec -ti client ip -d a
[...]
2: cilium1@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
10: cilium0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
valid_lft forever preferred_lft forever
* In hostns:
# ip -d a
[...]
8: cilium_host@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
ipvlan mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.41.0.1/32 scope link cilium_host
valid_lft forever preferred_lft forever
[...]
One way to mitigate it is to check CAP_NET_ADMIN permissions of
the ipvlan master device's ns, and only then allow to change
mode or flags for all devices bound to it. Above two cases are
then disallowed after the patch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maciej Kwiecien [Fri, 22 Feb 2019 08:45:26 +0000 (09:45 +0100)]
sctp: don't compare hb_timer expire date before starting it
hb_timer might not start at all for a particular transport because its
start is conditional. In a result a node is not sending heartbeats.
Function sctp_transport_reset_hb_timer has two roles:
- initial start of hb_timer for a given transport,
- update expire date of hb_timer for a given transport.
The function is optimized to update timer's expire only if it is before
a new calculated one but this comparison is invalid for a timer which
has not yet started. Such a timer has expire == 0 and if a new expire
value is bigger than (MAX_JIFFIES / 2 + 2) then "time_before" macro will
fail and timer will not start resulting in no heartbeat packets send by
the node.
This was found when association was initialized within first 5 mins
after system boot due to jiffies init value which is near to MAX_JIFFIES.
Test kernel version: 4.9.154 (ARCH=arm)
hb_timer.expire = 0; //initialized, not started timer
new_expire = MAX_JIFFIES / 2 + 2; //or more
time_before(hb_timer.expire, new_expire) == false
Fixes:
ba6f5e33bdbb ("sctp: avoid refreshing heartbeat timer too often")
Reported-by: Marcin Stojek <marcin.stojek@nokia.com>
Tested-by: Marcin Stojek <marcin.stojek@nokia.com>
Signed-off-by: Maciej Kwiecien <maciej.kwiecien@nokia.com>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 22 Feb 2019 18:35:06 +0000 (10:35 -0800)]
Merge tag 'drm-fixes-2019-02-22' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"This contains a single i915 tiled display fix, and a set of
amdgpu/radeon fixes.
i915:
- tiled display fix
amdgpu/radeon:
- runtime PM fix
- bulk moves disable (fix is too large for 5.0)
- a set of display fixes that are all cc'ed stable so we didn't want
to leave them until -next"
* tag 'drm-fixes-2019-02-22' of git://anongit.freedesktop.org/drm/drm:
drm/amdgpu: disable bulk moves for now
drm/amd/display: set clocks to 0 on suspend on dce80
drm/amd/display: fix optimize_bandwidth func pointer for dce80
drm/amd/display: Fix negative cursor pos programming
drm/i915/fbdev: Actually configure untiled displays
drm/amd/display: Raise dispclk value for dce11
drm/amd/display: Fix MST reboot/poweroff sequence
drm/amdgpu: Update sdma golden setting for vega20
drm/amdgpu: Set DPM_FLAG_NEVER_SKIP when enabling PM-runtime
gpu: drm: radeon: Set DPM_FLAG_NEVER_SKIP when enabling PM-runtime
Linus Torvalds [Fri, 22 Feb 2019 18:32:26 +0000 (10:32 -0800)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Small set of three regression fixing patches, things are looking
pretty good here.
- Fix cxgb4 to work again with non-4k page sizes
- NULL pointer oops in SRP during sg_reset"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
iw_cxgb4: cq/qp mask depends on bar2 pages in a host page
cxgb4: Export sge_host_page_size to ulds
RDMA/srp: Rework SCSI device reset handling
Yu Zhang [Thu, 31 Jan 2019 16:09:23 +0000 (00:09 +0800)]
KVM: MMU: record maximum physical address width in kvm_mmu_extended_role
Previously, commit
7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow
MMU reconfiguration is needed") offered some optimization to avoid
the unnecessary reconfiguration. Yet one scenario is broken - when
cpuid changes VM's maximum physical address width, reconfiguration
is needed to reset the reserved bits. Also, the TDP may need to
reset its shadow_root_level when this value is changed.
To fix this, a new field, maxphyaddr, is introduced in the extended
role structure to keep track of the configured guest physical address
width.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yu Zhang [Thu, 31 Jan 2019 16:09:43 +0000 (00:09 +0800)]
kvm: x86: Return LA57 feature based on hardware capability
Previously, 'commit
372fddf70904 ("x86/mm: Introduce the 'no5lvl' kernel
parameter")' cleared X86_FEATURE_LA57 in boot_cpu_data, if Linux chooses
to not run in 5-level paging mode. Yet boot_cpu_data is queried by
do_cpuid_ent() as the host capability later when creating vcpus, and Qemu
will not be able to detect this feature and create VMs with LA57 feature.
As discussed earlier, VMs can still benefit from extended linear address
width, e.g. to enhance features like ASLR. So we would like to fix this,
by return the true hardware capability when Qemu queries.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Fri, 22 Feb 2019 16:45:01 +0000 (17:45 +0100)]
x86/kvm/mmu: fix switch between root and guest MMUs
Commit
14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu") brought one subtle
change: previously, when switching back from L2 to L1, we were resetting
MMU hooks (like mmu->get_cr3()) in kvm_init_mmu() called from
nested_vmx_load_cr3() and now we do that in nested_ept_uninit_mmu_context()
when we re-target vcpu->arch.mmu pointer.
The change itself looks logical: if nested_ept_init_mmu_context() changes
something than nested_ept_uninit_mmu_context() restores it back. There is,
however, one thing: the following call chain:
nested_vmx_load_cr3()
kvm_mmu_new_cr3()
__kvm_mmu_new_cr3()
fast_cr3_switch()
cached_root_available()
now happens with MMU hooks pointing to the new MMU (root MMU in our case)
while previously it was happening with the old one. cached_root_available()
tries to stash current root but it is incorrect to read current CR3 with
mmu->get_cr3(), we need to use old_mmu->get_cr3() which in case we're
switching from L2 to L1 is guest_mmu. (BTW, in shadow page tables case this
is a non-issue because we don't switch MMU).
While we could've tried to guess that we're switching between MMUs and call
the right ->get_cr3() from cached_root_available() this seems to be overly
complicated. Instead, just stash the corresponding CR3 when setting
root_hpa and make cached_root_available() use the stashed value.
Fixes:
14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>