Dave Chinner [Wed, 17 Sep 2025 22:12:54 +0000 (08:12 +1000)]
xfs: rework datasync tracking and execution
Jan Kara reported that the shared ILOCK held across the journal
flush during fdatasync operations slows down O_DSYNC DIO on
unwritten extents significantly. The underlying issue is that
unwritten extent conversion needs the ILOCK exclusive, whilst the
datasync operation after the extent conversion holds it shared.
Hence we cannot be flushing the journal for one IO completion whilst
at the same time doing unwritten extent conversion on another IO
completion on the same inode. This means that IO completions
lock-step, and IO performance is dependent on the journal flush
latency.
Jan demonstrated that reducing the ifdatasync lock hold time can
improve O_DSYNC DIO to unwritten extents performance by 2.5x.
Discussion on that patch found issues with the method, and we
came to the conclusion that separately tracking datasync flush
sequences was the best approach to solving the problem.
The fsync code uses the ILOCK to serialise against concurrent
modifications in the transaction commit phase. In a transaction
commit, there are several disjoint updates to inode log item state
that need to be considered atomically by the fsync code. These
operations are all done under ILOCK_EXCL context:
1. ili_fsync_flags is updated in ->iop_precommit
2. i_pincount is updated in ->iop_pin before it is added to the CIL
3. ili_commit_seq is updated in ->iop_committing, after it has been
added to the CIL
In fsync, we need to:
1. check that the inode is dirty in the journal (ipincount)
2. check that ili_fsync_flags is set
3. grab the ili_commit_seq if a journal flush is needed
4. clear the ili_fsync_flags to ensure that new modifications that
require fsync are tracked in ->iop_precommit correctly
The serialisation of ipincount/ili_commit_seq is needed
to ensure that we don't try to unnecessarily flush the journal.
The serialisation of ili_fsync_flags being set in
->iop_precommit and cleared in fsync post journal flush is
required for correctness.
Hence holding the ILOCK_SHARED in xfs_file_fsync() performs all this
serialisation for us. Ideally, we want to remove the need to hold
the ILOCK_SHARED in xfs_file_fsync() for best performance.
We start with the observation that fsync/fdatasync() only need to
wait for operations that have been completed. Hence operations that
are still being committed have not completed and datasync operations
do not need to wait for them.
This means we can use a single point in time in the commit process
to signal "this modification is complete". This is what
->iop_committing is supposed to provide - it is the point at which
the object is unlocked after the modification has been recorded in
the CIL. Hence we could use ili_commit_seq to determine if we should
flush the journal.
In theory, we can already do this. However, in practice this will
expose an internal global CIL lock to the IO path. The ipincount()
checks optimise away the need to take this lock - if the inode is
not pinned, then it is not in the CIL and we don't need to check if
a journal flush at ili_commit_seq needs to be performed.
The reason this is needed is that the ili_commit_seq is never
cleared. Once it is set, it remains set even once the journal has
been committed and the object has been unpinned. Hence we have to
look that journal internal commit sequence state to determine if
ili_commit_seq needs to be acted on or not.
We can solve this by clearing ili_commit_seq when the inode is
unpinned. If we clear it atomically with the last unpin going away,
then we are guaranteed that new modifications will order correctly
as they add a new pin counts and we won't clear a sequence number
for an active modification in the CIL.
Further, we can then allow the per-transaction flag state to
propagate into ->iop_committing (instead of clearing it in
->iop_precommit) and that will allow us to determine if the
modification needs a full fsync or just a datasync, and so we can
record a separate datasync sequence number (Jan's idea!) and then
use that in the fdatasync path instead of the full fsync sequence
number.
With this infrastructure in place, we no longer need the
ILOCK_SHARED in the fsync path. All serialisation is done against
the commit sequence numbers - if the sequence number is set, then we
have to flush the journal. If it is not set, then we have nothing to
do. This greatly simplifies the fsync implementation....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Dave Chinner [Wed, 17 Sep 2025 22:12:53 +0000 (08:12 +1000)]
xfs: rearrange code in xfs_inode_item_precommit
There are similar extsize checks and updates done inside and outside
the inode item lock, which could all be done under a single top
level logic branch outside the ili_lock. The COW extsize fixup can
potentially miss updating the XFS_ILOG_CORE in ili_fsync_fields, so
moving this code up above the ili_fsync_fields update could also be
considered a fix.
Further, to make the next change a bit cleaner, move where we
calculate the on-disk flag mask to after we attach the cluster
buffer to the the inode log item.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Dmitry Antipov [Thu, 18 Sep 2025 11:14:03 +0000 (14:14 +0300)]
xfs: scrub: use kstrdup_const() for metapath scan setups
Except 'xchk_setup_metapath_rtginode()' case, 'path' argument of
'xchk_setup_metapath_scan()' is a compile-time constant. So it may
be reasonable to use 'kstrdup_const()' / 'kree_const()' to manage
'path' field of 'struct xchk_metapath' in attempt to reuse .rodata
instance rather than making a copy. Compile tested only.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Fri, 19 Sep 2025 13:12:09 +0000 (06:12 -0700)]
xfs: use bt_nr_sectors in xfs_dax_translate_range
Only ranges inside the file system can be translated, and the file system
can be smaller than the containing device.
Fixes:
f4ed93037966 ("xfs: don't shut down the filesystem for media failures beyond end of log")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Fri, 19 Sep 2025 13:12:08 +0000 (06:12 -0700)]
xfs: track the number of blocks in each buftarg
Add a bt_nr_sectors to track the number of sector in each buftarg, and
replace the check that hard codes sb_dblock in xfs_buf_map_verify with
this new value so that it is correct for non-ddev buftargs. The
RT buftarg only has a superblock in the first block, so it is unlikely
to trigger this, or are we likely to ever have enough blocks in the
in-memory buftargs, but we might as well get the check right.
Fixes:
10616b806d1d ("xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Fri, 19 Sep 2025 13:14:11 +0000 (06:14 -0700)]
xfs: constify xfs_errortag_random_default
This table is never modified, so mark it const.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Damien Le Moal [Thu, 18 Sep 2025 13:01:11 +0000 (22:01 +0900)]
xfs: improve default maximum number of open zones
For regular block devices using the zoned allocator, the default
maximum number of open zones is set to 1/4 of the number of realtime
groups. For a large capacity device, this leads to a very large limit.
E.g. with a 26 TB HDD:
mount /dev/sdb /mnt
...
XFS (sdb): 95836 zones of 65536 blocks size (23959 max open)
In turn such large limit on the number of open zones can lead, depending
on the workload, on a very large number of concurrent write streams
which devices generally do not handle well, leading to poor performance.
Introduce the default limit XFS_DEFAULT_MAX_OPEN_ZONES, defined as 128
to match the hardware limit of most SMR HDDs available today, and use
this limit to set mp->m_max_open_zones in xfs_calc_open_zones() instead
of calling xfs_max_open_zones(), when the user did not specify a limit
with the max_open_zones mount option.
For the 26 TB HDD example, we now get:
mount /dev/sdb /mnt
...
XFS (sdb): 95836 zones of 65536 blocks (128 max open zones)
This change does not prevent the user from specifying a lareger number
for the open zones limit. E.g.
mount -o max_open_zones=4096 /dev/sdb /mnt
...
XFS (sdb): 95836 zones of 65536 blocks (4096 max open zones)
Finally, since xfs_calc_open_zones() checks and caps the
mp->m_max_open_zones limit against the value calculated by
xfs_max_open_zones() for any type of device, this new default limit does
not increase m_max_open_zones for small capacity devices.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Damien Le Moal [Thu, 18 Sep 2025 13:01:10 +0000 (22:01 +0900)]
xfs: improve zone statistics message
Reword the information message displayed in xfs_mount_zones()
indicating the total zone count and maximum number of open zones.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 18 Sep 2025 14:34:37 +0000 (07:34 -0700)]
xfs: centralize error tag definitions
Right now 5 places in the kernel and one in xfsprogs need to be updated
for each new error tag. Add a bit of macro magic so that only the
error tag definition and a single table, which reside next to each
other, need to be updated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 18 Sep 2025 14:34:36 +0000 (07:34 -0700)]
xfs: remove pointless externs in xfs_error.h
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 18 Sep 2025 14:34:35 +0000 (07:34 -0700)]
xfs: remove the expr argument to XFS_TEST_ERROR
Don't pass expr to XFS_TEST_ERROR. Most calls pass a constant false,
and the places that do pass an expression become cleaner by moving it
out.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 18 Sep 2025 14:34:34 +0000 (07:34 -0700)]
xfs: remove xfs_errortag_set
xfs_errortag_set is only called by xfs_errortag_attr_store, , which does
not need to validate the error tag, because it can only be called on
valid error tags that had a sysfs attribute registered.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 18 Sep 2025 14:34:33 +0000 (07:34 -0700)]
xfs: remove xfs_errortag_get
xfs_errortag_get is only called by xfs_errortag_attr_show, which does not
need to validate the error tag, because it can only be called on valid
error tags that had a sysfs attribute registered.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:24:13 +0000 (06:24 -0700)]
xfs: move the XLOG_REG_ constants out of xfs_log_format.h
These are purely in-memory values and not used at all in xfsprogs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Hans Holmberg [Mon, 1 Sep 2025 10:52:05 +0000 (10:52 +0000)]
xfs: adjust the hint based zone allocation policy
As we really can't make any general assumptions about files that don't
have any life time hint set or are set to "NONE", adjust the allocation
policy to avoid co-locating data from those files with files with a set
life time.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Hans Holmberg [Mon, 1 Sep 2025 10:52:05 +0000 (10:52 +0000)]
xfs: refactor hint based zone allocation
Replace the co-location code with a matrix that makes it more clear
on how the decisions are made.
The matrix contains scores for zone/file hint combinations. A "GOOD"
score for an open zone will result in immediate co-location while "OK"
combinations will only be picked if we cannot open a new zone.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Hans Holmberg [Mon, 1 Sep 2025 10:52:04 +0000 (10:52 +0000)]
fs: add an enum for number of life time hints
Add WRITE_LIFE_HINT_NR into the rw_hint enum to define the number of
values write life time hints can be set to. This is useful for e.g.
file systems which may want to map these values to allocation groups.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:20:30 +0000 (06:20 -0700)]
xfs: fix log CRC mismatches between i386 and other architectures
When mounting file systems with a log that was dirtied on i386 on
other architectures or vice versa, log recovery is unhappy:
[ 11.068052] XFS (vdb): Torn write (CRC failure) detected at log block 0x2. Truncating head block from 0xc.
This is because the CRCs generated by i386 and other architectures
always diff. The reason for that is that sizeof(struct xlog_rec_header)
returns different values for i386 vs the rest (324 vs 328), because the
struct is not sizeof(uint64_t) aligned, and i386 has odd struct size
alignment rules.
This issue goes back to commit
13cdc853c519 ("Add log versioning, and new
super block field for the log stripe") in the xfs-import tree, which
adds log v2 support and the h_size field that causes the unaligned size.
At that time it only mattered for the crude debug only log header
checksum, but with commit
0e446be44806 ("xfs: add CRC checks to the log")
it became a real issue for v5 file system, because now there is a proper
CRC, and regular builds actually expect it match.
Fix this by allowing checksums with and without the padding.
Fixes:
0e446be44806 ("xfs: add CRC checks to the log")
Cc: <stable@vger.kernel.org> # v3.8
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:20:29 +0000 (06:20 -0700)]
xfs: rename the old_crc variable in xlog_recover_process
old_crc is a very misleading name. Rename it to expected_crc as that
described the usage much better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:27:05 +0000 (06:27 -0700)]
xfs: remove the unused xfs_log_iovec_t typedef
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:27:04 +0000 (06:27 -0700)]
xfs: remove the unused xfs_qoff_logformat_t typedef
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:27:03 +0000 (06:27 -0700)]
xfs: remove the unused xfs_dq_logformat_t typedef
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:27:02 +0000 (06:27 -0700)]
xfs: remove the unused xfs_buf_log_format_t typedef
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:27:01 +0000 (06:27 -0700)]
xfs: remove the unused xfs_efd_log_format_64_t typedef
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:27:00 +0000 (06:27 -0700)]
xfs: remove the unused xfs_efd_log_format_32_t typedef
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:26:59 +0000 (06:26 -0700)]
xfs: remove the xfs_efd_log_format_t typedef
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:26:58 +0000 (06:26 -0700)]
xfs: remove the xfs_efi_log_format_64_t typedef
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:26:57 +0000 (06:26 -0700)]
xfs: remove the xfs_efi_log_format_32_t typedef
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:26:56 +0000 (06:26 -0700)]
xfs: remove the xfs_efi_log_format_t typedef
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:26:55 +0000 (06:26 -0700)]
xfs: remove the xfs_extent64_t typedef
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:26:54 +0000 (06:26 -0700)]
xfs: remove the xfs_extent32_t typedef
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:26:53 +0000 (06:26 -0700)]
xfs: remove the xfs_extent_t typedef
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Also fix up the comment about the struct xfs_extent definition to be
correct and read more easily.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:26:52 +0000 (06:26 -0700)]
xfs: remove the xfs_trans_header_t typedef
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 15 Sep 2025 13:26:51 +0000 (06:26 -0700)]
xfs: remove the xlog_op_header_t typedef
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Bagas Sanjaya [Tue, 9 Sep 2025 00:04:31 +0000 (07:04 +0700)]
xfs: extend removed sysctls table
Commit
21d59d00221e4e ("xfs: remove deprecated sysctl knobs") moves
recently-removed sysctls to the removed sysctls table but fails to
extend the table, hence triggering Sphinx warning:
Documentation/admin-guide/xfs.rst:365: ERROR: Malformed table.
Text in column margin in table line 8.
============================= =======
Name Removed
============================= =======
fs.xfs.xfsbufd_centisec v4.0
fs.xfs.age_buffer_centisecs v4.0
fs.xfs.irix_symlink_mode v6.18
fs.xfs.irix_sgid_inherit v6.18
fs.xfs.speculative_cow_prealloc_lifetime v6.18
============================= ======= [docutils]
Extend "Name" column of the table to fit the now-longest sysctl, which
is fs.xfs.speculative_cow_prealloc_lifetime.
Fixes:
21d59d00221e ("xfs: remove deprecated sysctl knobs")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/
20250908180406.
32124fb7@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Carlos Maiolino [Fri, 5 Sep 2025 18:27:25 +0000 (20:27 +0200)]
Merge tag 'kconfig-2025-changes_2025-09-05' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.18-merge
xfs: kconfig and feature changes for 2025 LTS [6.18 v2 2/2]
Ahead of the 2025 LTS kernel, disable by default the two features that
we promised to turn off in September 2025: V4 filesystems, and the
long-broken ASCII case insensitive directories.
Since online fsck has not had any major issues in the 16 months since it
was merged upstream, let's also turn that on by default.
With a bit of luck, this should all go splendidly.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Carlos Maiolino [Fri, 5 Sep 2025 18:26:42 +0000 (20:26 +0200)]
Merge tag 'fix-scrub-reap-calculations_2025-09-05' of https://git./linux/kernel/git/djwong/xfs-linux into xfs-6.18-merge
xfs: improve online repair reap calculations [6.18 v2 1/2]
A few months ago, the multi-fsblock untorn writes patchset added a bunch
of log intent item helper functions to estimate the number of intent
items that could be added to a particular transaction. Those helpers
enabled us to compute a safe upper bound on the number of blocks that
could be written in an untorn fashion with filesystem-provided out of
place writes.
Currently, the online fsck code employs static limits on the number of
intent items that it's willing to accrue to a single transaction when
it's trying to reap what it thinks are the old blocks from a corrupt
structure. There have been no problems reported with this approach
after years of testing, but static limits are scary and gross because
overestimating the intent item limit could result in transaction
overflows and dead filesystems; and underestimating causes unnecessary
overhead.
This series uses the new log intent item size helpers to estimate the
limits dynamically based on worst-case per-block repair work vs. the
size of the scrub transaction. After several months of testing this,
there don't seem to be any problems here either.
v2: rearrange patches, add review tags
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 27 Aug 2025 16:31:59 +0000 (09:31 -0700)]
xfs: enable online fsck by default in Kconfig
Online fsck has been a part of upstream for over a year now without any
serious problems. Turn it on by default in time for the 2025 LTS
kernel, and get rid of the "say N if unsure" messages for the default Y
options.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Darrick J. Wong [Wed, 9 Apr 2025 01:04:57 +0000 (18:04 -0700)]
xfs: use deferred reaping for data device cow extents
Don't roll the whole transaction after every extent, that's rather
inefficient.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 2 Sep 2025 21:38:27 +0000 (14:38 -0700)]
xfs: remove deprecated sysctl knobs
These sysctl knobs were scheduled for removal in September 2025. That
time has come, so remove them.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Darrick J. Wong [Tue, 8 Apr 2025 23:14:36 +0000 (16:14 -0700)]
xfs: remove static reap limits from repair.h
Delete XREAP_MAX_BINVAL and XREAP_MAX_DEFER_CHAIN because the reap code
now calculates those limits dynamically, so they're no longer needed.
Move the third limit (XREP_MAX_ITRUNCATE_EFIS) to the one file that uses
it. Note that the btree rebuilding code should reserve exactly the
number of blocks needed to rebuild a btree, so it is rare that the newbt
code will need to add any EFIs to the commit transaction. That's why
that static limit remains.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 2 Sep 2025 21:33:53 +0000 (14:33 -0700)]
xfs: remove deprecated mount options
These four mount options were scheduled for removal in September 2025,
so remove them now.
Cc: preichl@redhat.com
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Darrick J. Wong [Wed, 27 Aug 2025 16:30:58 +0000 (09:30 -0700)]
xfs: disable deprecated features by default in Kconfig
We promised to turn off these old features by default in September 2025.
Do so now.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Darrick J. Wong [Tue, 8 Apr 2025 23:14:35 +0000 (16:14 -0700)]
xfs: compute file mapping reap limits dynamically
Reaping file fork mappings is a little different -- log recovery can
free the blocks for us, so we only try to process a single mapping at a
time. Therefore, we only need to figure out the maximum number of
blocks that we can invalidate in a single transaction.
The rough calculation here is:
nr_extents = (logres - reservation used by any one step) /
(space used per binval)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 8 Apr 2025 23:14:34 +0000 (16:14 -0700)]
xfs: compute realtime device CoW staging extent reap limits dynamically
Calculate the maximum number of CoW staging extents that can be reaped
in a single transaction chain. The rough calculation here is:
nr_extents = (logres - reservation used by any one step) /
(space used by intents per extent)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 8 Apr 2025 23:14:33 +0000 (16:14 -0700)]
xfs: compute data device CoW staging extent reap limits dynamically
Calculate the maximum number of CoW staging extents that can be reaped
in a single transaction chain. The rough calculation here is:
nr_extents = (logres - reservation used by any one step) /
(space used by intents per extent +
space used for a few buffer invalidations)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 8 Apr 2025 23:14:33 +0000 (16:14 -0700)]
xfs: compute per-AG extent reap limits dynamically
Calculate the maximum number of extents that can be reaped in a single
transaction chain, and the number of buffers that can be invalidated in
a single transaction. The rough calculation here is:
nr_extents = (logres - reservation used by any one step) /
(space used by intents per extent +
space used per binval)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 8 Apr 2025 23:14:31 +0000 (16:14 -0700)]
xfs: convert the ifork reap code to use xreap_state
Convert the file fork reaping code to use struct xreap_state so that we
can reuse the dynamic state tracking code.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 8 Apr 2025 23:14:30 +0000 (16:14 -0700)]
xfs: prepare reaping code for dynamic limits
The online repair block reaping code employs static limits to decide if
it's time to roll the transaction or finish the deferred item chains to
avoid overflowing the scrub transaction's reservation. However, the
use of static limits aren't great -- btree blocks are assumed to be
scattered around the AG and the buffers need to be invalidated, whereas
COW staging extents are usually contiguous and do not have buffers. We
would like to configure the limits dynamically.
To get ready for this, reorganize struct xreap_state to store dynamic
limits, and add helpers to hide some of the details of how the limits
are enforced. Also rename the "xreap roll" functions to include the
word "binval" because they only exist to decide when we should roll the
transaction to deal with buffer invalidations.
No functional changes intended here.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 8 Apr 2025 23:14:32 +0000 (16:14 -0700)]
xfs: use deferred intent items for reaping crosslinked blocks
When we're removing rmap records for crosslinked blocks, use deferred
intent items so that we can try to free/unmap as many of the old data
structure's blocks as we can in the same transaction as the commit.
Cc: <stable@vger.kernel.org> # v6.6
Fixes:
1c7ce115e52106 ("xfs: reap large AG metadata extents when possible")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Marcelo Moreira [Wed, 27 Aug 2025 22:17:07 +0000 (19:17 -0300)]
xfs: Replace strncpy with memcpy
The changes modernizes the code by aligning it with current kernel best
practices. It improves code clarity and consistency, as strncpy is deprecated
as explained in Documentation/process/deprecated.rst. This change does
not alter the functionality or introduce any behavioral changes.
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Marcelo Moreira <marcelomoreira1905@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Linus Torvalds [Sun, 31 Aug 2025 22:33:07 +0000 (15:33 -0700)]
Linux 6.17-rc4
Linus Torvalds [Sun, 31 Aug 2025 16:20:17 +0000 (09:20 -0700)]
Merge tag 'x86_urgent_for_v6.17_rc4' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Convert the SSB mitigation to the attack vector controls which got
forgotten at the time
- Prevent the CPUID topology hierarchy detection on AMD from
overwriting the correct initial APIC ID
- Fix the case of a machine shipping without microcode in the BIOS, in
the AMD microcode loader
- Correct the Pentium 4 model range which has a constant TSC
* tag 'x86_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Add attack vector controls for SSB
x86/cpu/topology: Use initial APIC ID from XTOPOLOGY leaf on AMD/HYGON
x86/microcode/AMD: Handle the case of no BIOS microcode
x86/cpu/intel: Fix the constant_tsc model check for Pentium 4
Linus Torvalds [Sun, 31 Aug 2025 16:13:00 +0000 (09:13 -0700)]
Merge tag 'sched_urgent_for_v6.17_rc4' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix a stall on the CPU offline path due to mis-counting a deadline
server task twice as part of the runqueue's running tasks count
- Fix a realtime tasks starvation case where failure to enqueue a timer
whose expiration time is already in the past would cause repeated
attempts to re-enqueue a deadline server task which leads to starving
the former, realtime one
- Prevent a delayed deadline server task stop from breaking the
per-runqueue bandwidth tracking
- Have a function checking whether the deadline server task has
stopped, return the correct value
* tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Don't count nr_running for dl_server proxy tasks
sched/deadline: Fix RT task potential starvation when expiry time passed
sched/deadline: Always stop dl-server before changing parameters
sched/deadline: Fix dl_server_stopped()
Linus Torvalds [Sun, 31 Aug 2025 16:07:37 +0000 (09:07 -0700)]
Merge tag 'irq_urgent_for_v6.17_rc4' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Remove unnecessary and noisy WARN_ONs in gic-v5's init path
- Avoid a kmemleak false positive for the gic-v5's L2 IST table entries
- Fix a retval check in mvebu-gicp's probe function
- Fix a wrong conversion to guards in atmel-aic[5] irqchip
* tag 'irq_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v5: Remove undue WARN_ON()s in the IRS affinity parsing
irqchip/gic-v5: Fix kmemleak L2 IST table entries false positives
irqchip/mvebu-gicp: Fix an IS_ERR() vs NULL check in probe()
irqchip/atmel-aic[5]: Fix incorrect lock guard conversion
Linus Torvalds [Sun, 31 Aug 2025 15:56:45 +0000 (08:56 -0700)]
Merge tag 'hardening-v6.17-rc4' of git://git./linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- ARM: stacktrace: include asm/sections.h in asm/stacktrace.h (Arnd
Bergmann)
- ubsan: Fix incorrect hand-side used in handle (Junhui Pei)
- hardening: Require clang 20.1.0 for __counted_by (Nathan Chancellor)
* tag 'hardening-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
hardening: Require clang 20.1.0 for __counted_by
ARM: stacktrace: include asm/sections.h in asm/stacktrace.h
ubsan: Fix incorrect hand-side used in handle
Linus Torvalds [Sun, 31 Aug 2025 15:49:55 +0000 (08:49 -0700)]
Merge tag 'gpio-fixes-for-v6.17-rc4' of git://git./linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix an off-by-one bug in interrupt handling in gpio-timberdale
- update MAINTAINERS
* tag 'gpio-fixes-for-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: Change Altera-PIO driver maintainer
gpio: timberdale: fix off-by-one in IRQ type boundary check
Linus Torvalds [Sat, 30 Aug 2025 17:43:53 +0000 (10:43 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- CFI failure due to kpti_ng_pgd_alloc() signature mismatch
- Underallocation bug in the SVE ptrace kselftest
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
kselftest/arm64: Don't open code SVE_PT_SIZE() in fp-ptrace
arm64: mm: Fix CFI failure due to kpti_ng_pgd_alloc function signature
Mark Brown [Tue, 12 Aug 2025 14:49:27 +0000 (15:49 +0100)]
kselftest/arm64: Don't open code SVE_PT_SIZE() in fp-ptrace
In fp-trace when allocating a buffer to write SVE register data we open
code the addition of the header size to the VL depeendent register data
size, which lead to an underallocation bug when we cut'n'pasted the code
for FPSIMD format writes. Use the SVE_PT_SIZE() macro that the kernel
UAPI provides for this.
Fixes:
b84d2b27954f ("kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250812-arm64-fp-trace-macro-v1-1-317cfff986a5@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Kees Cook [Fri, 29 Aug 2025 19:07:25 +0000 (12:07 -0700)]
arm64: mm: Fix CFI failure due to kpti_ng_pgd_alloc function signature
Seen during KPTI initialization:
CFI failure at create_kpti_ng_temp_pgd+0x124/0xce8 (target: kpti_ng_pgd_alloc+0x0/0x14; expected type: 0xd61b88b6)
The call site is alloc_init_pud() at arch/arm64/mm/mmu.c:
pud_phys = pgtable_alloc(TABLE_PUD);
alloc_init_pud() has the prototype:
static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
int flags)
where the pgtable_alloc() prototype is declared.
The target (kpti_ng_pgd_alloc) is used in arch/arm64/kernel/cpufeature.c:
create_kpti_ng_temp_pgd(kpti_ng_temp_pgd, __pa(alloc), KPTI_NG_TEMP_VA,
PAGE_SIZE, PAGE_KERNEL, kpti_ng_pgd_alloc, 0);
which is an alias for __create_pgd_mapping_locked() with prototype:
extern __alias(__create_pgd_mapping_locked)
void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys,
unsigned long virt,
phys_addr_t size, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
int flags);
__create_pgd_mapping_locked() passes the function pointer down:
__create_pgd_mapping_locked() -> alloc_init_p4d() -> alloc_init_pud()
But the target function (kpti_ng_pgd_alloc) has the wrong signature:
static phys_addr_t __init kpti_ng_pgd_alloc(int shift);
The "int" should be "enum pgtable_type".
To make "enum pgtable_type" available to cpufeature.c, move
enum pgtable_type definition from arch/arm64/mm/mmu.c to
arch/arm64/include/asm/mmu.h.
Adjust kpti_ng_pgd_alloc to use "enum pgtable_type" instead of "int".
The function behavior remains identical (parameter is unused).
Fixes:
c64f46ee1377 ("arm64: mm: use enum to identify pgtable level instead of *_SHIFT")
Cc: <stable@vger.kernel.org> # 6.16.x
Signed-off-by: Kees Cook <kees@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250829190721.it.373-kees@kernel.org
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Linus Torvalds [Fri, 29 Aug 2025 20:54:26 +0000 (13:54 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Correctly handle 'invariant' system registers for protected VMs
- Improved handling of VNCR data aborts, including external aborts
- Fixes for handling of FEAT_RAS for NV guests, providing a sane
fault context during SEA injection and preventing the use of
RASv1p1 fault injection hardware
- Ensure that page table destruction when a VM is destroyed gives an
opportunity to reschedule
- Large fix to KVM's infrastructure for managing guest context loaded
on the CPU, addressing issues where the output of AT emulation
doesn't get reflected to the guest
- Fix AT S12 emulation to actually perform stage-2 translation when
necessary
- Avoid attempting vLPI irqbypass when GICv4 has been explicitly
disabled for a VM
- Minor KVM + selftest fixes
RISC-V:
- Fix pte settings within kvm_riscv_gstage_ioremap()
- Fix comments in kvm_riscv_check_vcpu_requests()
- Fix stack overrun when setting vlenb via ONE_REG
x86:
- Use array_index_nospec() to sanitize the target vCPU ID when
handling PV IPIs and yields as the ID is guest-controlled.
- Drop a superfluous cpumask_empty() check when reclaiming SEV
memory, as the common case, by far, is that at least one CPU will
have entered the VM, and wbnoinvd_on_cpus_mask() will naturally
handle the rare case where the set of have_run_cpus is empty.
Selftests (not KVM):
- Rename the is_signed_type() macro in kselftest_harness.h to
is_signed_var() to fix a collision with linux/overflow.h. The
collision generates compiler warnings due to the two macros having
different meaning"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (29 commits)
KVM: arm64: nv: Fix ATS12 handling of single-stage translation
KVM: arm64: Remove __vcpu_{read,write}_sys_reg_{from,to}_cpu()
KVM: arm64: Fix vcpu_{read,write}_sys_reg() accessors
KVM: arm64: Simplify sysreg access on exception delivery
KVM: arm64: Check for SYSREGS_ON_CPU before accessing the 32bit state
RISC-V: KVM: fix stack overrun when loading vlenb
RISC-V: KVM: Correct kvm_riscv_check_vcpu_requests() comment
RISC-V: KVM: Fix pte settings within kvm_riscv_gstage_ioremap()
KVM: arm64: selftests: Sync ID_AA64MMFR3_EL1 in set_id_regs
KVM: arm64: Get rid of ARM64_FEATURE_MASK()
KVM: arm64: Make ID_AA64PFR1_EL1.RAS_frac writable
KVM: arm64: Make ID_AA64PFR0_EL1.RAS writable
KVM: arm64: Ignore HCR_EL2.FIEN set by L1 guest's EL2
KVM: arm64: Handle RASv1p1 registers
arm64: Add capability denoting FEAT_RASv1p1
KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
KVM: arm64: Split kvm_pgtable_stage2_destroy()
selftests: harness: Rename is_signed_type() to avoid collision with overflow.h
KVM: SEV: don't check have_run_cpus in sev_writeback_caches()
KVM: arm64: Correctly populate FAR_EL2 on nested SEA injection
...
Nathan Chancellor [Thu, 7 Aug 2025 21:36:28 +0000 (14:36 -0700)]
hardening: Require clang 20.1.0 for __counted_by
After an innocuous change in -next that modified a structure that
contains __counted_by, clang-19 start crashing when building certain
files in drivers/gpu/drm/xe. When assertions are enabled, the more
descriptive failure is:
clang: clang/lib/AST/RecordLayoutBuilder.cpp:3335: const ASTRecordLayout &clang::ASTContext::getASTRecordLayout(const RecordDecl *) const: Assertion `D && "Cannot get layout of forward declarations!"' failed.
According to a reverse bisect, a tangential change to the LLVM IR
generation phase of clang during the LLVM 20 development cycle [1]
resolves this problem. Bump the version of clang that enables
CONFIG_CC_HAS_COUNTED_BY to 20.1.0 to ensure that this issue cannot be
hit.
Link: https://github.com/llvm/llvm-project/commit/160fb1121cdf703c3ef5e61fb26c5659eb581489
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20250807-fix-counted_by-clang-19-v1-1-902c86c1d515@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
Paolo Bonzini [Fri, 29 Aug 2025 16:57:31 +0000 (12:57 -0400)]
Merge tag 'kvmarm-fixes-6.17-1' of https://git./linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.17, take #2
- Correctly handle 'invariant' system registers for protected VMs
- Improved handling of VNCR data aborts, including external aborts
- Fixes for handling of FEAT_RAS for NV guests, providing a sane
fault context during SEA injection and preventing the use of
RASv1p1 fault injection hardware
- Ensure that page table destruction when a VM is destroyed gives an
opportunity to reschedule
- Large fix to KVM's infrastructure for managing guest context loaded
on the CPU, addressing issues where the output of AT emulation
doesn't get reflected to the guest
- Fix AT S12 emulation to actually perform stage-2 translation when
necessary
- Avoid attempting vLPI irqbypass when GICv4 has been explicitly
disabled for a VM
- Minor KVM + selftest fixes
Paolo Bonzini [Fri, 29 Aug 2025 16:57:18 +0000 (12:57 -0400)]
Merge tag 'kvm-riscv-fixes-6.17-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv fixes for 6.17, take #1
- Fix pte settings within kvm_riscv_gstage_ioremap()
- Fix comments in kvm_riscv_check_vcpu_requests()
- Fix stack overrun when setting vlenb via ONE_REG
Linus Torvalds [Fri, 29 Aug 2025 16:15:46 +0000 (09:15 -0700)]
Merge tag 'efi-fixes-for-v6.17-1' of git://git./linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Assorted fixes for the OP-TEE based pseudo-EFI variable store
- Fix for an OOB access when looking up the same non-existing efivarfs
entry multiple times in parallel
* tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
efi: stmm: Drop unneeded null pointer check
efi: stmm: Drop unused EFI error from setup_mm_hdr arguments
efi: stmm: Do not return EFI_OUT_OF_RESOURCES on internal errors
efi: stmm: Fix incorrect buffer allocation method
Linus Torvalds [Fri, 29 Aug 2025 15:51:34 +0000 (08:51 -0700)]
Merge tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Fix possible refcount leak in compound operations
- Fix remap_file_range() return code mapping, found by generic/157
* tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
fs/smb: Fix inconsistent refcnt update
smb3 client: fix return code mapping of remap_file_range
Linus Torvalds [Fri, 29 Aug 2025 15:09:34 +0000 (08:09 -0700)]
Merge tag 'xfs-fixes-6.17-rc4' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
"The highlight I'd like to point here is related to the XFS_RT
Kconfig, which has been updated to be enabled by default now if
CONFIG_BLK_DEV_ZONED is enabled.
This also contains a few fixes for zoned devices support in XFS,
specially related to swapon requests in inodes belonging to the zoned
FS.
A null-ptr dereference fix in the xattr data, due to a mishandling of
medium errors generated by block devices is also included"
* tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: do not propagate ENODATA disk errors into xattr code
xfs: reject swapon for inodes on a zoned file system earlier
xfs: kick off inodegc when failing to reserve zoned blocks
xfs: remove xfs_last_used_zone
xfs: Default XFS_RT to Y if CONFIG_BLK_DEV_ZONED is enabled
Linus Torvalds [Fri, 29 Aug 2025 14:44:14 +0000 (07:44 -0700)]
Merge tag 'hid-for-linus-
2025082901' of git://git./linux/kernel/git/hid/hid
Pull HID fixes from Jiri Kosina:
- fixes for memory corruption in intel-thc-hid, hid-multitouch,
hid-mcp2221 and hid-asus (Aaron Ma, Qasim Ijaz, Arnaud Lecomte)
- power management/resume fix for intel-ish-hid (Zhang Lixu)
- driver reinitialization fix for intel-thc-hid (Even Xu)
- ensure that battery level status is reported as soon as possible,
which is required at least for some Android use-cases (José Expósito)
- quite a few new device ID additions and device-specific quirks
* tag 'hid-for-linus-
2025082901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: quirks: add support for Legion Go dual dinput modes
HID: elecom: add support for ELECOM M-DT2DRBK
HID: logitech: Add ids for G PRO 2 LIGHTSPEED
HID: input: report battery status changes immediately
HID: input: rename hidinput_set_battery_charge_status()
HID: intel-thc-hid: Intel-quicki2c: Enhance driver re-install flow
HID: hid-ntrig: fix unable to handle page fault in ntrig_report_version()
HID: asus: fix UAF via HID_CLAIMED_INPUT validation
hid: fix I2C read buffer overflow in raw_event() for mcp2221
HID: wacom: Add a new Art Pen 2
HID: multitouch: fix slab out-of-bounds access in mt_report_fixup()
HID: Kconfig: Fix spelling mistake "enthropy" -> "entropy"
HID: intel-ish-hid: Increase ISHTP resume ack timeout to 300ms
HID: intel-thc-hid: intel-thc: Fix incorrect pointer arithmetic in I2C regs save
HID: intel-thc-hid: intel-quicki2c: Fix ACPI dsd ICRS/ISUB length
Linus Torvalds [Fri, 29 Aug 2025 14:37:21 +0000 (07:37 -0700)]
Merge tag 'regulator-fix-v6.17-rc3' of git://git./linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"One simple fix for the pm8008 driver for poor error handling,
switching to use a helper which does the right thing in the
affected case"
* tag 'regulator-fix-v6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: pm8008: fix probe failure due to negative voltage selector
Linus Torvalds [Fri, 29 Aug 2025 14:29:17 +0000 (07:29 -0700)]
Merge tag 'ata-6.17-rc4' of git://git./linux/kernel/git/libata/linux
Pull ata fixes from Damien Le Moal:
- Fix the type of return values to be signed in the ahci_xgen driver
(Qianfeng)
- Add the mask_port_ext module parameter to the ahci driver.
This is to allow a user to ignore ports that are advertized as
external (hotplug capable) in favor of lower link power management
policies instead of the default max_performance for these ports.
This is useful to allow e.g. laptops to go into low power states when
hooked up to docking station with sata slots, connected with an
external port for hotplug (me)
* tag 'ata-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: ahci_xgene: Use int type for 'rc' to store error codes
ata: ahci: Allow ignoring the external/hotplug capability of ports
Linus Torvalds [Fri, 29 Aug 2025 02:56:32 +0000 (19:56 -0700)]
Merge tag 'drm-fixes-2025-08-29' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly fixes, feels a bit big.
The major piece is msm fixes, then the usual amdgpu/xe along with some
mediatek and nouveau fixes and a tegra revert.
gpuvm:
- fix some typos
xe:
- Fix user-fence race issue
- Couple xe_vm fixes
- Don't trigger rebind on initial dma-buf validation
- Fix a build issue related to basename() posix vs gnu discrepancy
amdgpu:
- pin buffers while vmapping
- UserQ fixes
- Revert CSA fix
- SR-IOV fix
nouveau:
- fix linear modifier
- remove some dead code
msm:
- Core/GPU:
- fix comment doc warning in gpuvm
- fix build with KMS disabled
- fix pgtable setup/teardown race
- global fault counter fix
- various error path fixes
- GPU devcoredump snapshot fixes
- handle in-place VM_BIND remaps to solve turnip vm update race
- skip re-emitting IBs for unusable VMs
- Don't use %pK through printk
- moved display snapshot init earlier, fixing a crash
- DPU:
- Fixed crash in virtual plane checking code
- Fixed mode comparison in virtual plane checking code
- DSI:
- Adjusted width of resulution-related registers
- Fixed locking issue on 14nm PLLs
- UBWC (per Bjorn's ack)
- Added UBWC configuration for several missing platforms (fixing
regression)
mediatek:
- Add error handling for old state CRTC in atomic_disable
- Fix DSI host and panel bridge pre-enable order
- Fix device/node reference count leaks in mtk_drm_get_all_drm_priv
- mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
tegra:
- revert dma-buf change"
* tag 'drm-fixes-2025-08-29' of https://gitlab.freedesktop.org/drm/kernel: (56 commits)
drm/mediatek: mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
drm/amdgpu/userq: fix error handling of invalid doorbell
drm/amdgpu: update firmware version checks for user queue support
drm/amd/amdgpu: disable hwmon power1_cap* for gfx 11.0.3 on vf mode
Revert "drm/amdgpu: fix incorrect vm flags to map bo"
drm/amdgpu/gfx12: set MQD as appriopriate for queue types
drm/amdgpu/gfx11: set MQD as appriopriate for queue types
drm/xe: switch to local xbasename() helper
drm/xe: Don't trigger rebind on initial dma-buf validation
drm/xe/vm: Clear the scratch_pt pointer on error
drm/xe/vm: Don't pin the vm_resv during validation
drm/xe/xe_sync: avoid race during ufence signaling
Revert "drm/tegra: Use dma_buf from GEM object instance"
soc: qcom: use no-UBWC config for MSM8956/76
soc: qcom: add configuration for MSM8929
soc: qcom: ubwc: add more missing platforms
soc: qcom: ubwc: use no-uwbc config for MSM8917
drm/msm/dpu: Add a null ptr check for dpu_encoder_needs_modeset
dt-bindings: display/msm: qcom,mdp5: drop lut clock
drm/gpuvm: fix various typos in .c and .h gpuvm file
...
Linus Torvalds [Fri, 29 Aug 2025 01:51:28 +0000 (18:51 -0700)]
Merge tag 'block-6.17-
20250828' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix a lockdep spotted issue on recursive locking for zoned writes, in
case of errors
- Update bcache MAINTAINERS entry address for Coly
- Fix for a ublk release issue, with selftests
- Fix for a regression introduced in this cycle, where it assumed
q->rq_qos was always set if the bio flag indicated that
- Fix for a regression introduced in this cycle, where loop retrieving
block device sizes got broken
* tag 'block-6.17-
20250828' of git://git.kernel.dk/linux:
bcache: change maintainer's email address
ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device
ublk: avoid ublk_io_release() called after ublk char dev is closed
block: validate QoS before calling __rq_qos_done_bio()
blk-zoned: Fix a lockdep complaint about recursive locking
loop: fix zero sized loop for block special file
Linus Torvalds [Fri, 29 Aug 2025 01:41:53 +0000 (18:41 -0700)]
Merge tag 'io_uring-6.17-
20250828' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Use the proper type for min_t() in getting the min of the leftover
bytes and the buffer length.
- As good practice, use READ_ONCE() consistently for reading ring
provided buffer lengths. Additionally, stop looping for incremental
commits if a zero sized buffer is hit, as no further progress can be
made at that point.
* tag 'io_uring-6.17-
20250828' of git://git.kernel.dk/linux:
io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths
io_uring/kbuf: fix signedness in this_len calculation
Linus Torvalds [Fri, 29 Aug 2025 00:35:51 +0000 (17:35 -0700)]
Merge tag 'net-6.17-rc4' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth.
Current release - regressions:
- ipv4: fix regression in local-broadcast routes
- vsock: fix error-handling regression introduced in v6.17-rc1
Previous releases - regressions:
- bluetooth:
- mark connection as closed during suspend disconnect
- fix set_local_name race condition
- eth:
- ice: fix NULL pointer dereference on reset
- mlx5: fix memory leak in hws_pool_buddy_init error path
- bnxt_en: fix stats context reservation logic
- hv: fix loss of receive events from host during channel open
Previous releases - always broken:
- page_pool: fix incorrect mp_ops error handling
- sctp: initialize more fields in sctp_v6_from_sk()
- eth:
- octeontx2-vf: fix max packet length errors
- idpf: fix Tx flow scheduling to avoid Tx timeouts
- bnxt_en: fix memory corruption during ifdown
- ice: fix incorrect counter for buffer allocation failures
- mlx5: fix lockdep assertion on sync reset unload event
- fbnic: fixup rtnl_lock and devl_lock handling
- xgmac: do not enable RX FIFO overflow interrupts
- phy: mscc: fix when PTP clock is register and unregister
Misc:
- add Telit Cinterion LE910C4-WWX new compositions"
* tag 'net-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (60 commits)
net: ipv4: fix regression in local-broadcast routes
net: macb: Disable clocks once
fbnic: Move phylink resume out of service_task and into open/close
fbnic: Fixup rtnl_lock and devl_lock handling related to mailbox code
net: rose: fix a typo in rose_clear_routes()
l2tp: do not use sock_hold() in pppol2tp_session_get_sock()
sctp: initialize more fields in sctp_v6_from_sk()
MAINTAINERS: rmnet: Update email addresses
net: rose: include node references in rose_neigh refcount
net: rose: convert 'use' field to refcount_t
net: rose: split remove and free operations in rose_remove_neigh()
net: hv_netvsc: fix loss of early receive events from host during channel open.
net: stmmac: Set CIC bit only for TX queues with COE
net: stmmac: xgmac: Correct supported speed modes
net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts
net/mlx5e: Set local Xoff after FW update
net/mlx5e: Update and set Xon/Xoff upon port speed set
net/mlx5e: Update and set Xon/Xoff upon MTU set
net/mlx5: Prevent flow steering mode changes in switchdev mode
net/mlx5: Nack sync reset when SFs are present
...
Dave Airlie [Fri, 29 Aug 2025 00:04:26 +0000 (10:04 +1000)]
Merge tag 'mediatek-drm-fixes-
20250829' of https://git./linux/kernel/git/chunkuang.hu/linux into drm-fixes
Mediatek DRM Fixes -
20250829
1. Add error handling for old state CRTC in atomic_disable
2. Fix DSI host and panel bridge pre-enable order
3. Fix device/node reference count leaks in mtk_drm_get_all_drm_priv
4. mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Link: https://lore.kernel.org/r/20250828234116.4960-1-chunkuang.hu@kernel.org
Linus Torvalds [Thu, 28 Aug 2025 23:34:32 +0000 (16:34 -0700)]
Merge tag 'pm-6.17-rc4' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Add missing locking annotations to two recently introduced
list_for_each_entry_rcu() loops in the core device suspend/resume
code (Johannes Berg)"
* tag 'pm-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: annotate RCU list iterations
Louis-Alexis Eyraud [Mon, 18 Aug 2025 14:17:52 +0000 (16:17 +0200)]
drm/mediatek: mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
In mtk_hdmi driver, a recent change replaced custom register access
function calls by regmap ones, but two replacements by regmap_update_bits
were done incorrectly, because original offset and mask parameters were
inverted, so fix them.
Fixes:
d6e25b3590a0 ("drm/mediatek: hdmi: Use regmap instead of iomem for main registers")
Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com>
Reviewed-by: CK Hu <ck.hu@mediatek.com>
Link: https://patchwork.kernel.org/project/dri-devel/patch/20250818-mt8173-fix-hdmi-issue-v1-1-55aff9b0295d@collabora.com/
Signed-off-by: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Dave Airlie [Thu, 28 Aug 2025 23:05:16 +0000 (09:05 +1000)]
Merge tag 'drm-msm-fixes-2025-08-26' of https://gitlab.freedesktop.org/drm/msm into drm-fixes
Fixes for v6.17-rc4
Core/GPU:
- fix comment doc warning in gpuvm
- fix build with KMS disabled
- fix pgtable setup/teardown race
- global fault counter fix
- various error path fixes
- GPU devcoredump snapshot fixes
- handle in-place VM_BIND remaps to solve turnip vm update race
- skip re-emitting IBs for unusable VMs
- Don't use %pK through printk
- moved display snapshot init earlier, fixing a crash
DPU:
- Fixed crash in virtual plane checking code
- Fixed mode comparison in virtual plane checking code
DSI:
- Adjusted width of resulution-related registers
- Fixed locking issue on 14nm PLLs
UBWC (per Bjorn's ack)
- Added UBWC configuration for several missing platforms (fixing
regression)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <rob.clark@oss.qualcomm.com>
Link: https://lore.kernel.org/r/CACSVV02+u1VW1dzuz6JWwVEfpgTj6Y-JXMH+vX43KsKTVsW+Yg@mail.gmail.com
Linus Torvalds [Thu, 28 Aug 2025 23:04:14 +0000 (16:04 -0700)]
Merge tag 'dma-mapping-6.17-2025-08-28' of git://git./linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
- another small fix for arm64 systems with memory encryption (Shanker
Donthineni)
- fix for arm32 systems with non-standard CMA configuration (Oreoluwa
Babatunde)
* tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted
of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()
Dave Airlie [Thu, 28 Aug 2025 22:50:27 +0000 (08:50 +1000)]
Merge tag 'amd-drm-fixes-6.17-2025-08-28' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.17-2025-08-28:
amdgpu:
- UserQ fixes
- Revert CSA fix
- SR-IOV fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250828173904.75850-1-alexander.deucher@amd.com
Linus Torvalds [Thu, 28 Aug 2025 22:46:06 +0000 (15:46 -0700)]
Merge tag 'fixes-2025-08-28' of git://git./linux/kernel/git/rppt/memblock
Pull memblock fixes from Mike Rapoport:
- printk cleanups in memblock and numa_memblks
- update kernel-doc for MEMBLOCK_RSRV_NOINIT to be more accurate and
detailed
* tag 'fixes-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: fix kernel-doc for MEMBLOCK_RSRV_NOINIT
mm: numa,memblock: Use SZ_1M macro to denote bytes to MB conversion
mm/numa_memblks: Use pr_debug instead of printk(KERN_DEBUG)
Dave Airlie [Thu, 28 Aug 2025 22:44:10 +0000 (08:44 +1000)]
Merge tag 'drm-misc-fixes-2025-08-28' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
Several nouveau fixes to remove unused code, fix an error path and be
less restrictive with the formats it accepts. A fix for amdgpu to pin
vmapped dma-buf, and a revert for tegra for a regression in the dma-buf
/ GEM code.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://lore.kernel.org/r/20250828-hypersonic-colorful-squirrel-64f04b@houat
Linus Torvalds [Thu, 28 Aug 2025 22:39:06 +0000 (15:39 -0700)]
Merge tag 'powerpc-6.17-3' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan:
- Merge two CONFIG_POWERPC64_CPU entries in Kconfig.cputype
- Replace extra-y to always-y in Makefile
- Cleanup to use dev_fwnode helper
- Fix misleading comment in kvmppc_prepare_to_enter()
- misc cleanup and fixes
Thanks to Amit Machhiwal, Andrew Donnellan, Christophe Leroy, Gautam
Menghani, Jiri Slaby (SUSE), Masahiro Yamada, Shrikanth Hegde, Stephen
Rothwell, Venkat Rao Bagalkote, and Xichao Zhao
* tag 'powerpc-6.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/boot/install.sh: Fix shellcheck warnings
powerpc/prom_init: Fix shellcheck warnings
powerpc/kvm: Fix ifdef to remove build warning
powerpc: unify two CONFIG_POWERPC64_CPU entries in the same choice block
powerpc: use always-y instead of extra-y in Makefiles
powerpc/64: Drop unnecessary 'rc' variable
powerpc: Use dev_fwnode()
KVM: PPC: Fix misleading interrupts comment in kvmppc_prepare_to_enter()
Linus Torvalds [Thu, 28 Aug 2025 22:16:16 +0000 (15:16 -0700)]
MAINTAINERS: mark bcachefs externally maintained
As per many long discussion threads, public and private.
Signed-off-by: Linus Torvalds <torbalds@linux-foundation.org>
Dave Airlie [Thu, 28 Aug 2025 21:06:31 +0000 (07:06 +1000)]
Merge tag 'drm-xe-fixes-2025-08-27' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
- Fix user-fence race issue (Zbigniew)
- Couple xe_vm fixes (Thomas)
- Don't trigger rebind on initial dma-buf validation (Brost)
- Fix a build issue related to basename() posix vs gnu discrepancy (Carlos)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/aK8oalcIU-zQOfws@intel.com
Marc Zyngier [Sat, 9 Aug 2025 14:48:10 +0000 (15:48 +0100)]
KVM: arm64: nv: Fix ATS12 handling of single-stage translation
Volodymyr reports that using a Xen DomU as a nested guest (where
HCR_EL2.E2H == 0), ATS12 results in a translation that stops at
the L2's S1, which isn't something you'd normally expects.
Comparing the code against the spec proves to be illuminating,
and suggests that the author of such code must have been tired,
cross-eyed, drunk, or maybe all of the above.
The gist of it is that, apart from HCR_EL2.VM or HCR_EL2.DC being
0, only the use of the EL2&0 translation regime limits the walk
to S1 only, and that we must finish the S2 walk in any other case.
Which solves the above issue, as E2H==0 indicates that ATS12 walks
the EL1&0 translation regime.
Explicitly checking for EL2&0 fixes this.
Reported-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Fixes:
be04cebf3e788 ("KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}")
Link: https://lore.kernel.org/r/20250806141707.3479194-2-volodymyr_babchuk@epam.com
Link: https://lore.kernel.org/r/20250809144811.2314038-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Marc Zyngier [Sun, 17 Aug 2025 12:19:26 +0000 (13:19 +0100)]
KVM: arm64: Remove __vcpu_{read,write}_sys_reg_{from,to}_cpu()
There is no point having __vcpu_{read,write}_sys_reg_{from,to}_cpu()
exposed to the rest of the kernel, as the only callers are in
sys_regs.c.
Move them where they below, which is another opportunity to
simplify things a bit.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250817121926.217900-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Marc Zyngier [Sun, 17 Aug 2025 12:19:25 +0000 (13:19 +0100)]
KVM: arm64: Fix vcpu_{read,write}_sys_reg() accessors
Volodymyr reports (again!) that under some circumstances (E2H==0,
walking S1 PTs), PAR_EL1 doesn't report the value of the latest
walk in the CPU register, but that instead the value is written to
the backing store.
Further investigation indicates that the root cause of this is
that a group of registers (PAR_EL1, TPIDR*_EL{0,1}, the *32_EL2 dregs)
should always be considered as "on CPU", as they are not remapped
between EL1 and EL2.
We fail to treat them accordingly, and end-up considering that
the register (PAR_EL1 in this example) should be written to memory
instead of in the register.
While it would be possible to quickly work around it, it is obvious
that the way we track these things at the moment is pretty horrible,
and could do with some improvement.
Revamp the whole thing by:
- defining a location for a register (memory, cpu), potentially
depending on the state of the vcpu
- define a transformation for this register (mapped register, potential
translation, special register needing some particular attention)
- convey this information in a structure that can be easily passed
around
As a result, the accessors themselves become much simpler, as the
state is explicit instead of being driven by hard-to-understand
conventions.
We get rid of the "pure EL2 register" notion, which wasn't very
useful, and add sanitisation of the values by applying the RESx
masks as required, something that was missing so far.
And of course, we add the missing registers to the list, with the
indication that they are always loaded.
Reported-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Fixes:
fedc612314acf ("KVM: arm64: nv: Handle virtual EL2 registers in vcpu_read/write_sys_reg()")
Link: https://lore.kernel.org/r/20250806141707.3479194-3-volodymyr_babchuk@epam.com
Link: https://lore.kernel.org/r/20250817121926.217900-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Marc Zyngier [Sun, 17 Aug 2025 12:19:24 +0000 (13:19 +0100)]
KVM: arm64: Simplify sysreg access on exception delivery
Distinguishing between NV and VHE is slightly pointless, and only
serves as an extra complication, or a way to introduce bugs, such
as the way SPSR_EL1 gets written without checking for the state
being resident.
Get rid if this silly distinction, and fix the bug in one go.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250817121926.217900-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Marc Zyngier [Sun, 17 Aug 2025 12:19:23 +0000 (13:19 +0100)]
KVM: arm64: Check for SYSREGS_ON_CPU before accessing the 32bit state
Just like
c6e35dff58d3 ("KVM: arm64: Check for SYSREGS_ON_CPU before
accessing the CPU state") fixed the 64bit state access, add a check
for the 32bit state actually being on the CPU before writing it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250817121926.217900-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Coly Li [Thu, 28 Aug 2025 15:48:35 +0000 (23:48 +0800)]
bcache: change maintainer's email address
Change to my new email address on fnnas.com.
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://lore.kernel.org/r/20250828154835.32926-1-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 27 Aug 2025 12:16:00 +0000 (20:16 +0800)]
ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device
Add a new command line option --no_ublk_fixed_fd that excludes the ublk
control device (/dev/ublkcN) from io_uring's registered files array.
When this option is used, only backing files are registered starting
from index 1, while the ublk control device is accessed using its raw
file descriptor.
Add ublk_get_registered_fd() helper function that returns the appropriate
file descriptor for use with io_uring operations.
Key optimizations implemented:
- Cache UBLKS_Q_NO_UBLK_FIXED_FD flag in ublk_queue.flags to avoid
reading dev->no_ublk_fixed_fd in fast path
- Cache ublk char device fd in ublk_queue.ublk_fd for fast access
- Update ublk_get_registered_fd() to use ublk_queue * parameter
- Update io_uring_prep_buf_register/unregister() to use ublk_queue *
- Replace ublk_device * access with ublk_queue * access in fast paths
Also pass --no_ublk_fixed_fd to test_stress_04.sh for covering
plain ublk char device mode.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250827121602.2619736-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 27 Aug 2025 12:15:59 +0000 (20:15 +0800)]
ublk: avoid ublk_io_release() called after ublk char dev is closed
When running test_stress_04.sh, the following warning is triggered:
WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
This happens when the daemon is abruptly killed:
- some references may still be held, because registering IO buffer
doesn't grab ublk char device reference
OR
- io->task_registered_buffers won't be cleared because io buffer is
released from non-daemon context
For zero-copy and auto buffer register modes, I/O reference crosses
syscalls, so IO reference may not be dropped naturally when ublk server is
killed abruptly. However, when releasing io_uring context, it is guaranteed
that the reference is dropped finally, see io_sqe_buffers_unregister() from
io_ring_ctx_free().
Fix this by adding ublk_drain_io_references() that:
- Waits for active I/O references dropped in async way by scheduling
work function, for avoiding ublk dev and io_uring file's release
dependency
- Reinitializes io->ref and io->task_registered_buffers to clean state
This ensures the reference count state is clean when ublk_queue_reinit()
is called, preventing the warning and potential use-after-free.
Fixes:
1f6540e2aabb ("ublk: zc register/unregister bvec")
Fixes:
1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
Fixes:
8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250827121602.2619736-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 25 Aug 2025 11:15:00 +0000 (13:15 +0200)]
xfs: implement XFS_IOC_DIOINFO in terms of vfs_getattr
Use the direct I/O alignment reporting from ->getattr instead of
reimplementing it. This exposes the relaxation of the memory
alignment in the XFS_IOC_DIOINFO info and ensure the information will
stay in sync. Note that randholes.c in xfstests has a bug where it
incorrectly fails when the required memory alignment is smaller than the
pointer size. Round up the reported value as there is a fair chance that
this code got copied into various applications.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Andrey Albershteyn [Mon, 4 Aug 2025 12:08:16 +0000 (14:08 +0200)]
xfs: allow setting file attributes on special files
XFS does't have file attributes manipulation ioctls for special
files. Changing or reading file attributes is rejected for them in
xfs_fileattr_*et().
In XFS, this is necessary to work for project quota directories.
When project is set up, xfs_quota opens and calls FS_IOC_SETFSXATTR on
every inode in the directory. However, special files are skipped due to
open() returning a special inode for them. So, they don't even get to
this check.
The recently added file_getattr/file_setattr will call xfs_fileattr_*et,
on special files. This patch allows reading/changing file attributes on
special files.
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Andrey Albershteyn [Mon, 4 Aug 2025 12:08:15 +0000 (14:08 +0200)]
xfs: add .fileattr_set and fileattr_get callbacks for symlinks
As there are now file_getattr() and file_setattr(), xfs_quota will
call them on special files. These new syscalls call ->fileattr_get/set.
Symlink inodes don't have callbacks to set file attributes. This
patch adds them. The attribute values combinations are checked in
fileattr_set_prepare().
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Andrey Albershteyn [Mon, 4 Aug 2025 12:08:14 +0000 (14:08 +0200)]
xfs: allow renames of project-less inodes
Special file inodes cannot have project ID set from userspace and
are skipped during initial project setup. Those inodes are left
project-less in the project directory. New inodes created after
project initialization do have an ID. Creating hard links or
renaming those project-less inodes then fails on different ID check.
In commit
e23d7e82b707 ("xfs: allow cross-linking special files
without project quota"), we relaxed the project id checks to
allow hardlinking special files with differing project ids since the
projid cannot be changed. Apply the same workaround for renaming
operations.
Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Jens Axboe [Wed, 27 Aug 2025 21:27:30 +0000 (15:27 -0600)]
io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths
Since the buffers are mapped from userspace, it is prudent to use
READ_ONCE() to read the value into a local variable, and use that for
any other actions taken. Having a stable read of the buffer length
avoids worrying about it changing after checking, or being read multiple
times.
Similarly, the buffer may well change in between it being picked and
being committed. Ensure the looping for incremental ring buffer commit
stops if it hits a zero sized buffer, as no further progress can be made
at that point.
Fixes:
ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://lore.kernel.org/io-uring/tencent_000C02641F6250C856D0C26228DE29A3D30A@qq.com/
Reported-by: Qingyue Zhang <chunzhennn@qq.com>
Reported-by: Suoxing Zhang <aftern00n@qq.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Oscar Maes [Wed, 27 Aug 2025 06:23:21 +0000 (08:23 +0200)]
net: ipv4: fix regression in local-broadcast routes
Commit
9e30ecf23b1b ("net: ipv4: fix incorrect MTU in broadcast routes")
introduced a regression where local-broadcast packets would have their
gateway set in __mkroute_output, which was caused by fi = NULL being
removed.
Fix this by resetting the fib_info for local-broadcast packets. This
preserves the intended changes for directed-broadcast packets.
Cc: stable@vger.kernel.org
Fixes:
9e30ecf23b1b ("net: ipv4: fix incorrect MTU in broadcast routes")
Reported-by: Brett A C Sheffield <bacs@librecast.net>
Closes: https://lore.kernel.org/regressions/
20250822165231.4353-4-bacs@librecast.net
Signed-off-by: Oscar Maes <oscmaes92@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250827062322.4807-1-oscmaes92@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Neil Mandir [Tue, 26 Aug 2025 14:30:22 +0000 (10:30 -0400)]
net: macb: Disable clocks once
When the driver is removed the clocks are disabled twice: once in
macb_remove and a second time by runtime pm. Disable wakeup in remove so
all the clocks are disabled and skip the second call to macb_clks_disable.
Always suspend the device as we always set it active in probe.
Fixes:
d54f89af6cc4 ("net: macb: Add pm runtime support")
Signed-off-by: Neil Mandir <neil.mandir@seco.com>
Co-developed-by: Sean Anderson <sean.anderson@linux.dev>
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20250826143022.935521-1-sean.anderson@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>