Babu Moger [Tue, 16 Sep 2025 17:25:49 +0000 (12:25 -0500)]
fs/resctrl: Fix counter auto-assignment on mkdir with mbm_event enabled
rdt_resource::resctrl_mon::mbm_assign_on_mkdir determines if a counter will
automatically be assigned to an RMID, MBM event pair when its associated
monitor group is created via mkdir.
Testing shows that counters are always automatically assigned to new monitor
groups, whether mbm_assign_on_mkdir is set or not.
To support automatic counter assignment the check for mbm_assign_on_mkdir
should be in rdtgroup_assign_cntrs() that assigns counters during monitor
group creation. Instead, the check for mbm_assign_on_mkdir is in
rdtgroup_unassign_cntrs() that is called on monitor group deletion from where
counters should always be unassigned, whether mbm_assign_on_mkdir is set or
not.
Fix automatic counter assignment by moving the mbm_assign_on_mkdir check from
rdtgroup_unassign_cntrs() to rdtgroup_assign_cntrs().
[ bp: Replace commit message with Reinette's version. ]
Fixes:
ef712fe97ec57 ("fs/resctrl: Auto assign counters on mkdir and clean up on group removal")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Babu Moger [Fri, 5 Sep 2025 21:34:32 +0000 (16:34 -0500)]
MAINTAINERS: resctrl: Add myself as reviewer
I have been contributing to resctrl for sometime now and I would like to
help with code reviews as well.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:31 +0000 (16:34 -0500)]
x86/resctrl: Configure mbm_event mode if supported
Configure mbm_event mode on AMD platforms. On AMD platforms, it is
recommended to use the mbm_event mode, if supported, to prevent the
hardware from resetting counters between reads. This can result in
misleading values or display "Unavailable" if no counter is assigned
to the event.
Enable mbm_event mode, known as ABMC (Assignable Bandwidth Monitoring
Counters) on AMD, by default if the system supports it.
Update ABMC across all logical processors within the resctrl domain to
ensure proper functionality.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:30 +0000 (16:34 -0500)]
fs/resctrl: Introduce the interface to switch between monitor modes
Resctrl subsystem can support two monitoring modes, "mbm_event" or "default".
In mbm_event mode, monitoring event can only accumulate data while it is
backed by a hardware counter. In "default" mode, resctrl assumes there is
a hardware counter for each event within every CTRL_MON and MON group.
Introduce mbm_assign_mode resctrl file to switch between mbm_event and default
modes.
Example:
To list the MBM monitor modes supported:
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
[mbm_event]
default
To enable the "mbm_event" counter assignment mode:
$ echo "mbm_event" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
To enable the "default" monitoring mode:
$ echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
Reset MBM event counters automatically as part of changing the mode. Clear
both architectural and non-architectural event states to prevent overflow
conditions during the next event read. Clear assignable counter configuration
on all the domains. Also, enable auto assignment when switching to "mbm_event"
mode.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:29 +0000 (16:34 -0500)]
fs/resctrl: Disable BMEC event configuration when mbm_event mode is enabled
The BMEC (Bandwidth Monitoring Event Configuration) feature enables per-domain
event configuration. With BMEC the MBM events are configured using the
mbm_total_bytes_config or mbm_local_bytes_config files in
/sys/fs/resctrl/info/L3_MON/
and the per-domain event configuration affects all monitor resource groups.
The mbm_event counter assignment mode enables counters to be assigned to RMID
(i.e. a monitor resource group), event pairs, with potentially unique event
configurations associated with every counter.
There may be systems that support both BMEC and mbm_event counter assignment
mode, but resctrl supporting both concurrently will present a conflicting
interface to the user with both per-domain and per RMID, event configurations
active at the same time.
The mbm_event counter assignment provides most flexibility to user space
and aligns with Arm's counter support. On systems that support both,
disable BMEC event configuration when mbm_event mode is enabled by hiding
the mbm_total_bytes_config or mbm_local_bytes_config files when mbm_event
mode is enabled. Ensure mon_features always displays accurate information
about monitor features.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:28 +0000 (16:34 -0500)]
fs/resctrl: Introduce the interface to modify assignments in a group
Enable the mbm_l3_assignments resctrl file to be used to modify counter
assignments of CTRL_MON and MON groups when the "mbm_event" counter
assignment mode is enabled.
Process the assignment modifications in the following format:
<Event>:<Domain id>=<Assignment state>;<Domain id>=<Assignment state>
Event: A valid MBM event in the
/sys/fs/resctrl/info/L3_MON/event_configs directory.
Domain ID: A valid domain ID. When writing, '*' applies the changes
to all domains.
Assignment states:
_ : Unassign a counter.
e : Assign a counter exclusively.
Examples:
$ cd /sys/fs/resctrl
$ cat /sys/fs/resctrl/mbm_L3_assignments
mbm_total_bytes:0=e;1=e
mbm_local_bytes:0=e;1=e
To unassign the counter associated with the mbm_total_bytes event on
domain 0:
$ echo "mbm_total_bytes:0=_" > mbm_L3_assignments
$ cat /sys/fs/resctrl/mbm_L3_assignments
mbm_total_bytes:0=_;1=e
mbm_local_bytes:0=e;1=e
To unassign the counter associated with the mbm_total_bytes event on
all the domains:
$ echo "mbm_total_bytes:*=_" > mbm_L3_assignments
$ cat /sys/fs/resctrl/mbm_L3_assignments
mbm_total_bytes:0=_;1=_
mbm_local_bytes:0=e;1=e
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:27 +0000 (16:34 -0500)]
fs/resctrl: Introduce mbm_L3_assignments to list assignments in a group
Introduce the mbm_L3_assignments resctrl file associated with CTRL_MON and MON
resource groups to display the counter assignment states of the resource group
when "mbm_event" counter assignment mode is enabled.
Display the list in the following format:
<Event>:<Domain id>=<Assignment state>;<Domain id>=<Assignment state>
Event: A valid MBM event listed in
/sys/fs/resctrl/info/L3_MON/event_configs directory.
Domain ID: A valid domain ID.
The assignment state can be one of the following:
_ : No counter assigned.
e : Counter assigned exclusively.
Example:
To list the assignment states for the default group
$ cd /sys/fs/resctrl
$ cat /sys/fs/resctrl/mbm_L3_assignments
mbm_total_bytes:0=e;1=e
mbm_local_bytes:0=e;1=e
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:26 +0000 (16:34 -0500)]
fs/resctrl: Auto assign counters on mkdir and clean up on group removal
Resctrl provides a user-configurable option mbm_assign_on_mkdir that
determines if a counter will automatically be assigned to an RMID, event pair
when its associated monitor group is created via mkdir.
Enable mbm_assign_on_mkdir by default to automatically assign counters to
the two default events (MBM total and MBM local) of a new monitoring group
created via mkdir. This maintains backward compatibility with original
resctrl support for these two events.
Unassign and free counters belonging to a monitoring group when the group
is deleted.
Monitor group creation does not fail if a counter cannot be assigned to one or
both events. There may be limited counters and users have the flexibility to
modify counter assignments at a later time. Log the error message "Failed to
allocate counter for <event> in domain <id>" in
/sys/fs/resctrl/info/last_cmd_status when a new monitoring group is created
but counter assignment failed.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:25 +0000 (16:34 -0500)]
fs/resctrl: Introduce mbm_assign_on_mkdir to enable assignments on mkdir
The "mbm_event" counter assignment mode allows users to assign a hardware
counter to an RMID, event pair and monitor the bandwidth as long as it is
assigned.
Introduce a user-configurable option that determines if a counter will
automatically be assigned to an RMID, event pair when its associated
monitor group is created via mkdir. Accessible when "mbm_event" counter
assignment mode is enabled.
Suggested-by: Peter Newman <peternewman@google.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:24 +0000 (16:34 -0500)]
fs/resctrl: Provide interface to update the event configurations
When "mbm_event" counter assignment mode is enabled, users can modify the
event configuration by writing to the 'event_filter' resctrl file. The event
configurations for mbm_event mode are located in
/sys/fs/resctrl/info/L3_MON/event_configs/.
Update the assignments of all CTRL_MON and MON resource groups when the event
configuration is modified.
Example:
$ mount -t resctrl resctrl /sys/fs/resctrl
$ cd /sys/fs/resctrl/
$ cat info/L3_MON/event_configs/mbm_local_bytes/event_filter
local_reads,local_non_temporal_writes,local_reads_slow_memory
$ echo "local_reads,local_non_temporal_writes" >
info/L3_MON/event_configs/mbm_total_bytes/event_filter
$ cat info/L3_MON/event_configs/mbm_total_bytes/event_filter
local_reads,local_non_temporal_writes
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:23 +0000 (16:34 -0500)]
fs/resctrl: Add event configuration directory under info/L3_MON/
The "mbm_event" counter assignment mode allows the user to assign a hardware
counter to an RMID, event pair and monitor the bandwidth as long as it is
assigned. The user can specify the memory transaction(s) for the counter to
track.
When this mode is supported, the /sys/fs/resctrl/info/L3_MON/event_configs
directory contains a sub-directory for each MBM event that can be assigned to
a counter. The MBM event sub-directory contains a file named "event_filter"
that is used to view and modify which memory transactions the MBM event is
configured with.
Create /sys/fs/resctrl/info/L3_MON/event_configs directory on resctrl mount
and pre-populate it with directories for the two existing MBM events:
mbm_total_bytes and mbm_local_bytes. Create the "event_filter" file within
each MBM event directory with the needed *show() that displays the memory
transactions with which the MBM event is configured.
Example:
$ mount -t resctrl resctrl /sys/fs/resctrl
$ cd /sys/fs/resctrl/
$ cat info/L3_MON/event_configs/mbm_total_bytes/event_filter
local_reads,remote_reads,local_non_temporal_writes,
remote_non_temporal_writes,local_reads_slow_memory,
remote_reads_slow_memory,dirty_victim_writes_all
$ cat info/L3_MON/event_configs/mbm_local_bytes/event_filter
local_reads,local_non_temporal_writes,local_reads_slow_memory
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:22 +0000 (16:34 -0500)]
fs/resctrl: Support counter read/reset with mbm_event assignment mode
When "mbm_event" counter assignment mode is enabled, the architecture requires
a counter ID to read the event data.
Introduce an is_mbm_cntr field in struct rmid_read to indicate whether counter
assignment mode is in use.
Update the logic to call resctrl_arch_cntr_read() and resctrl_arch_reset_cntr()
when the assignment mode is active. Report 'Unassigned' in case the user attempts
to read an event without assigning a hardware counter.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:21 +0000 (16:34 -0500)]
x86/resctrl: Implement resctrl_arch_reset_cntr() and resctrl_arch_cntr_read()
System software reads resctrl event data for a particular resource by writing
the RMID and Event Identifier (EvtID) to the QM_EVTSEL register and then
reading the event data from the QM_CTR register.
In ABMC mode, the event data of a specific counter ID is read by setting the
following fields: QM_EVTSEL.ExtendedEvtID = 1, QM_EVTSEL.EvtID = L3CacheABMC
(=1) and setting QM_EVTSEL.RMID to the desired counter ID. Reading the QM_CTR
then returns the contents of the specified counter ID. RMID_VAL_ERROR bit is
set if the counter configuration is invalid, or if an invalid counter ID is
set in the QM_EVTSEL.RMID field. RMID_VAL_UNAVAIL bit is set if the counter
data is unavailable.
Introduce resctrl_arch_reset_cntr() and resctrl_arch_cntr_read() to reset and
read event data for a specific counter.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:20 +0000 (16:34 -0500)]
x86/resctrl: Refactor resctrl_arch_rmid_read()
resctrl_arch_rmid_read() adjusts the value obtained from MSR_IA32_QM_CTR to
account for the overflow for MBM events and apply counter scaling for all the
events. This logic is common to both reading an RMID and reading a hardware
counter directly.
Refactor the hardware value adjustment logic into get_corrected_val() to
prepare for support of reading a hardware counter.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:19 +0000 (16:34 -0500)]
fs/resctrl: Introduce counter ID read, reset calls in mbm_event mode
When supported, "mbm_event" counter assignment mode allows users to assign
a hardware counter to an RMID, event pair and monitor the bandwidth usage as
long as it is assigned. The hardware continues to track the assigned counter
until it is explicitly unassigned by the user.
Introduce the architecture calls resctrl_arch_cntr_read() and
resctrl_arch_reset_cntr() to read and reset event counters when "mbm_event"
mode is supported. Function names match existing resctrl_arch_rmid_read() and
resctrl_arch_reset_rmid().
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:18 +0000 (16:34 -0500)]
fs/resctrl: Pass struct rdtgroup instead of individual members
Reading monitoring data for a monitoring group requires both the RMID and
CLOSID. The RMID and CLOSID are members of struct rdtgroup but passed
separately to several functions involved in retrieving event data.
When "mbm_event" counter assignment mode is enabled, a counter ID is required
to read event data. The counter ID is obtained through mbm_cntr_get(), which
expects a struct rdtgroup pointer.
Provide a pointer to the struct rdtgroup as parameter to functions involved in
retrieving event data to simplify access to RMID, CLOSID, and counter ID.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:17 +0000 (16:34 -0500)]
fs/resctrl: Add the functionality to unassign MBM events
The "mbm_event" counter assignment mode offers "num_mbm_cntrs" number of
counters that can be assigned to RMID, event pairs and monitor bandwidth usage
as long as it is assigned. If all the counters are in use, the kernel logs the
error message "Failed to allocate counter for <event> in domain <id>" in
/sys/fs/resctrl/info/last_cmd_status when a new assignment is requested.
To make space for a new assignment, users must unassign an already assigned
counter and retry the assignment again.
Add the functionality to unassign and free the counters in the domain. Also,
add the helper rdtgroup_unassign_cntrs() to unassign counters in the group.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:16 +0000 (16:34 -0500)]
fs/resctrl: Add the functionality to assign MBM events
When supported, "mbm_event" counter assignment mode offers "num_mbm_cntrs"
number of counters that can be assigned to RMID, event pairs and monitor
bandwidth usage as long as it is assigned.
Add the functionality to allocate and assign a counter to an RMID, event
pair in the domain. Also, add the helper rdtgroup_assign_cntrs() to assign
counters in the group.
Log the error message "Failed to allocate counter for <event> in domain
<id>" in /sys/fs/resctrl/info/last_cmd_status if all the counters are in
use. Exit on the first failure when assigning counters across all the
domains.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:15 +0000 (16:34 -0500)]
x86,fs/resctrl: Implement resctrl_arch_config_cntr() to assign a counter with ABMC
The ABMC feature allows users to assign a hardware counter to an RMID,
event pair and monitor bandwidth usage as long as it is assigned. The
hardware continues to track the assigned counter until it is explicitly
unassigned by the user.
Implement an x86 architecture-specific handler to configure a counter. This
architecture specific handler is called by resctrl fs when a counter is
assigned or unassigned as well as when an already assigned counter's
configuration should be updated. Configure counters by writing to the
L3_QOS_ABMC_CFG MSR, specifying the counter ID, bandwidth source (RMID),
and event configuration.
The ABMC feature details are documented in APM [1] available from [2].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Babu Moger [Fri, 5 Sep 2025 21:34:14 +0000 (16:34 -0500)]
fs/resctrl: Introduce event configuration field in struct mon_evt
When supported, mbm_event counter assignment mode allows the user to configure
events to track specific types of memory transactions.
Introduce an evt_cfg field in struct mon_evt to define the type of memory
transactions tracked by a monitoring event. Also add a helper function to get
the evt_cfg value.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:13 +0000 (16:34 -0500)]
x86/resctrl: Add data structures and definitions for ABMC assignment
The ABMC feature allows users to assign a hardware counter to an RMID,
event pair and monitor bandwidth usage as long as it is assigned. The
hardware continues to track the assigned counter until it is explicitly
unassigned by the user.
The ABMC feature implements an MSR L3_QOS_ABMC_CFG (C000_03FDh).
ABMC counter assignment is done by setting the counter id, bandwidth
source (RMID) and bandwidth configuration.
Attempts to read or write the MSR when ABMC is not enabled will result
in a #GP(0) exception.
Introduce the data structures and definitions for MSR L3_QOS_ABMC_CFG
(0xC000_03FDh):
=========================================================================
Bits Mnemonic Description Access Reset
Type Value
=========================================================================
63 CfgEn Configuration Enable R/W 0
62 CtrEn Enable/disable counting R/W 0
61:53 – Reserved MBZ 0
52:48 CtrID Counter Identifier R/W 0
47 IsCOS BwSrc field is a CLOSID R/W 0
(not an RMID)
46:44 – Reserved MBZ 0
43:32 BwSrc Bandwidth Source R/W 0
(RMID or CLOSID)
31:0 BwType Bandwidth configuration R/W 0
tracked by the CtrID
==========================================================================
The ABMC feature details are documented in APM [1] available from [2].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).
[ bp: Touchups. ]
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Babu Moger [Fri, 5 Sep 2025 21:34:12 +0000 (16:34 -0500)]
fs/resctrl: Introduce interface to display number of free MBM counters
Introduce the "available_mbm_cntrs" resctrl file to display the number of
counters available for assignment in each domain when "mbm_event" mode is
enabled.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:11 +0000 (16:34 -0500)]
fs/resctrl: Introduce mbm_cntr_cfg to track assignable counters per domain
The "mbm_event" counter assignment mode allows users to assign a hardware
counter to an RMID, event pair and monitor bandwidth usage as long as it is
assigned. The hardware continues to track the assigned counter until it is
explicitly unassigned by the user. Counters are assigned/unassigned at
monitoring domain level.
Manage a monitoring domain's hardware counters using a per monitoring
domain array of struct mbm_cntr_cfg that is indexed by the hardware
counter ID. A hardware counter's configuration contains the MBM event
ID and points to the monitoring group that it is assigned to, with a NULL
pointer meaning that the hardware counter is available for assignment.
There is no direct way to determine which hardware counters are assigned
to a particular monitoring group. Check every entry of every hardware
counter configuration array in every monitoring domain to query which
MBM events of a monitoring group is tracked by hardware. Such queries are
acceptable because of a very small number of assignable counters (32
to 64).
Suggested-by: Peter Newman <peternewman@google.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:10 +0000 (16:34 -0500)]
fs/resctrl: Add resctrl file to display number of assignable counters
The "mbm_event" counter assignment mode allows users to assign a hardware
counter to an RMID, event pair and monitor bandwidth usage as long as it is
assigned. The hardware continues to track the assigned counter until it is
explicitly unassigned by the user.
Create 'num_mbm_cntrs' resctrl file that displays the number of counters
supported in each domain. 'num_mbm_cntrs' is only visible to user space when
the system supports "mbm_event" mode.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:09 +0000 (16:34 -0500)]
fs/resctrl: Introduce the interface to display monitoring modes
Introduce the resctrl file "mbm_assign_mode" to list the supported counter
assignment modes.
The "mbm_event" counter assignment mode allows users to assign a hardware
counter to an RMID, event pair and monitor bandwidth usage as long as it is
assigned. The hardware continues to track the assigned counter until it is
explicitly unassigned by the user. Each event within a resctrl group can be
assigned independently in this mode.
On AMD systems "mbm_event" mode is backed by the ABMC (Assignable Bandwidth
Monitoring Counters) hardware feature and is enabled by default.
The "default" mode is the existing mode that works without the explicit
counter assignment, instead relying on dynamic counter assignment by hardware
that may result in hardware not dedicating a counter resulting in monitoring
data reads returning "Unavailable".
Provide an interface to display the monitor modes on the system.
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
[mbm_event]
default
Add IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED) check to support Arm64.
On x86, CONFIG_RESCTRL_ASSIGN_FIXED is not defined. On Arm64, it will be
defined when the "mbm_event" mode is supported.
Add IS_ENABLED(CONFIG_RESCTRL_ASSIGN_FIXED) check early to ensure the user
interface remains compatible with upcoming Arm64 support. IS_ENABLED() safely
evaluates to 0 when the configuration is not defined.
As a result, for MPAM, the display would be either:
[default]
or
[mbm_event]
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:08 +0000 (16:34 -0500)]
x86/resctrl: Add support to enable/disable AMD ABMC feature
Add the functionality to enable/disable the AMD ABMC feature.
The AMD ABMC feature is enabled by setting enabled bit(0) in the
L3_QOS_EXT_CFG MSR. When the state of ABMC is changed, the MSR needs to be
updated on all the logical processors in the QOS Domain.
Hardware counters will reset when ABMC state is changed.
[ bp: Massage commit message. ]
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Babu Moger [Fri, 5 Sep 2025 21:34:07 +0000 (16:34 -0500)]
x86,fs/resctrl: Detect Assignable Bandwidth Monitoring feature details
ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5.
Bits Description
15:0 MAX_ABMC Maximum Supported Assignable Bandwidth
Monitoring Counter ID + 1
The ABMC feature details are documented in APM [1] available from [2].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).
Detect the feature and number of assignable counters supported. For backward
compatibility, upon detecting the assignable counter feature, enable the
mbm_total_bytes and mbm_local_bytes events that users are familiar with as
part of original L3 MBM support.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Babu Moger [Fri, 5 Sep 2025 21:34:06 +0000 (16:34 -0500)]
x86,fs/resctrl: Consolidate monitoring related data from rdt_resource
The cache allocation and memory bandwidth allocation feature properties are
consolidated into struct resctrl_cache and struct resctrl_membw respectively.
In preparation for more monitoring properties that will clobber the existing
resource struct more, re-organize the monitoring specific properties to also
be in a separate structure.
Also convert "bandwidth sources" terminology to "memory transactions" to have
consistency within resctrl for related monitoring features.
[ bp: Massage commit message. ]
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:05 +0000 (16:34 -0500)]
x86/resctrl: Add ABMC feature in the command line options
Add a kernel command-line parameter to enable or disable the exposure of
the ABMC (Assignable Bandwidth Monitoring Counters) hardware feature to
resctrl.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Babu Moger [Fri, 5 Sep 2025 21:34:04 +0000 (16:34 -0500)]
x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC)
Users can create as many monitor groups as RMIDs supported by the hardware.
However, the bandwidth monitoring feature on AMD only guarantees that RMIDs
currently assigned to a processor will be tracked by hardware. The counters of
any other RMIDs which are no longer being tracked will be reset to zero.
The MBM event counters return "Unavailable" for the RMIDs that are not tracked
by hardware. So, there can be only limited number of groups that can give
guaranteed monitoring numbers. With ever changing configurations there is no
way to definitely know which of these groups are being tracked during
a particular time. Users do not have the option to monitor a group or set of
groups for a certain period of time without worrying about RMID being reset in
between.
The ABMC feature allows users to assign a hardware counter to an RMID, event
pair and monitor bandwidth usage as long as it is assigned. The hardware
continues to track the assigned counter until it is explicitly unassigned by
the user. There is no need to worry about counters being reset during this
period. Additionally, the user can specify the type of memory transactions
(e.g., reads, writes) for the counter to track.
Without ABMC enabled, monitoring will work in current mode without assignment
option.
The Linux resctrl subsystem provides an interface that allows monitoring of up
to two memory bandwidth events per group, selected from a combination of
available total and local events. When ABMC is enabled, two events will be
assigned to each group by default, in line with the current interface design.
Users will also have the option to configure which types of memory
transactions are counted by these events.
Due to the limited number of available counters (32), users may quickly
exhaust the available counters. If the system runs out of assignable ABMC
counters, the kernel will report an error. In such cases, users will need to
unassign one or more active counters to free up counters for new assignments.
resctrl will provide options to assign or unassign events through the
group-specific interface file.
The feature is detected via CPUID_Fn80000020_EBX_x00 bit 5: ABMC (Assignable
Bandwidth Monitoring Counters).
The ABMC feature details are documented in APM [1] available from [2]. [1]
AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC).
[ bp: Massage commit message, fixup enumeration due to VMSCAPE ]
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Tony Luck [Fri, 5 Sep 2025 21:34:03 +0000 (16:34 -0500)]
x86,fs/resctrl: Prepare for more monitor events
There's a rule in computer programming that objects appear zero, once, or many
times. So code accordingly.
There are two MBM events and resctrl is coded with a lot of
if (local)
do one thing
if (total)
do a different thing
Change the rdt_mon_domain and rdt_hw_mon_domain structures to hold arrays of
pointers to per event data instead of explicit fields for total and local
bandwidth.
Simplify by coding for many events using loops on which are enabled.
Move resctrl_is_mbm_event() to <linux/resctrl.h> so it can be used more
widely. Also provide a for_each_mbm_event_id() helper macro.
Cleanup variable names in functions touched to consistently use "eventid" for
those with type enum resctrl_event_id.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Tony Luck [Fri, 5 Sep 2025 21:34:02 +0000 (16:34 -0500)]
x86/resctrl: Remove the rdt_mon_features global variable
rdt_mon_features is used as a bitmask of enabled monitor events. A monitor
event's status is now maintained in mon_evt::enabled with all monitor events'
mon_evt structures found in the filesystem's mon_event_all[] array.
Remove the remaining uses of rdt_mon_features.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Tony Luck [Fri, 5 Sep 2025 21:34:01 +0000 (16:34 -0500)]
x86,fs/resctrl: Replace architecture event enabled checks
The resctrl file system now has complete knowledge of the status of every
event. So there is no need for per-event function calls to check.
Replace each of the resctrl_arch_is_{event}enabled() calls with
resctrl_is_mon_event_enabled(QOS_{EVENT}).
No functional change.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Tony Luck [Fri, 5 Sep 2025 21:34:00 +0000 (16:34 -0500)]
x86,fs/resctrl: Consolidate monitor event descriptions
There are currently only three monitor events, all associated with the
RDT_RESOURCE_L3 resource. Growing support for additional events will be easier
with some restructuring to have a single point in file system code where all
attributes of all events are defined.
Place all event descriptions into an array mon_event_all[]. Doing this has the
beneficial side effect of removing the need for rdt_resource::evt_list.
Add resctrl_event_id::QOS_FIRST_EVENT for a lower bound on range checks for
event ids and as the starting index to scan mon_event_all[].
Drop the code that builds evt_list and change the two places where the list is
scanned to scan mon_event_all[] instead using a new helper macro
for_each_mon_event().
Architecture code now informs file system code which events are available with
resctrl_enable_mon_event().
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
Shaopeng Tan [Mon, 23 Jun 2025 07:50:50 +0000 (16:50 +0900)]
fs/resctrl: Optimize code in rdt_get_tree()
schemata_list_destroy() has to be called if schemata_list_create() fails.
rdt_get_tree() calls schemata_list_destroy() in two different ways:
directly if schemata_list_create() itself fails and
on the exit path via the out_schemata_free goto label.
Remove schemata_list_destroy() call on schemata_list_create() failure.
Use existing out_schemata_free goto label instead.
Signed-off-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Koba Ko <kobak@nvidia.com>
Link: https://lore.kernel.org/20250623075051.3610592-1-tan.shaopeng@jp.fujitsu.com
Linus Torvalds [Sun, 14 Sep 2025 21:21:14 +0000 (14:21 -0700)]
Linux 6.17-rc6
Linus Torvalds [Sun, 14 Sep 2025 20:17:30 +0000 (13:17 -0700)]
Merge tag 'phy-fix-6.17' of git://git./linux/kernel/git/phy/linux-phy
Pull generic phy driver fixes from Vinod Koul:
- Qualcomm repeater override properties, qmp pcie bindings fix for
clocks and initialization sequence for firmware power down case
- Marvell comphy bindings clock and child node constraints
- Tegra xusb device reference leaks fix
- TI omap usb device ref leak on unbind and RGMII IS settings fix
* tag 'phy-fix-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
phy: qcom: qmp-pcie: Fix PHY initialization when powered down by firmware
phy: ti: gmii-sel: Always write the RGMII ID setting
dt-bindings: phy: qcom,sc8280xp-qmp-pcie-phy: Update pcie phy bindings
phy: ti-pipe3: fix device leak at unbind
phy: ti: omap-usb2: fix device leak at unbind
phy: tegra: xusb: fix device and OF node leak at probe
dt-bindings: phy: marvell,comphy-cp110: Fix clock and child node constraints
phy: qualcomm: phy-qcom-eusb2-repeater: fix override properties
Linus Torvalds [Sun, 14 Sep 2025 20:06:06 +0000 (13:06 -0700)]
Merge tag 'dmaengine-fix-6.17' of git://git./linux/kernel/git/vkoul/dmaengine
Pull dmaengine fixes from Vinod Koul:
- Intel idxd fixes for idxd_free() handling, refcount underflow on
module unload, double free in idxd_setup_wqs()
- Qualcomm bam dma missing properties and handing for channels with ees
- dw device reference leak in rzn1_dmamux_route_allocate()
* tag 'dmaengine-fix-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
dmaengine: dw: dmamux: Fix device reference leak in rzn1_dmamux_route_allocate
dmaengine: ti: edma: Fix memory allocation size for queue_priority_map
dmaengine: idxd: Fix double free in idxd_setup_wqs()
dmaengine: idxd: Fix refcount underflow on module unload
dmaengine: idxd: Remove improper idxd_free
dmaengine: qcom: bam_dma: Fix DT error handling for num-channels/ees
dt-bindings: dma: qcom: bam-dma: Add missing required properties
Linus Torvalds [Sun, 14 Sep 2025 17:54:54 +0000 (10:54 -0700)]
Merge tag 'tty-6.17-rc6' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some small tty and serial driver fixes for 6.17-rc6 that
resolve some reported problems. Included in here are:
- 8250 driver dt bindings fixes
- broadcom serial driver binding fixes
- hvc_console bugfix
- xilinx serial driver bugfix
- sc16is7xx serial driver bugfix
All of these have been in linux-next for the past week with no
reported issues"
* tag 'tty-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: xilinx_uartps: read reg size from DTS
tty: hvc_console: Call hvc_kick in hvc_write unconditionally
dt-bindings: serial: 8250: allow "main" and "uart" as clock names
dt-bindings: serial: 8250: move a constraint
dt-bindings: serial: brcm,bcm7271-uart: Constrain clocks
serial: sc16is7xx: fix bug in flow control levels init
Linus Torvalds [Sun, 14 Sep 2025 17:28:15 +0000 (10:28 -0700)]
Merge tag 'usb-6.17-rc6' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes and new device ids for 6.17-rc6.
Included in here are:
- new usb-serial driver device ids
- dummy-hcd locking bugfix for rt-enabled systems (which is crazy,
but people have odd testing requirements at times...)
- xhci driver bugfixes for reported issues
- typec driver bugfix
- midi2 gadget driver bugfixes
- usb core sysfs file regression fix from -rc1
All of these, except for the last usb sysfs file fix, have been in
linux-next with no reported issues. The sysfs fix was added to the
tree on Friday, and is "obviously correct" and should not have any
problems either, it just didn't have any time for linux-next to pick
up (0-day had no problems with it)"
* tag 'usb-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: core: remove the move buf action
usb: gadget: midi2: Fix MIDI2 IN EP max packet size
usb: gadget: midi2: Fix missing UMP group attributes initialization
usb: typec: tcpm: properly deliver cable vdms to altmode drivers
USB: gadget: dummy-hcd: Fix locking bug in RT-enabled kernels
xhci: fix memory leak regression when freeing xhci vdev devices depth first
xhci: dbc: Fix full DbC transfer ring after several reconnects
xhci: dbc: decouple endpoint allocation from initialization
USB: serial: option: add Telit Cinterion LE910C4-WWX new compositions
USB: serial: option: add Telit Cinterion FN990A w/audio compositions
Linus Torvalds [Sun, 14 Sep 2025 15:39:48 +0000 (08:39 -0700)]
Merge tag 'x86-urgent-2025-09-14' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Fix a CPU topology parsing bug on AMD guests, and address
a lockdep warning in the resctrl filesystem"
* tag 'x86-urgent-2025-09-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fs/resctrl: Eliminate false positive lockdep warning when reading SNC counters
x86/cpu/topology: Always try cpu_parse_topology_ext() on AMD/Hygon
Linus Torvalds [Sun, 14 Sep 2025 15:38:05 +0000 (08:38 -0700)]
Merge tag 'timers-urgent-2025-09-14' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix a lost-timeout CPU hotplug bug in the hrtimer code, which can
trigger with certain hardware configs and regular HZ"
* tag 'timers-urgent-2025-09-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimers: Unconditionally update target CPU base after offline timer migration
Linus Torvalds [Sun, 14 Sep 2025 15:09:37 +0000 (08:09 -0700)]
Merge tag 'input-for-v6.17-rc5' of git://git./linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- a quirk to i8042 for yet another TUXEDO laptop
- a fix to mtk-pmic-keys driver to properly handle MT6359
- a fix to iqs7222 driver to only enable proximity interrupt
if it is mapped to a key or a switch event
- an update to xpad controller driver to recognize Flydigi Apex 5
controller
- an update to maintainers file to drop bounding entry for Melfas
touch controller
* tag 'input-for-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
MAINTAINERS: Input: Drop melfas-mip4 section
Input: mtk-pmic-keys - MT6359 has a specific release irq
Input: i8042 - add TUXEDO InfinityBook Pro Gen10 AMD to i8042 quirk table
Input: iqs7222 - avoid enabling unused interrupts
Input: xpad - add support for Flydigi Apex 5
Krzysztof Kozlowski [Wed, 10 Sep 2025 14:25:27 +0000 (16:25 +0200)]
MAINTAINERS: Input: Drop melfas-mip4 section
Emails to the sole melfas-mip4 driver maintainer bounce:
550 <jeesw@melfas.com> No such user here (connected from melfas.com)
so clearly this is not a supported driver anymore.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20250910142526.105286-2-krzysztof.kozlowski@linaro.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Linus Torvalds [Sun, 14 Sep 2025 00:16:52 +0000 (17:16 -0700)]
Merge tag 'erofs-for-6.17-rc6-fixes' of git://git./linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Fix invalid algorithm dereference in encoded extents
- Add missing dax_break_layout_final(), since recent FSDAX fixes
didn't cover EROFS
- Arrange long xattr name prefixes more properly
* tag 'erofs-for-6.17-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix long xattr name prefix placement
erofs: fix runtime warning on truncate_folio_batch_exceptionals()
erofs: fix invalid algorithm for encoded extents
Linus Torvalds [Sat, 13 Sep 2025 17:45:11 +0000 (10:45 -0700)]
Merge tag 'ceph-for-6.17-rc6' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A fix for a race condition around r_parent tracking that took a long
time to track down from Alex and some fixes for potential crashes on
accessing invalid memory from Max and myself.
All marked for stable"
* tag 'ceph-for-6.17-rc6' of https://github.com/ceph/ceph-client:
libceph: fix invalid accesses to ceph_connection_v1_info
ceph: fix crash after fscrypt_encrypt_pagecache_blocks() error
ceph: always call ceph_shift_unused_folios_left()
ceph: fix race condition where r_parent becomes stale before sending message
ceph: fix race condition validating r_parent before applying state
Linus Torvalds [Sat, 13 Sep 2025 17:40:50 +0000 (10:40 -0700)]
Merge tag 'regulator-fix-v6.17-rc5' of git://git./linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"One fix for sy7636a which got confused about which device to use to
manage the lifecycle of the power good GPIO because it's looked up
from the parent device due to the way DT bindings work"
* tag 'regulator-fix-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: sy7636a: fix lifecycle of power good gpio
Linus Torvalds [Sat, 13 Sep 2025 17:36:06 +0000 (10:36 -0700)]
Merge tag 'driver-core-6.17-rc6' of git://git./linux/kernel/git/driver-core/driver-core
Pull driver core fixes from Danilo Krummrich:
- Fix UAF in cgroup pressure polling by using kernfs_get_active_of()
to prevent operations on released file descriptors
- Fix unresolved intra-doc link in the documentation of struct Device
when CONFIG_DRM != y
- Update the DMA Rust MAINTAINERS entry
* tag 'driver-core-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
MAINTAINERS: Update the DMA Rust entry
kernfs: Fix UAF in polling when open file is released
rust: device: fix unresolved link to drm::Device
Linus Torvalds [Fri, 12 Sep 2025 17:46:10 +0000 (10:46 -0700)]
Merge tag 'pci-v6.17-fixes-3' of git://git./linux/kernel/git/pci/pci
Pull pci fix from Bjorn Helgaas:
- Fix mvebu PCI enumeration regression caused by converting to
for_each_of_range() iterator (Klaus Kudielka)
* tag 'pci-v6.17-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: mvebu: Fix use of for_each_of_range() iterator
Linus Torvalds [Fri, 12 Sep 2025 16:29:59 +0000 (09:29 -0700)]
Merge tag 'drm-fixes-2025-09-12' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly pull fixes for drm, mostly amdgpu and xe, with a revert for
nouveau and some maintainers updates, and misc bits, doesn't seem too
out of the normal.
MAINTAINERS:
- add rust tree to MAINTAINERS
- fix X entries for nova/nouveau
nova:
- depend on 64-bit
i915:
- Fix size for for_each_set_bit() in abox iteration
xe:
- Don't touch survivability_mode on fini
- Fixes around eviction and suspend
- Extend Wa_13011645652 to PTL-H, WCL
amdgpu:
- PSP 11.x fix
- DPCD quirk handing fix
- DCN 3.5 PG fix
- Audio suspend fix
- OEM i2c clean up fix
- Module unload memory leak fix
- DC delay fix
- ISP firmware fix
- VCN fixes
amdkfd:
- P2P topology fix
- APU mem limit calculation fix
mediatek:
- fix potential OF node use-after-free
panthor:
- out-of-bounds check
nouveau:
- revert waitqueue removal for sched teardown
* tag 'drm-fixes-2025-09-12' of https://gitlab.freedesktop.org/drm/kernel: (25 commits)
MAINTAINERS: drm-misc: fix X: entries for nova/nouveau
drm/mediatek: clean up driver data initialisation
drm/mediatek: fix potential OF node use-after-free
drm/amdgpu/vcn: Allow limiting ctx to instance 0 for AV1 at any time
drm/amdgpu/vcn4: Fix IB parsing with multiple engine info packages
drm/amd/amdgpu: Declare isp firmware binary file
drm/amd/display: use udelay rather than fsleep
drm/amdgpu: fix a memory leak in fence cleanup when unloading
drm/xe: Extend Wa_13011645652 to PTL-H, WCL
drm/xe: Block exec and rebind worker while evicting for suspend / hibernate
drm/xe: Allow the pm notifier to continue on failure
drm/xe: Attempt to bring bos back to VRAM after eviction
drm/xe/configfs: Don't touch survivability_mode on fini
amd/amdkfd: correct mem limit calculation for small APUs
drm/amdkfd: fix p2p links bug in topology
drm/amd/display: remove oem i2c adapter on finish
drm/amd/display: Drop dm_prepare_suspend() and dm_complete()
drm/amd/display: Correct sequences and delays for DCN35 PG & RCG
drm/amd/display: Disable DPCD Probe Quirk
drm/i915/power: fix size for for_each_set_bit() in abox iteration
...
Linus Torvalds [Fri, 12 Sep 2025 16:03:01 +0000 (09:03 -0700)]
Merge tag 'v6.17-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Two smb3 client fixes, both for stable:
- Fix encryption problem with multiple compounded ops
- Fix rename error cases that could lead to data corruption"
* tag 'v6.17-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix data loss due to broken rename(2)
smb: client: fix compound alignment with encryption
Edward Adam Davis [Wed, 10 Sep 2025 07:58:47 +0000 (15:58 +0800)]
USB: core: remove the move buf action
The buffer size of sysfs is fixed at PAGE_SIZE, and the page offset
of the buf parameter of sysfs_emit_at() must be 0, there is no need
to manually manage the buf pointer offset.
Fixes:
711d41ab4a0e ("usb: core: Use sysfs_emit_at() when showing dynamic IDs")
Reported-by: syzbot+b6445765657b5855e869@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=
b6445765657b5855e869
Tested-by: syzbot+b6445765657b5855e869@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Link: https://lore.kernel.org/r/tencent_B32D6D8C9450EBFEEE5ACC2C7B0E6C402D0A@qq.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Greg Kroah-Hartman [Fri, 12 Sep 2025 08:41:15 +0000 (10:41 +0200)]
Merge tag 'usb-serial-6.17-rc6' of ssh://gitolite./linux/kernel/git/johan/usb-serial into usb-linus
Johan writes:
USB serial device ids for 6.17-rc6
Here are some new modem device ids.
Everything has been in linux-next with no reported issues.
* tag 'usb-serial-6.17-rc6' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
USB: serial: option: add Telit Cinterion LE910C4-WWX new compositions
USB: serial: option: add Telit Cinterion FN990A w/audio compositions
Dave Airlie [Thu, 11 Sep 2025 23:39:06 +0000 (09:39 +1000)]
Merge tag 'drm-xe-fixes-2025-09-11' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
- Don't touch survivability_mode on fini (Michal)
- Fixes around eviction and suspend (Thomas)
- Extend Wa_13011645652 to PTL-H, WCL (Julia)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/aMLq7QlaEPHGKXKX@intel.com
Linus Torvalds [Thu, 11 Sep 2025 23:35:06 +0000 (16:35 -0700)]
Merge tag 'mtd/fixes-for-6.17-rc6' of git://git./linux/kernel/git/mtd/linux
Pull mtd fixes from Miquel Raynal:
"SPI NAND fix:
- Wrong OOB layout for Winbond W25N01JW SPI NAND devices
Raw NAND fixes:
- Atmel raw NAND controller timings
- Buffer handling in stm32_fmc2 driver
- Error handling in Nuvoton's driver
MTD devices fixes:
- Wrong depends-on dependencies on the Intel DRM driver
* tag 'mtd/fixes-for-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: spinand: winbond: Fix oob_layout for W25N01JW
mtd: nand: raw: atmel: Respect tAR, tCLR in read setup timing
mtd: rawnand: stm32_fmc2: fix ECC overwrite
mtd: rawnand: stm32_fmc2: avoid overlapping mappings on ECC buffer
mtd: rawnand: nuvoton: Fix an error handling path in ma35_nand_chips_init()
mtd: MTD_INTEL_DG should depend on DRM_I915 or DRM_XE
Dave Airlie [Thu, 11 Sep 2025 23:34:36 +0000 (09:34 +1000)]
Merge tag 'drm-misc-fixes-2025-09-11' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
A maintainer update, an out-of-bound check for panthor and a revert for
nouveau to fix a race.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://lore.kernel.org/r/20250911-glistening-uakari-of-serendipity-06ceb1@houat
Dave Airlie [Thu, 11 Sep 2025 23:31:23 +0000 (09:31 +1000)]
Merge tag 'mediatek-drm-fixes-
20250910' of https://git./linux/kernel/git/chunkuang.hu/linux into drm-fixes
Mediatek DRM Fixes -
20250910
1. fix potential OF node use-after-free
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Link: https://lore.kernel.org/r/20250910231813.3526-1-chunkuang.hu@kernel.org
Dave Airlie [Thu, 11 Sep 2025 23:24:50 +0000 (09:24 +1000)]
Merge tag 'amd-drm-fixes-6.17-2025-09-10' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.17-2025-09-10:
amdgpu:
- PSP 11.x fix
- DPCD quirk handing fix
- DCN 3.5 PG fix
- Audio suspend fix
- OEM i2c clean up fix
- Module unload memory leak fix
- DC delay fix
- ISP firmware fix
- VCN fixes
amdkfd:
- P2P topology fix
- APU mem limit calculation fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250910162855.2507853-1-alexander.deucher@amd.com
Dave Airlie [Thu, 11 Sep 2025 23:21:19 +0000 (09:21 +1000)]
Merge tag 'drm-intel-fixes-2025-09-10' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
- Fix size for for_each_set_bit() in abox iteration [display] (Jani Nikula)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Tvrtko Ursulin <tursulin@igalia.com>
Link: https://lore.kernel.org/r/aMFUtRdJ46qK-EXl@linux
Dave Airlie [Thu, 11 Sep 2025 22:40:15 +0000 (08:40 +1000)]
Merge tag 'drm-rust-fixes-2025-09-05' of https://gitlab.freedesktop.org/drm/rust/kernel into drm-fixes
- Add drm-rust tree to MAINTAINERS
- Require CONFIG_64BIT for Nova
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alice Ryhl <aliceryhl@google.com>
Link: https://lore.kernel.org/r/aLquN1YvdyI_6PJS@google.com
Danilo Krummrich [Wed, 10 Sep 2025 09:40:03 +0000 (11:40 +0200)]
MAINTAINERS: Update the DMA Rust entry
Update the DMA Rust maintainers entry in the following two aspects:
(1) Change Abdiel's entry to 'Reviewer'.
(2) Take patches through the driver-core tree.
Abdiel won't do any more maintainer work on the DMA (or scatterlist)
infrastructure, but he'd like to be kept in the loop, hence change is
entry to 'R:'.
Analogous to [1], the DMA (and scatterlist) helpers are closely coupled
with the core device infrastructure and the device lifecycle, hence take
patches through the driver-core tree by default.
Cc: Abdiel Janulgue <abdiel.janulgue@gmail.com>
Link: https://lore.kernel.org/r/20250725202840.2251768-1-ojeda@kernel.org
Acked-by: Abdiel Janulgue <abdiel.janulgue@gmail.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Gao Xiang [Thu, 11 Sep 2025 19:27:11 +0000 (03:27 +0800)]
erofs: fix long xattr name prefix placement
Currently, xattr name prefixes are forcibly placed into the packed
inode if the fragments feature is enabled, and users have no option
to put them in plain form directly on disk.
This is inflexible. First, as mentioned above, users should be able
to store unwrapped long xattr name prefixes unconditionally
(COMPAT_PLAIN_XATTR_PFX). Second, since we now have the new metabox
inode to store metadata, it should be used when available instead
of the packed inode.
Fixes:
414091322c63 ("erofs: implement metadata compression")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Linus Torvalds [Thu, 11 Sep 2025 15:54:42 +0000 (08:54 -0700)]
Merge tag 'net-6.17-rc6' of git://git./linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from CAN, netfilter and wireless.
We have an IPv6 routing regression with the relevant fix still a WiP.
This includes a last-minute revert to avoid more problems.
Current release - new code bugs:
- wifi: nl80211: completely disable per-link stats for now
Previous releases - regressions:
- dev_ioctl: take ops lock in hwtstamp lower paths
- netfilter:
- fix spurious set lookup failures
- fix lockdep splat due to missing annotation
- genetlink: fix genl_bind() invoking bind() after -EPERM
- phy: transfer phy_config_inband() locking responsibility to phylink
- can: xilinx_can: fix use-after-free of transmitted SKB
- hsr: fix lock warnings
- eth:
- igb: fix NULL pointer dereference in ethtool loopback test
- i40e: fix Jumbo Frame support after iPXE boot
- macsec: sync features on RTM_NEWLINK
Previous releases - always broken:
- tunnels: reset the GSO metadata before reusing the skb
- mptcp: make sync_socket_options propagate SOCK_KEEPOPEN
- can: j1939: implement NETDEV_UNREGISTER notification hanidler
- wifi: ath12k: fix WMI TLV header misalignment"
* tag 'net-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
Revert "net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups"
hsr: hold rcu and dev lock for hsr_get_port_ndev
hsr: use hsr_for_each_port_rtnl in hsr_port_get_hsr
hsr: use rtnl lock when iterating over ports
wifi: nl80211: completely disable per-link stats for now
net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups
net: ethtool: fix wrong type used in struct kernel_ethtool_ts_info
MAINTAINERS: add Phil as netfilter reviewer
netfilter: nf_tables: restart set lookup on base_seq change
netfilter: nf_tables: make nft_set_do_lookup available unconditionally
netfilter: nf_tables: place base_seq in struct net
netfilter: nft_set_rbtree: continue traversal if element is inactive
netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
netfilter: nft_set_bitmap: fix lockdep splat due to missing annotation
can: rcar_can: rcar_can_resume(): fix s2ram with PSCI
can: xilinx_can: xcan_write_frame(): fix use-after-free of transmitted SKB
can: j1939: j1939_local_ecu_get(): undo increment when j1939_local_ecu_get() fails
can: j1939: j1939_sk_bind(): call j1939_priv_put() immediately when j1939_local_ecu_get() failed
can: j1939: implement NETDEV_UNREGISTER notification handler
selftests: can: enable CONFIG_CAN_VCAN as a module
...
Linus Torvalds [Thu, 11 Sep 2025 15:46:30 +0000 (08:46 -0700)]
Merge tag 's390-6.17-4' of git://git./linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- ptep_modify_prot_start() may be called in a loop, which might lead to
the preempt_count overflow due to the unnecessary preemption
disabling. Do not disable preemption to prevent the overflow
- Events of type PERF_TYPE_HARDWARE are not tested for sampling and
return -EOPNOTSUPP eventually.
Instead, deny all sampling events by CPUMF counter facility and
return -ENOENT to allow other PMUs to be tried
- The PAI PMU driver returns -EINVAL if an event out of its range. That
aborts a search for an alternative PMU driver.
Instead, return -ENOENT to allow other PMUs to be tried
* tag 's390-6.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cpum_cf: Deny all sampling events by counter PMU
s390/pai: Deny all events not handled by this PMU
s390/mm: Prevent possible preempt_count overflow
Linus Torvalds [Thu, 11 Sep 2025 15:11:16 +0000 (08:11 -0700)]
Merge tag 'pm-6.17-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a nasty hibernation regression introduced during the 6.16
cycle, an issue related to energy model management occurring on Intel
hybrid systems where some CPUs are offline to start with, and two
regressions in the amd-pstate driver:
- Restore a pm_restrict_gfp_mask() call in hibernation_snapshot()
that was removed incorrectly during the 6.16 development cycle
(Rafael Wysocki)
- Introduce a function for registering a perf domain without
triggering a system-wide CPU capacity update and make the
intel_pstate driver use it to avoid reocurring unsuccessful
attempts to update capacities of all CPUs in the system (Rafael
Wysocki)
- Fix setting of CPPC.min_perf in the active mode with performance
governor in the amd-pstate driver to restore its expected behavior
changed recently (Gautham Shenoy)
- Avoid mistakenly setting EPP to 0 in the amd-pstate driver after
system resume as a result of recent code changes (Mario
Limonciello)"
* tag 'pm-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Restrict GFP mask in hibernation_snapshot()
PM: EM: Add function for registering a PD without capacity update
cpufreq/amd-pstate: Fix a regression leading to EPP 0 after resume
cpufreq/amd-pstate: Fix setting of CPPC.min_perf in active mode for performance governor
Linus Torvalds [Thu, 11 Sep 2025 15:01:18 +0000 (08:01 -0700)]
Merge tag 'for-6.17-rc5-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix delayed inode tracking in xarray, eviction can race with
insertion and leave behind a disconnected inode
- on systems with large page (64K) and small block size (4K) fix
compression read that can return partially filled folio
- slightly relax compression option format for backward compatibility,
allow to specify level for LZO although there's only one
- fix simple quota accounting of compressed extents
- validate minimum device size in 'device add'
- update maintainers' entry
* tag 'for-6.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't allow adding block device of less than 1 MB
MAINTAINERS: update btrfs entry
btrfs: fix subvolume deletion lockup caused by inodes xarray race
btrfs: fix corruption reading compressed range when block size is smaller than page size
btrfs: accept and ignore compression level for lzo
btrfs: fix squota compressed stats leak
Linus Torvalds [Thu, 11 Sep 2025 14:54:16 +0000 (07:54 -0700)]
Merge tag 'bpf-fixes' of git://git./linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
"A number of fixes accumulated due to summer vacations
- Fix out-of-bounds dynptr write in bpf_crypto_crypt() kfunc which
was misidentified as a security issue (Daniel Borkmann)
- Update the list of BPF selftests maintainers (Eduard Zingerman)
- Fix selftests warnings with icecc compiler (Ilya Leoshkevich)
- Disable XDP/cpumap direct return optimization (Jesper Dangaard
Brouer)
- Fix unexpected get_helper_proto() result in unusual configuration
BPF_SYSCALL=y and BPF_EVENTS=n (Jiri Olsa)
- Allow fallback to interpreter when JIT support is limited (KaFai
Wan)
- Fix rqspinlock and choose trylock fallback for NMI waiters. Pick
the simplest fix. More involved fix is targeted bpf-next (Kumar
Kartikeya Dwivedi)
- Fix cleanup when tcp_bpf_send_verdict() fails to allocate
psock->cork (Kuniyuki Iwashima)
- Disallow bpf_timer in PREEMPT_RT for now. Proper solution is being
discussed for bpf-next. (Leon Hwang)
- Fix XSK cq descriptor production (Maciej Fijalkowski)
- Tell memcg to use allow_spinning=false path in bpf_timer_init() to
avoid lockup in cgroup_file_notify() (Peilin Ye)
- Fix bpf_strnstr() to handle suffix match cases (Rong Tao)"
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Skip timer cases when bpf_timer is not supported
bpf: Reject bpf_timer for PREEMPT_RT
tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork.
bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()
bpf: Allow fall back to interpreter for programs with stack size <= 512
rqspinlock: Choose trylock fallback for NMI waiters
xsk: Fix immature cq descriptor production
bpf: Update the list of BPF selftests maintainers
selftests/bpf: Add tests for bpf_strnstr
selftests/bpf: Fix "expression result unused" warnings with icecc
bpf: Fix bpf_strnstr() to handle suffix match cases better
selftests/bpf: Extend crypto_sanity selftest with invalid dst buffer
bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt
bpf: Check the helper function is valid in get_helper_proto
bpf, cpumap: Disable page_pool direct xdp_return need larger scope
Paolo Abeni [Thu, 11 Sep 2025 14:33:31 +0000 (16:33 +0200)]
Revert "net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups"
This reverts commit
5537a4679403 ("net: usb: asix: ax88772: drop
phylink use in PM to avoid MDIO runtime PM wakeups"), it breaks
operation of asix ethernet usb dongle after system suspend-resume
cycle.
Link: https://lore.kernel.org/all/b5ea8296-f981-445d-a09a-2f389d7f6fdd@samsung.com/
Fixes:
5537a4679403 ("net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/2945b9dbadb8ee1fee058b19554a5cb14f1763c1.1757601118.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Rafael J. Wysocki [Thu, 11 Sep 2025 12:22:35 +0000 (14:22 +0200)]
Merge branches 'pm-sleep' and 'pm-em'
Merge a hibernation regression fix and an fix related to energy model
management for 6.17-rc6
* pm-sleep:
PM: hibernate: Restrict GFP mask in hibernation_snapshot()
* pm-em:
PM: EM: Add function for registering a PD without capacity update
Paolo Abeni [Thu, 11 Sep 2025 10:49:52 +0000 (12:49 +0200)]
Merge tag 'wireless-2025-09-11' of https://git./linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Some more fixes:
- iwlwifi: fix 130/1030 devices
- ath12k: fix alignment, power save
- virt_wifi: fix crash
- cfg80211: disable per-link stats due
to buffer size issues
* tag 'wireless-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: nl80211: completely disable per-link stats for now
wifi: virt_wifi: Fix page fault on connect
wifi: cfg80211: Fix "no buffer space available" error in nl80211_get_station() for MLO
wifi: iwlwifi: fix 130/1030 configs
wifi: ath12k: fix WMI TLV header misalignment
wifi: ath12k: Fix missing station power save configuration
====================
Link: https://patch.msgid.link/20250911100345.20025-3-johannes@sipsolutions.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 11 Sep 2025 09:49:29 +0000 (11:49 +0200)]
Merge branch 'hsr-fix-lock-warnings'
Hangbin Liu says:
====================
hsr: fix lock warnings
hsr_for_each_port is called in many places without holding the RCU read
lock, this may trigger warnings on debug kernels like:
[ 40.457015] [ T201] WARNING: suspicious RCU usage
[ 40.457020] [ T201] 6.17.0-rc2-virtme #1 Not tainted
[ 40.457025] [ T201] -----------------------------
[ 40.457029] [ T201] net/hsr/hsr_main.c:137 RCU-list traversed in non-reader section!!
[ 40.457036] [ T201]
other info that might help us debug this:
[ 40.457040] [ T201]
rcu_scheduler_active = 2, debug_locks = 1
[ 40.457045] [ T201] 2 locks held by ip/201:
[ 40.457050] [ T201] #0:
ffffffff93040a40 (&ops->srcu){.+.+}-{0:0}, at: rtnl_link_ops_get+0xf2/0x280
[ 40.457080] [ T201] #1:
ffffffff92e7f968 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x5e1/0xb20
[ 40.457102] [ T201]
stack backtrace:
[ 40.457108] [ T201] CPU: 2 UID: 0 PID: 201 Comm: ip Not tainted 6.17.0-rc2-virtme #1 PREEMPT(full)
[ 40.457114] [ T201] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 40.457117] [ T201] Call Trace:
[ 40.457120] [ T201] <TASK>
[ 40.457126] [ T201] dump_stack_lvl+0x6f/0xb0
[ 40.457136] [ T201] lockdep_rcu_suspicious.cold+0x4f/0xb1
[ 40.457148] [ T201] hsr_port_get_hsr+0xfe/0x140
[ 40.457158] [ T201] hsr_add_port+0x192/0x940
[ 40.457167] [ T201] ? __pfx_hsr_add_port+0x10/0x10
[ 40.457176] [ T201] ? lockdep_init_map_type+0x5c/0x270
[ 40.457189] [ T201] hsr_dev_finalize+0x4bc/0xbf0
[ 40.457204] [ T201] hsr_newlink+0x3c3/0x8f0
[ 40.457212] [ T201] ? __pfx_hsr_newlink+0x10/0x10
[ 40.457222] [ T201] ? rtnl_create_link+0x173/0xe40
[ 40.457233] [ T201] rtnl_newlink_create+0x2cf/0x750
[ 40.457243] [ T201] ? __pfx_rtnl_newlink_create+0x10/0x10
[ 40.457247] [ T201] ? __dev_get_by_name+0x12/0x50
[ 40.457252] [ T201] ? rtnl_dev_get+0xac/0x140
[ 40.457259] [ T201] ? __pfx_rtnl_dev_get+0x10/0x10
[ 40.457285] [ T201] __rtnl_newlink+0x22c/0xa50
[ 40.457305] [ T201] rtnl_newlink+0x637/0xb20
Adding rcu_read_lock() for all hsr_for_each_port() looks confusing.
Introduce a new helper, hsr_for_each_port_rtnl(), that assumes the
RTNL lock is held. This allows callers in suitable contexts to iterate
ports safely without explicit RCU locking.
Other code paths that rely on RCU protection continue to use
hsr_for_each_port() with rcu_read_lock().
====================
Link: https://patch.msgid.link/20250905091533.377443-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Hangbin Liu [Fri, 5 Sep 2025 09:15:33 +0000 (09:15 +0000)]
hsr: hold rcu and dev lock for hsr_get_port_ndev
hsr_get_port_ndev calls hsr_for_each_port, which need to hold rcu lock.
On the other hand, before return the port device, we need to hold the
device reference to avoid UaF in the caller function.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Fixes:
9c10dd8eed74 ("net: hsr: Create and export hsr_get_port_ndev()")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250905091533.377443-4-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Hangbin Liu [Fri, 5 Sep 2025 09:15:32 +0000 (09:15 +0000)]
hsr: use hsr_for_each_port_rtnl in hsr_port_get_hsr
hsr_port_get_hsr() iterates over ports using hsr_for_each_port(),
but many of its callers do not hold the required RCU lock.
Switch to hsr_for_each_port_rtnl(), since most callers already hold
the rtnl lock. After review, all callers are covered by either the rtnl
lock or the RCU lock, except hsr_dev_xmit(). Fix this by adding an
RCU read lock there.
Fixes:
c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250905091533.377443-3-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Hangbin Liu [Fri, 5 Sep 2025 09:15:31 +0000 (09:15 +0000)]
hsr: use rtnl lock when iterating over ports
hsr_for_each_port is called in many places without holding the RCU read
lock, this may trigger warnings on debug kernels. Most of the callers
are actually hold rtnl lock. So add a new helper hsr_for_each_port_rtnl
to allow callers in suitable contexts to iterate ports safely without
explicit RCU locking.
This patch only fixed the callers that is hold rtnl lock. Other caller
issues will be fixed in later patches.
Fixes:
c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250905091533.377443-2-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Johannes Berg [Wed, 10 Sep 2025 13:11:21 +0000 (15:11 +0200)]
wifi: nl80211: completely disable per-link stats for now
After commit
8cc71fc3b82b ("wifi: cfg80211: Fix "no buffer
space available" error in nl80211_get_station() for MLO"),
the per-link data is only included in station dumps, where
the size limit is somewhat less of an issue. However, it's
still an issue, depending on how many links a station has
and how much per-link data there is. Thus, for now, disable
per-link statistics entirely.
A complete fix will need to take this into account, make it
opt-in by userspace, and change the dump format to be able
to split a single station's data across multiple netlink
dump messages, which all together is too much development
for a fix.
Fixes:
82d7f841d9bd ("wifi: cfg80211: extend to embed link level statistics in NL message")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Linus Torvalds [Thu, 11 Sep 2025 04:19:34 +0000 (21:19 -0700)]
Merge tag 'mm-hotfixes-stable-2025-09-10-20-00' of git://git./linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 15 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 14 of these
fixes are for MM.
This includes
- kexec fixes from Breno for a recently introduced
use-uninitialized bug
- DAMON fixes from Quanmin Yan to avoid div-by-zero crashes
which can occur if the operator uses poorly-chosen insmod
parameters
and misc singleton fixes"
* tag 'mm-hotfixes-stable-2025-09-10-20-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add tree entry to numa memblocks and emulation block
mm/damon/sysfs: fix use-after-free in state_show()
proc: fix type confusion in pde_set_flags()
compiler-clang.h: define __SANITIZE_*__ macros only when undefined
mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()
ocfs2: fix recursive semaphore deadlock in fiemap call
mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory
mm/mremap: fix regression in vrm->new_addr check
percpu: fix race on alloc failed warning limit
mm/memory-failure: fix redundant updates for already poisoned pages
s390: kexec: initialize kexec_buf struct
riscv: kexec: initialize kexec_buf struct
arm64: kexec: initialize kexec_buf struct in load_other_segments()
mm/damon/reclaim: avoid divide-by-zero in damon_reclaim_apply_parameters()
mm/damon/lru_sort: avoid divide-by-zero in damon_lru_sort_apply_parameters()
mm/damon/core: set quota->charged_from to jiffies at first charge window
mm/hugetlb: add missing hugetlb_lock in __unmap_hugepage_range()
init/main.c: fix boot time tracing crash
mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range()
mm/khugepaged: fix the address passed to notifier on testing young
Linus Torvalds [Thu, 11 Sep 2025 03:52:16 +0000 (20:52 -0700)]
Merge tag 'vmscape-for-linus-
20250904' of git://git./linux/kernel/git/tip/tip
Pull vmescape mitigation fixes from Dave Hansen:
"Mitigate vmscape issue with indirect branch predictor flushes.
vmscape is a vulnerability that essentially takes Spectre-v2 and
attacks host userspace from a guest. It particularly affects
hypervisors like QEMU.
Even if a hypervisor may not have any sensitive data like disk
encryption keys, guest-userspace may be able to attack the
guest-kernel using the hypervisor as a confused deputy.
There are many ways to mitigate vmscape using the existing Spectre-v2
defenses like IBRS variants or the IBPB flushes. This series focuses
solely on IBPB because it works universally across vendors and all
vulnerable processors. Further work doing vendor and model-specific
optimizations can build on top of this if needed / wanted.
Do the normal issue mitigation dance:
- Add the CPU bug boilerplate
- Add a list of vulnerable CPUs
- Use IBPB to flush the branch predictors after running guests"
* tag 'vmscape-for-linus-
20250904' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vmscape: Add old Intel CPUs to affected list
x86/vmscape: Warn when STIBP is disabled with SMT
x86/bugs: Move cpu_bugs_smt_update() down
x86/vmscape: Enable the mitigation
x86/vmscape: Add conditional IBPB mitigation
x86/vmscape: Enumerate VMSCAPE bug
Documentation/hw-vuln: Add VMSCAPE documentation
Jakub Kicinski [Thu, 11 Sep 2025 02:33:55 +0000 (19:33 -0700)]
Merge tag 'nf-25-09-10-v2' of https://git./linux/kernel/git/netfilter/nf
Florian Westpha says:
====================
netfilter pull request nf-25-09-10
First patch adds a lockdep annotation for a false-positive splat.
Last patch adds formal reviewer tag for Phil Sutter to MAINTAINERS.
Rest of the patches resolve spurious false negative results during set
lookups while another CPU is processing a transaction.
This has been broken at least since v4.18 when an unconditional
synchronize_rcu call was removed from the commit phase of nf_tables.
Quoting from Stefan Hanreichs original report:
It seems like we've found an issue with atomicity when reloading
nftables rulesets. Sometimes there is a small window where rules
containing sets do not seem to apply to incoming traffic, due to the set
apparently being empty for a short amount of time when flushing / adding
elements.
Exanple ruleset:
table ip filter {
set match {
type ipv4_addr
flags interval
elements = { 0.0.0.0-192.168.2.19, 192.168.2.21-255.255.255.255 }
}
chain pre {
type filter hook prerouting priority filter; policy accept;
ip saddr @match accept
counter comment "must never match"
}
}
Reproducer transaction:
while true:
nft -f -<<EOF
flush set ip filter match
create element ip filter match { \
0.0.0.0-192.168.2.19, 192.168.2.21-255.255.255.255 }
EOF
done
Then create traffic. to/from e.g. 192.168.2.1 to 192.168.3.10.
Once in a while the counter will increment even though the
'ip saddr @match' rule should have accepted the packet.
See individual patches for details.
Thanks to Stefan Hanreich for an initial description and reproducer for
this bug and to Pablo Neira Ayuso for reviewing earlier iterations of
the patchset.
* tag 'nf-25-09-10-v2' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
MAINTAINERS: add Phil as netfilter reviewer
netfilter: nf_tables: restart set lookup on base_seq change
netfilter: nf_tables: make nft_set_do_lookup available unconditionally
netfilter: nf_tables: place base_seq in struct net
netfilter: nft_set_rbtree: continue traversal if element is inactive
netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
netfilter: nft_set_bitmap: fix lockdep splat due to missing annotation
====================
Link: https://patch.msgid.link/20250910190308.13356-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 11 Sep 2025 02:29:40 +0000 (19:29 -0700)]
Merge tag 'linux-can-fixes-for-6.17-
20250910' of git://git./linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2025-09-10
The 1st patch is by Alex Tran and fixes the Documentation of the
struct bcm_msg_head.
Davide Caratti's patch enabled the VCAN driver as a module for the
Linux self tests.
Tetsuo Handa contributes 3 patches that fix various problems in the
CAN j1939 protocol.
Anssi Hannula's patch fixes a potential use-after-free in the
xilinx_can driver.
Geert Uytterhoeven's patch fixes the rcan_can's suspend to RAM on
R-Car Gen3 using PSCI.
* tag 'linux-can-fixes-for-6.17-
20250910' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: rcar_can: rcar_can_resume(): fix s2ram with PSCI
can: xilinx_can: xcan_write_frame(): fix use-after-free of transmitted SKB
can: j1939: j1939_local_ecu_get(): undo increment when j1939_local_ecu_get() fails
can: j1939: j1939_sk_bind(): call j1939_priv_put() immediately when j1939_local_ecu_get() failed
can: j1939: implement NETDEV_UNREGISTER notification handler
selftests: can: enable CONFIG_CAN_VCAN as a module
docs: networking: can: change bcm_msg_head frames member to support flexible array
====================
Link: https://patch.msgid.link/20250910162907.948454-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 11 Sep 2025 02:21:11 +0000 (19:21 -0700)]
Merge branch '1GbE' of git://git./linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-09-09 (igb, i40e)
For igb:
Tianyu Xu removes passing of, no longer needed, NAPI id to avoid NULL
pointer dereference on ethtool loopback testing.
Kohei Enju corrects reporting/testing of link state when interface is
down.
For i40e:
Michal Schmidt corrects value being passed to free_irq().
Jake sets hardware maximum frame size on probe to ensure
expected/consistent state.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
i40e: fix Jumbo Frame support after iPXE boot
i40e: fix IRQ freeing in i40e_vsi_request_irq_msix error path
igb: fix link test skipping when interface is admin down
igb: Fix NULL pointer dereference in ethtool loopback test
====================
Link: https://patch.msgid.link/20250909203236.3603960-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Oleksij Rempel [Mon, 8 Sep 2025 11:26:19 +0000 (13:26 +0200)]
net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups
Drop phylink_{suspend,resume}() from ax88772 PM callbacks.
MDIO bus accesses have their own runtime-PM handling and will try to
wake the device if it is suspended. Such wake attempts must not happen
from PM callbacks while the device PM lock is held. Since phylink
{sus|re}sume may trigger MDIO, it must not be called in PM context.
No extra phylink PM handling is required for this driver:
- .ndo_open/.ndo_stop control the phylink start/stop lifecycle.
- ethtool/phylib entry points run in process context, not PM.
- phylink MAC ops program the MAC on link changes after resume.
Fixes:
e0bffe3e6894 ("net: asix: ax88772: migrate to phylink")
Reported-by: Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Tested-by: Hubert Wiśniewski <hubert.wisniewski.25632@gmail.com>
Tested-by: Xu Yang <xu.yang_2@nxp.com>
Link: https://patch.msgid.link/20250908112619.2900723-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Sun, 7 Sep 2025 20:43:20 +0000 (21:43 +0100)]
net: ethtool: fix wrong type used in struct kernel_ethtool_ts_info
In C, enumerated types do not have a defined size, apart from being
compatible with one of the standard types. This allows an ABI /
compiler to choose the type of an enum depending on the values it
needs to store, and storing larger values in it can lead to undefined
behaviour.
The tx_type and rx_filters members of struct kernel_ethtool_ts_info
are defined as enumerated types, but are bit arrays, where each bit
is defined by the enumerated type. This means they typically store
values in excess of the maximum value of the enumerated type, in
fact (1 << max_value) and thus must not be declared using the
enumated type.
Fix both of these to use u32, as per the corresponding __u32 UAPI type.
Fixes:
2111375b85ad ("net: Add struct kernel_ethtool_ts_info")
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/E1uvMEK-00000003Amd-2pWR@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 10 Sep 2025 19:38:41 +0000 (12:38 -0700)]
Merge tag 'nfs-for-6.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Stable patches:
- Revert "SUNRPC: Don't allow waiting for exiting tasks" as it is
breaking ltp tests
Bugfixes:
- Another set of fixes to the tracking of NFSv4 server capabilities
when crossing filesystem boundaries
- Localio fix to restore credentials and prevent triggering a
BUG_ON()
- Fix to prevent flapping of the localio on/off trigger
- Protections against 'eof page pollution' as demonstrated in
xfstests generic/363
- Series of patches to ensure correct ordering of O_DIRECT i/o and
truncate, fallocate and copy functions
- Fix a NULL pointer check in flexfiles reads that regresses 6.17
- Correct a typo that breaks flexfiles layout segment processing"
* tag 'nfs-for-6.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4/flexfiles: Fix layout merge mirror check.
SUNRPC: call xs_sock_process_cmsg for all cmsg
Revert "SUNRPC: Don't allow waiting for exiting tasks"
NFS: Fix the marking of the folio as up to date
NFS: nfs_invalidate_folio() must observe the offset and size arguments
NFSv4.2: Serialise O_DIRECT i/o and copy range
NFSv4.2: Serialise O_DIRECT i/o and clone range
NFSv4.2: Serialise O_DIRECT i/o and fallocate()
NFS: Serialise O_DIRECT i/o and truncate()
NFSv4.2: Protect copy offload and clone against 'eof page pollution'
NFS: Protect against 'eof page pollution'
flexfiles/pNFS: fix NULL checks on result of ff_layout_choose_ds_for_read
nfs/localio: avoid bouncing LOCALIO if nfs_client_is_local()
nfs/localio: restore creds before releasing pageio data
NFSv4: Clear the NFS_CAP_XATTR flag if not supported by the server
NFSv4: Clear NFS_CAP_OPEN_XOR and NFS_CAP_DELEGTIME if not supported
NFSv4: Clear the NFS_CAP_FS_LOCATIONS flag if it is not set
NFSv4: Don't clear capabilities that won't be reset
Alexei Starovoitov [Wed, 10 Sep 2025 19:34:09 +0000 (12:34 -0700)]
Merge branch 'bpf-reject-bpf_timer-for-preempt_rt'
Leon Hwang says:
====================
bpf: Reject bpf_timer for PREEMPT_RT
While running './test_progs -t timer' to validate the test case from
"selftests/bpf: Introduce experimental bpf_in_interrupt()"[0] for
PREEMPT_RT, I encountered a kernel warning:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
To address this, reject bpf_timer usage in the verifier when
PREEMPT_RT is enabled, and skip the corresponding timer selftests.
Changes:
v2 -> v3:
* Drop skipping test case 'timer_interrupt'.
* Address comments from Alexei:
* Respin targeting bpf tree.
* Trim commit log.
v1 -> v2:
* Skip test case 'timer_interrupt'.
Links:
[0] https://lore.kernel.org/bpf/
20250903140438.59517-1-leon.hwang@linux.dev/
====================
Link: https://patch.msgid.link/20250910125740.52172-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Leon Hwang [Wed, 10 Sep 2025 12:57:40 +0000 (20:57 +0800)]
selftests/bpf: Skip timer cases when bpf_timer is not supported
When enable CONFIG_PREEMPT_RT, verifier will reject bpf_timer with
returning -EOPNOTSUPP.
Therefore, skip test cases when errno is EOPNOTSUPP.
cd tools/testing/selftests/bpf
./test_progs -t timer
125 free_timer:SKIP
456 timer:SKIP
457/1 timer_crash/array:SKIP
457/2 timer_crash/hash:SKIP
457 timer_crash:SKIP
458 timer_lockup:SKIP
459 timer_mim:SKIP
Summary: 5/0 PASSED, 6 SKIPPED, 0 FAILED
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20250910125740.52172-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Leon Hwang [Wed, 10 Sep 2025 12:57:39 +0000 (20:57 +0800)]
bpf: Reject bpf_timer for PREEMPT_RT
When enable CONFIG_PREEMPT_RT, the kernel will warn when run timer
selftests by './test_progs -t timer':
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
In order to avoid such warning, reject bpf_timer in verifier when
PREEMPT_RT is enabled.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20250910125740.52172-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ilya Dryomov [Thu, 3 Jul 2025 10:10:50 +0000 (12:10 +0200)]
libceph: fix invalid accesses to ceph_connection_v1_info
There is a place where generic code in messenger.c is reading and
another place where it is writing to con->v1 union member without
checking that the union member is active (i.e. msgr1 is in use).
On 64-bit systems, con->v1.auth_retry overlaps with con->v2.out_iter,
so such a read is almost guaranteed to return a bogus value instead of
0 when msgr2 is in use. This ends up being fairly benign because the
side effect is just the invalidation of the authorizer and successive
fetching of new tickets.
con->v1.connect_seq overlaps with con->v2.conn_bufs and the fact that
it's being written to can cause more serious consequences, but luckily
it's not something that happens often.
Cc: stable@vger.kernel.org
Fixes:
cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Linus Torvalds [Wed, 10 Sep 2025 19:03:47 +0000 (12:03 -0700)]
Merge tag 'trace-v6.17-rc4' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Remove redundant __GFP_NOWARN flag is kmalloc
As now __GFP_NOWARN is part of __GFP_NOWAIT, it can be removed from
kmalloc as it is redundant.
- Use copy_from_user_nofault() instead of _inatomic() for trace markers
The trace_marker files are written to to allow user space to quickly
write into the tracing ring buffer.
Back in 2016, the get_user_pages_fast() and the kmap() logic was
replaced by a __copy_from_user_inatomic(), but didn't properly
disable page faults around it.
Since the time this was added, copy_from_user_nofault() was added
which does the required page fault disabling for us.
- Fix the assembly markup in the ftrace direct sample code
The ftrace direct sample code (which is also used for selftests), had
the size directive between the "leave" and the "ret" instead of after
the ret. This caused objtool to think the code was unreachable.
- Only call unregister_pm_notifier() on outer most fgraph registration
There was an error path in register_ftrace_graph() that did not call
unregister_pm_notifier() on error, so it was added in the error path.
The problem with that fix, is that register_pm_notifier() is only
called by the initial user of fgraph. If that succeeds, but another
fgraph registration were to fail, then unregister_pm_notifier() would
be called incorrectly.
- Fix a crash in osnoise when zero size cpumask is passed in
If a zero size CPU mask is passed in, the kmalloc() would return
ZERO_SIZE_PTR which is not checked, and the code would continue
thinking it had real memory and crash. If zero is passed in as the
size of the write, simply return 0.
- Fix possible warning in trace_pid_write()
If while processing a series of numbers passed to the "set_event_pid"
file, and one of the updates fails to allocate (triggered by a fault
injection), it can cause a warning to trigger. Check the return value
of the call to trace_pid_list_set() and break out early with an error
code if it fails.
* tag 'trace-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Silence warning when chunk allocation fails in trace_pid_write
tracing/osnoise: Fix null-ptr-deref in bitmap_parselist()
trace/fgraph: Fix error handling
ftrace/samples: Fix function size computation
tracing: Fix tracing_marker may trigger page fault during preempt_disable
trace: Remove redundant __GFP_NOWARN
Rafael J. Wysocki [Wed, 10 Sep 2025 09:41:59 +0000 (11:41 +0200)]
PM: hibernate: Restrict GFP mask in hibernation_snapshot()
Commit
12ffc3b1513e ("PM: Restrict swap use to later in the suspend
sequence") incorrectly removed a pm_restrict_gfp_mask() call from
hibernation_snapshot(), so memory allocations involving swap are not
prevented from being carried out in this code path any more which may
lead to serious breakage.
The symptoms of such breakage have become visible after adding a
shrink_shmem_memory() call to hibernation_snapshot() in commit
2640e819474f ("PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()")
which caused this problem to be much more likely to manifest itself.
However, since commit
2640e819474f was initially present in the DRM
tree that did not include commit
12ffc3b1513e, the symptoms of this
issue were not visible until merge commit
260f6f4fda93 ("Merge tag
'drm-next-2025-07-30' of https://gitlab.freedesktop.org/drm/kernel")
that exposed it through an entirely reasonable merge conflict
resolution.
Fixes:
12ffc3b1513e ("PM: Restrict swap use to later in the suspend sequence")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220555
Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Cc: 6.16+ <stable@vger.kernel.org> # 6.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Florian Westphal [Tue, 9 Sep 2025 21:52:31 +0000 (23:52 +0200)]
MAINTAINERS: add Phil as netfilter reviewer
Phil has contributed to netfilter with features, fixes and patch reviews
for a long time. Make this more formal and add Reviewer tag.
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal [Wed, 10 Sep 2025 08:02:22 +0000 (10:02 +0200)]
netfilter: nf_tables: restart set lookup on base_seq change
The hash, hash_fast, rhash and bitwise sets may indicate no result even
though a matching element exists during a short time window while other
cpu is finalizing the transaction.
This happens when the hash lookup/bitwise lookup function has picked up
the old genbit, right before it was toggled by nf_tables_commit(), but
then the same cpu managed to unlink the matching old element from the
hash table:
cpu0 cpu1
has added new elements to clone
has marked elements as being
inactive in new generation
perform lookup in the set
enters commit phase:
A) observes old genbit
increments base_seq
I) increments the genbit
II) removes old element from the set
B) finds matching element
C) returns no match: found
element is not valid in old
generation
Next lookup observes new genbit and
finds matching e2.
Consider a packet matching element e1, e2.
cpu0 processes following transaction:
1. remove e1
2. adds e2, which has same key as e1.
P matches both e1 and e2. Therefore, cpu1 should always find a match
for P. Due to above race, this is not the case:
cpu1 observed the old genbit. e2 will not be considered once it is found.
The element e1 is not found anymore if cpu0 managed to unlink it from the
hlist before cpu1 found it during list traversal.
The situation only occurs for a brief time period, lookups happening
after I) observe new genbit and return e2.
This problem exists in all set types except nft_set_pipapo, so fix it once
in nft_lookup rather than each set ops individually.
Sample the base sequence counter, which gets incremented right before the
genbit is changed.
Then, if no match is found, retry the lookup if the base sequence was
altered in between.
If the base sequence hasn't changed:
- No update took place: no-match result is expected.
This is the common case. or:
- nf_tables_commit() hasn't progressed to genbit update yet.
Old elements were still visible and nomatch result is expected, or:
- nf_tables_commit updated the genbit:
We picked up the new base_seq, so the lookup function also picked
up the new genbit, no-match result is expected.
If the old genbit was observed, then nft_lookup also picked up the old
base_seq: nft_lookup_should_retry() returns true and relookup is performed
in the new generation.
This problem was added when the unconditional synchronize_rcu() call
that followed the current/next generation bit toggle was removed.
Thanks to Pablo Neira Ayuso for reviewing an earlier version of this
patchset, for suggesting re-use of existing base_seq and placement of
the restart loop in nft_set_do_lookup().
Fixes:
0cbc06b3faba ("netfilter: nf_tables: remove synchronize_rcu in commit phase")
Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal [Wed, 10 Sep 2025 08:02:21 +0000 (10:02 +0200)]
netfilter: nf_tables: make nft_set_do_lookup available unconditionally
This function was added for retpoline mitigation and is replaced by a
static inline helper if mitigations are not enabled.
Enable this helper function unconditionally so next patch can add a lookup
restart mechanism to fix possible false negatives while transactions are
in progress.
Adding lookup restarts in nft_lookup_eval doesn't work as nft_objref would
then need the same copypaste loop.
This patch is separate to ease review of the actual bug fix.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal [Wed, 10 Sep 2025 08:02:20 +0000 (10:02 +0200)]
netfilter: nf_tables: place base_seq in struct net
This will soon be read from packet path around same time as the gencursor.
Both gencursor and base_seq get incremented almost at the same time, so
it makes sense to place them in the same structure.
This doesn't increase struct net size on 64bit due to padding.
Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal [Wed, 10 Sep 2025 08:02:19 +0000 (10:02 +0200)]
netfilter: nft_set_rbtree: continue traversal if element is inactive
When the rbtree lookup function finds a match in the rbtree, it sets the
range start interval to a potentially inactive element.
Then, after tree lookup, if the matching element is inactive, it returns
NULL and suppresses a matching result.
This is wrong and leads to false negative matches when a transaction has
already entered the commit phase.
cpu0 cpu1
has added new elements to clone
has marked elements as being
inactive in new generation
perform lookup in the set
enters commit phase:
I) increments the genbit
A) observes new genbit
B) finds matching range
C) returns no match: found
range invalid in new generation
II) removes old elements from the tree
C New nft_lookup happening now
will find matching element,
because it is no longer
obscured by old, inactive one.
Consider a packet matching range r1-r2:
cpu0 processes following transaction:
1. remove r1-r2
2. add r1-r3
P is contained in both ranges. Therefore, cpu1 should always find a match
for P. Due to above race, this is not the case:
cpu1 does find r1-r2, but then ignores it due to the genbit indicating
the range has been removed. It does NOT test for further matches.
The situation persists for all lookups until after cpu0 hits II) after
which r1-r3 range start node is tested for the first time.
Move the "interval start is valid" check ahead so that tree traversal
continues if the starting interval is not valid in this generation.
Thanks to Stefan Hanreich for providing an initial reproducer for this
bug.
Reported-by: Stefan Hanreich <s.hanreich@proxmox.com>
Fixes:
c1eda3c6394f ("netfilter: nft_rbtree: ignore inactive matching element with no descendants")
Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal [Wed, 10 Sep 2025 08:02:18 +0000 (10:02 +0200)]
netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
The pipapo set type is special in that it has two copies of its
datastructure: one live copy containing only valid elements and one
on-demand clone used during transaction where adds/deletes happen.
This clone is not visible to the datapath.
This is unlike all other set types in nftables, those all link new
elements into their live hlist/tree.
For those sets, the lookup functions must skip the new elements while the
transaction is ongoing to ensure consistency.
As the clone is shallow, removal does have an effect on the packet path:
once the transaction enters the commit phase the 'gencursor' bit that
determines which elements are active and which elements should be ignored
(because they are no longer valid) is flipped.
This causes the datapath lookup to ignore these elements if they are found
during lookup.
This opens up a small race window where pipapo has an inconsistent view of
the dataset from when the transaction-cpu flipped the genbit until the
transaction-cpu calls nft_pipapo_commit() to swap live/clone pointers:
cpu0 cpu1
has added new elements to clone
has marked elements as being
inactive in new generation
perform lookup in the set
enters commit phase:
I) increments the genbit
A) observes new genbit
removes elements from the clone so
they won't be found anymore
B) lookup in datastructure
can't see new elements yet,
but old elements are ignored
-> Only matches elements that
were not changed in the
transaction
II) calls nft_pipapo_commit(), clone
and live pointers are swapped.
C New nft_lookup happening now
will find matching elements.
Consider a packet matching range r1-r2:
cpu0 processes following transaction:
1. remove r1-r2
2. add r1-r3
P is contained in both ranges. Therefore, cpu1 should always find a match
for P. Due to above race, this is not the case:
cpu1 does find r1-r2, but then ignores it due to the genbit indicating
the range has been removed.
At the same time, r1-r3 is not visible yet, because it can only be found
in the clone.
The situation persists for all lookups until after cpu0 hits II).
The fix is easy: Don't check the genbit from pipapo lookup functions.
This is possible because unlike the other set types, the new elements are
not reachable from the live copy of the dataset.
The clone/live pointer swap is enough to avoid matching on old elements
while at the same time all new elements are exposed in one go.
After this change, step B above returns a match in r1-r2.
This is fine: r1-r2 only becomes truly invalid the moment they get freed.
This happens after a synchronize_rcu() call and rcu read lock is held
via netfilter hook traversal (nf_hook_slow()).
Cc: Stefano Brivio <sbrivio@redhat.com>
Fixes:
3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal [Tue, 9 Sep 2025 12:45:21 +0000 (14:45 +0200)]
netfilter: nft_set_bitmap: fix lockdep splat due to missing annotation
Running new 'set_flush_add_atomic_bitmap' test case for nftables.git
with CONFIG_PROVE_RCU_LIST=y yields:
net/netfilter/nft_set_bitmap.c:231 RCU-list traversed in non-reader section!!
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by nft/4008:
#0:
ffff888147f79cd8 (&nft_net->commit_mutex){+.+.}-{4:4}, at: nf_tables_valid_genid+0x2f/0xd0
lockdep_rcu_suspicious+0x116/0x160
nft_bitmap_walk+0x22d/0x240
nf_tables_delsetelem+0x1010/0x1a00
..
This is a false positive, the list cannot be altered while the
transaction mutex is held, so pass the relevant argument to the iterator.
Fixes tag intentionally wrong; no point in picking this up if earlier
false-positive-fixups were not applied.
Fixes:
28b7a6b84c0a ("netfilter: nf_tables: avoid false-positive lockdep splats in set walker")
Signed-off-by: Florian Westphal <fw@strlen.de>
Geert Uytterhoeven [Thu, 14 Aug 2025 11:26:37 +0000 (13:26 +0200)]
can: rcar_can: rcar_can_resume(): fix s2ram with PSCI
On R-Car Gen3 using PSCI, s2ram powers down the SoC. After resume, the
CAN interface no longer works, until it is brought down and up again.
Fix this by calling rcar_can_start() from the PM resume callback, to
fully initialize the controller instead of just restarting it.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/699b2f7fcb60b31b6f976a37f08ce99c5ffccb31.1755165227.git.geert+renesas@glider.be
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Anssi Hannula [Fri, 22 Aug 2025 09:50:02 +0000 (12:50 +0300)]
can: xilinx_can: xcan_write_frame(): fix use-after-free of transmitted SKB
can_put_echo_skb() takes ownership of the SKB and it may be freed
during or after the call.
However, xilinx_can xcan_write_frame() keeps using SKB after the call.
Fix that by only calling can_put_echo_skb() after the code is done
touching the SKB.
The tx_lock is held for the entire xcan_write_frame() execution and
also on the can_get_echo_skb() side so the order of operations does not
matter.
An earlier fix commit
3d3c817c3a40 ("can: xilinx_can: Fix usage of skb
memory") did not move the can_put_echo_skb() call far enough.
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Fixes:
1598efe57b3e ("can: xilinx_can: refactor code in preparation for CAN FD support")
Link: https://patch.msgid.link/20250822095002.168389-1-anssi.hannula@bitwise.fi
[mkl: add "commit" in front of sha1 in patch description]
[mkl: fix indention]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Tetsuo Handa [Sun, 24 Aug 2025 10:27:40 +0000 (19:27 +0900)]
can: j1939: j1939_local_ecu_get(): undo increment when j1939_local_ecu_get() fails
Since j1939_sk_bind() and j1939_sk_release() call j1939_local_ecu_put()
when J1939_SOCK_BOUND was already set, but the error handling path for
j1939_sk_bind() will not set J1939_SOCK_BOUND when j1939_local_ecu_get()
fails, j1939_local_ecu_get() needs to undo priv->ents[sa].nusers++ when
j1939_local_ecu_get() returns an error.
Fixes:
9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/e7f80046-4ff7-4ce2-8ad8-7c3c678a42c9@I-love.SAKURA.ne.jp
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Tetsuo Handa [Sun, 24 Aug 2025 10:30:09 +0000 (19:30 +0900)]
can: j1939: j1939_sk_bind(): call j1939_priv_put() immediately when j1939_local_ecu_get() failed
Commit
25fe97cb7620 ("can: j1939: move j1939_priv_put() into sk_destruct
callback") expects that a call to j1939_priv_put() can be unconditionally
delayed until j1939_sk_sock_destruct() is called. But a refcount leak will
happen when j1939_sk_bind() is called again after j1939_local_ecu_get()
from previous j1939_sk_bind() call returned an error. We need to call
j1939_priv_put() before j1939_sk_bind() returns an error.
Fixes:
25fe97cb7620 ("can: j1939: move j1939_priv_put() into sk_destruct callback")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/4f49a1bc-a528-42ad-86c0-187268ab6535@I-love.SAKURA.ne.jp
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>