Mario Limonciello [Thu, 2 Oct 2025 17:42:40 +0000 (12:42 -0500)]
drm/amd: Stop exporting amdgpu_device_ip_suspend() outside amdgpu_device
amdgpu_device_ip_suspend() doesn't have a caller outside of
amdgpu_device.c. Make it static.
No intended functional changes.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Mario Limonciello [Thu, 2 Oct 2025 17:42:39 +0000 (12:42 -0500)]
drm/amd: Unify shutdown() callback behavior
[Why]
The shutdown() callback uses amdgpu_ip_suspend() which doesn't notify
drm clients during shutdown. This could lead to hangs.
[How]
Change amdgpu_pci_shutdown() to call the same sequence as suspend/resume.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prike Liang [Fri, 19 Sep 2025 07:14:41 +0000 (15:14 +0800)]
drm/amdgpu: validate userq va for GEM unmap
When a user unmaps a userq VA, the driver must ensure
the queue has no in-flight jobs. If there is pending work,
the kernel should wait for the attached eviction (bookkeeping)
fence to signal before deleting the mapping.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prike Liang [Thu, 9 Oct 2025 08:45:27 +0000 (16:45 +0800)]
drm/amdgpu: validate the queue va for resuming the queue
It requires validating the userq VA whether is mapped before
trying to resume the queue.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prike Liang [Tue, 22 Jul 2025 05:43:51 +0000 (13:43 +0800)]
drm/amdgpu: keeping waiting userq fence infinitely
Keeping waiting the userq fence infinitely until
hang detection, and then suspend the hang queue and
set the fence error.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prike Liang [Thu, 9 Oct 2025 08:44:31 +0000 (16:44 +0800)]
drm/amdgpu: track the userq bo va for its obj management
Track the userq obj for its life time, and reference and
dereference the buffer flag at its creating and destroying
period.
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prike Liang [Mon, 29 Sep 2025 05:52:13 +0000 (13:52 +0800)]
drm/amdgpu: add userq object va track helpers
Add the userq object virtual address list_add() helpers
for tracking the userq obj va address usage.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Christian König [Thu, 25 Sep 2025 10:09:56 +0000 (12:09 +0200)]
drm/amdgpu: reduce queue timeout to 2 seconds v2
There has been multiple complains that 10 seconds are usually to long.
The original requirement for longer timeout came from compute tests on
AMDVLK, since that is no longer a topic reduce the timeout back to 2
seconds for all queues.
While at it also remove any special handling for compute queues under
SRIOV or pass through.
v2: fix checkpatch warning.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Mario Limonciello [Wed, 1 Oct 2025 18:03:33 +0000 (13:03 -0500)]
drm/amd: Remove some unncessary header includes
Unnecessary headers can slow down the build, drop em.
No intended functional changes.
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Tested-by: Robert Beckett <bob.beckett@collabora.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Mario Limonciello [Mon, 6 Oct 2025 16:16:20 +0000 (11:16 -0500)]
drm/amd: Adjust whitespace for vangogh_ppt
A few changes have more whitespace than needed. Clean them up.
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Tested-by: Robert Beckett <bob.beckett@collabora.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Mon, 6 Oct 2025 18:23:59 +0000 (14:23 -0400)]
drm/amdgpu/mes: adjust the VMID masks
The firmware limits the max vmid, but align the
settings with the hw limits as well just to be safe.
Reviewed-by: Shaoyun liu <Shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Lijo Lazar [Wed, 8 Oct 2025 04:56:47 +0000 (10:26 +0530)]
drm/amdgpu: Skip SDMA suspend during mode-2 reset
For SDMA IP versions >= v4.4.2, firmware will take care of quiescing SDMA
before mode-2 reset.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Pierre-Eric Pelloux-Prayer [Mon, 15 Sep 2025 12:25:44 +0000 (14:25 +0200)]
drm/amdgpu: remove gart_window_lock usage from gmc v12
This lock was part of the SDMA workaround originally implemented in
gmc_v10_0_flush_gpu_tlb (
a70cb2176f7ef6f moved it to
amdgpu_gmc_flush_gpu_tlb).
This means this lock is useless and be safely dropped.
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Pierre-Eric Pelloux-Prayer [Fri, 5 Sep 2025 08:19:29 +0000 (10:19 +0200)]
drm/amdgpu: make non-NULL out fence mandatory
amdgpu_ttm_copy_mem_to_mem has a single caller, make sure the out
fence is non-NULL to simplify the code.
Since none of the pointers should be NULL, we can enable
__attribute__((nonnull))__.
While at it make the function static since it's only used from
amdgpuu_ttm.c.
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Lijo Lazar [Tue, 16 Sep 2025 05:50:42 +0000 (11:20 +0530)]
drm/amdgpu: Remove redundant return value
gfx_v9_4_3_xcc_kcq_init_queue doesn't have a fail condition.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prike Liang [Tue, 22 Jul 2025 03:14:28 +0000 (11:14 +0800)]
drm/amdgpu/userq: extend userq state
Extend the userq state for identifying the
userq invalid cases.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Taimur Hassan [Fri, 26 Sep 2025 23:58:54 +0000 (18:58 -0500)]
drm/amd/display: Promote DC to 3.2.353
- [FW Promotion] Release 0.1.30.0
- Driver implementation for cursor offloading to DMU
- Incorrect Mirror Cositing
- Enable Dynamic DTBCLK Switch
- Remove comparing uint32_t to zero
- Remove inaccessible URL
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Taimur Hassan <Syed.Hassan@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Taimur Hassan [Fri, 26 Sep 2025 20:15:01 +0000 (16:15 -0400)]
drm/amd/display: [FW Promotion] Release 0.1.30.0
Add new SMART_POWER_HDR commands to optimize power consumption on
certain OLED LED panels by sending MaxCLL per frame to TCON.
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Taimur Hassan <Syed.Hassan@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Nicholas Kazlauskas [Tue, 26 Aug 2025 21:12:44 +0000 (17:12 -0400)]
drm/amd/display: Driver implementation for cursor offloading to DMU
[Why]
We require an interlock between driver and firmware for upcoming
features and given that this could possibly happen on any single
cursor programming call (and that we can't asynchronously wait for
firmware to respond because of it) we'd be regressing cursor performance
by at least an extra 40us per call.
When we could possibly have cursor update every 20us - 100s from high
frequency gaming mice this means that we'd be stuttering or dropping
updates and impacting overall cursor performance.
We want a solution that can:
1. Interlock between other firmware features
2. Not stall out or require the DMCUB lock for every single update
[How]
When cursor offloading is enabled and supported by an ASIC driver will
route the cursor programming through to DMU as part of the regular
DC stream cursor programming interfaces for attributes and position.
The atomic pipe programming version will not be updated: this will still
follow the existing programming path by keeping track of a field that
specifies when the register writes should be deferred to DMU.
Cursor locking is not required when cursor offload is in progress since
the updates are consolidated and processed by DMU once at the end
of the frame in a periodic manner.
The shared buffer the firmware queries from is allocated along with the
rest of the scratch state region in an area that's accessible by
both firmware and driver.
The size of the cursor offload (v1) state will not change, but it does
have a unique union per ASIC version with room for expansion if needed.
When firmware features notifying DMU of DRR updates are not enabled we
now send an explicit vtotal min/max update via driver to DMU firmware
whenever the vtotal max changes. This is to allow the cursor programming
to determine the appropriate latch update point offset from vupdate.
Reviewed-by: Dillon Varone <dillon.varone@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Hung [Wed, 24 Sep 2025 15:27:35 +0000 (09:27 -0600)]
drm/amd/display: Remove comparing uint32_t to zero
[WHAT]
These *bypass are uint32_t and they will never be less than zero.
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Clay King [Mon, 22 Sep 2025 14:33:21 +0000 (10:33 -0400)]
drm/amd/display: Remove inaccessible URL
[WHAT]
Remove inaccessible link.
Reviewed-by: Joshua Aberback <joshua.aberback@amd.com>
Signed-off-by: Clay King <clayking@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Taimur Hassan [Fri, 19 Sep 2025 22:20:17 +0000 (17:20 -0500)]
drm/amd/display: Promote DC to 3.2.352
- Fix slice width calculation for YCbCr420
- Fix DTBCLK gating
- Use NRD cap as lttpr cap
- Consolidate DML2 FP guards
- DML2.1 Update
- Firmware Release 0.1.29.0 changes
Reviewed-by: Sun peng (Leo) Li <sunpeng.li@amd.com>
Signed-off-by: Taimur Hassan <Syed.Hassan@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Taimur Hassan [Fri, 19 Sep 2025 20:22:48 +0000 (16:22 -0400)]
drm/amd/display: [FW Promotion] Release 0.1.29.0
Add new interface for offloading cursor programming to DMUB.
Reviewed-by: Sun peng (Leo) Li <sunpeng.li@amd.com>
Signed-off-by: Taimur Hassan <Syed.Hassan@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fangzhi Zuo [Thu, 18 Sep 2025 20:25:45 +0000 (16:25 -0400)]
drm/amd/display: Prevent Gating DTBCLK before It Is Properly Latched
[why]
1. With allow_0_dtb_clk enabled, the time required to latch DTBCLK to 600 MHz
depends on the SMU. If DTBCLK is not latched to 600 MHz before set_mode completes,
gating DTBCLK causes the DP2 sink to lose its clock source.
2. The existing DTBCLK gating sequence ungates DTBCLK based on both pix_clk and ref_dtbclk,
but gates DTBCLK when either pix_clk or ref_dtbclk is zero.
pix_clk can be zero outside the set_mode sequence before DTBCLK is properly latched,
which can lead to DTBCLK being gated by mistake.
[how]
Consider both pixel_clk and ref_dtbclk when determining when it is safe to gate DTBCLK;
this is more accurate.
Reviewed-by: Charlene Liu <charlene.liu@amd.com>
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Fangzhi Zuo <Jerry.Zuo@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Peichen Huang [Thu, 11 Sep 2025 05:41:16 +0000 (13:41 +0800)]
drm/amd/display: lttpr cap should be nrd cap in bw_alloc mode
[WHY]
When bw allocation mode enabled, dpia may reports lttpr cap with
reduced common cap. It would cause driver not start pre-training with
max available bandwidth.
[How]
When bw allocation mode enabled, use NRD cap as lttpr cap.
Reviewed-by: Cruise Hung <cruise.hung@amd.com>
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Signed-off-by: Peichen Huang <PeiChen.Huang@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Nicholas Kazlauskas [Wed, 17 Sep 2025 17:00:58 +0000 (13:00 -0400)]
drm/amd/display: Rename FAMS2 global control lock to DMUB HW control lock
[Why]
FAMS2 dictates whether the inbox0 HW lock is required, but it is not the
only feature that may determine this.
In order to leverage the faster inbox0 HW lock in place of the inbox1
ringbuffer based control lock it's desirable to utilize the HWSS
based locking protocol FAMS2 has already implemented.
[How]
Rename the FAMS2 global control lock to DMUB HW control lock.
This is purely a refactor with no functional change, the logic that will
determine which features need to enable this HW lock will be added in a
future commit.
Reviewed-by: Alvin Lee <alvin.lee2@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Nicholas Kazlauskas [Wed, 17 Sep 2025 15:46:56 +0000 (11:46 -0400)]
drm/amd/display: Rename should_use_dmub_lock to reflect inbox1 usage
[Why]
Newer DCN use the DMCUB HW lock via inbox0 for performance reasons while
older ones will use inbox1.
The should_use_dmub_lock() function does not describe whether the lock
in general should be used, but whether it should be used via inbox1.
[How]
Rename the function to should_use_dmub_inbox1_lock() to reflect this.
Reviewed-by: Alvin Lee <alvin.lee2@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Nicholas Kazlauskas [Tue, 16 Sep 2025 17:57:17 +0000 (13:57 -0400)]
drm/amd/display: Support possibly NULL link for should_use_dmub_lock
[Why]
It's possible to have a stream enabled without a link or link encoder.
There are cases where we'd still like to interlock the driver
programming from firmware programming to ensure that we don't put the
hardware in an undefined (or error) state if two programming sequences
are simultaneously executed on the same hardware blocks.
[How]
Add an explicit DC parameter to should_use_dmub_lock().
Make pointers to should_use_dmub_lock() const since it's a checker
function that shouldn't modify state.
Update the callsites to pass in DC explicitly.
Check that the link is non-NULL before deferencing and performing link
based checks.
Reviewed-by: Alvin Lee <alvin.lee2@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Ivan Lipski [Wed, 17 Sep 2025 15:21:30 +0000 (11:21 -0400)]
drm/amd/display: Consolidate two DML2 FP guards
[Why&How]
Consolidate two FP guards into one in dml2 since they are separated by
one line of code, independent from the guard.
Reviewed-by: Dillon Varone <dillon.varone@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Relja Vojvodic [Wed, 17 Sep 2025 16:30:51 +0000 (12:30 -0400)]
drm/amd/display: Correct slice width calculation for YCbCr420
[Why]
-OVT compliance testing for 5120x2880p300Hz YCbCr420 was failing due to
incorrect slice width being calculated
[How]
-Ensure slice width is divisible by 2 for 420 to comply with spec
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Signed-off-by: Relja Vojvodic <rvojvodi@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Austin Zheng [Thu, 11 Sep 2025 19:37:32 +0000 (15:37 -0400)]
drm/amd/display: DML2.1 Reintegration
[Summary of changes]
- Updated structs
- Renaming of variables for clarity
Reviewed-by: Dillon Varone <dillon.varone@amd.com>
Signed-off-by: Austin Zheng <Austin.Zheng@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 24 Mar 2025 07:13:20 +0000 (15:13 +0800)]
drm/amd/ras: Add unified ras module top-level makefile
Add unified ras module top-level makefile.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 10:04:10 +0000 (18:04 +0800)]
drm/amd/ras: Add files to amdgpu ras manager makefile
Add files to amdgpu ras manager makefile.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 10:03:28 +0000 (18:03 +0800)]
drm/amd/ras: Add amdgpu ras management function.
Add amdgpu system configuration parameters and
functions needed by rascore.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 24 Mar 2025 10:46:06 +0000 (18:46 +0800)]
drm/amd/ras: Amdgpu preprocesses ras interrupts
Amdgpu preprocesses ras interrupts.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 24 Mar 2025 10:44:11 +0000 (18:44 +0800)]
drm/amd/ras: Add amdgpu ras system functions
Add amdgpu ras system functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 10:02:13 +0000 (18:02 +0800)]
drm/amd/ras: Amdgpu handle ras ioctl command
Amdgpu handle ras ioctl command.
V2:
Remove non-standard device information.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 10:00:48 +0000 (18:00 +0800)]
drm/amd/ras: Add amdgpu eeprom i2c configuration function
Add amdgpu eeprom i2c configuration function.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:59:25 +0000 (17:59 +0800)]
drm/amd/ras: Add amdgpu mp1 v13_0 configuration function
Add amdgpu mp1 v13_0 configuration function.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 24 Mar 2025 07:12:07 +0000 (15:12 +0800)]
drm/amd/ras: Add amdgpu nbio v7_9 configuration function
Add amdgpu nbio v7_9 configuration function.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:55:37 +0000 (17:55 +0800)]
drm/amd/ras: Add files to ras core Makefile
Add files to ras core Makefile.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:13:19 +0000 (17:13 +0800)]
drm/amd/ras: Add rascore unified interface function
1. Complete the initialization call of all
sub-functions.
2. Export common interfaces.
V2:
Remove the use of typedef to define function pointer.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:42:17 +0000 (17:42 +0800)]
drm/amd/ras: Add cper conversion function
Add cper conversion function.
V3:
Change commit message and update the calling function.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:37:35 +0000 (17:37 +0800)]
drm/amd/ras: Use ring buffer to record ras ecc data
Use ring buffer to record ras ecc data.
V3:
Change commit message and rename the file and
function names.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:33:58 +0000 (17:33 +0800)]
drm/amd/ras: Add thread to handle ras events
Add thread to handle ras events.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:31:24 +0000 (17:31 +0800)]
drm/amd/ras: Add ras ioctl command handler
Add ras ioctl command handler.
V2:
Remove ras global device list.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Thu, 5 Jun 2025 09:46:51 +0000 (17:46 +0800)]
drm/amd/ras: Add psp ras common functions
Add psp ras common functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Thu, 5 Jun 2025 09:46:08 +0000 (17:46 +0800)]
drm/amd/ras: Add psp v13_0 ras functions
Add psp v13_0 ras functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:30:48 +0000 (17:30 +0800)]
drm/amd/ras: Add eeprom ras functions
Add eeprom ras functions.
V5:
Remove duplicate data structure definition.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:28:14 +0000 (17:28 +0800)]
drm/amd/ras: Add gfx common ras functions
Add gfx common ras functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:29:02 +0000 (17:29 +0800)]
drm/amd/ras: Add gfx v9_0 ras functions
Add gfx v9_0 ras functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:26:49 +0000 (17:26 +0800)]
drm/amd/ras: Add umc common ras functions
Add umc common ras functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:27:44 +0000 (17:27 +0800)]
drm/amd/ras: Add umc v12_0 ras functions
Add umc v12_0 ras functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:25:04 +0000 (17:25 +0800)]
drm/amd/ras: Add nbio common ras functions
Add nbio common ras functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:25:59 +0000 (17:25 +0800)]
drm/amd/ras: Add nbio v7_9 ras functions
Add nbio v7_9 ras functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:20:28 +0000 (17:20 +0800)]
drm/amd/ras: Add mp1 common ras functions
Add mp1 common ras functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:21:41 +0000 (17:21 +0800)]
drm/amd/ras: Add mp1 v13_0 ras functions
Add mp1 v13_0 ras functions.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:15:59 +0000 (17:15 +0800)]
drm/amd/ras: Add aca common ras functions
Add aca common ras functions:
1. Aca hw init/fini.
2. Get ecc count of each ras block.
3. Update query ecc count from mp1.
4. Clear ras block ecc count.
V3:
Update the calling function.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
YiPeng Chai [Mon, 17 Mar 2025 09:16:54 +0000 (17:16 +0800)]
drm/amd/ras: Add ras aca parser v1.0
Add ras aca parser v1.0.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sunil Khatri [Tue, 30 Sep 2025 08:15:11 +0000 (13:45 +0530)]
drm/amdgpu: clean up amdgpu hmm range functions
Clean up the amdgpu hmm range functions for clearer
definition of each.
a. Split amdgpu_ttm_tt_get_user_pages_done into two:
1. amdgpu_hmm_range_valid: To check if the user pages
are valid and update seq num
2. amdgpu_hmm_range_free: Clean up the hmm range
and pfn memory.
b. amdgpu_ttm_tt_get_user_pages_done and
amdgpu_ttm_tt_discard_user_pages are similar function so remove
discard and directly use amdgpu_hmm_range_free to clean up the
hmm range and pfn memory.
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sunil Khatri [Wed, 24 Sep 2025 06:53:26 +0000 (12:23 +0530)]
drm/amdgpu: use user provided hmm_range buffer in amdgpu_ttm_tt_get_user_pages
update the amdgpu_ttm_tt_get_user_pages and all dependent function
along with it callers to use a user allocated hmm_range buffer instead
hmm layer allocates the buffer.
This is a need to get hmm_range pointers easily accessible
without accessing the bo and that is a requirement for the
userqueue to lock the userptrs effectively.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jonathan Kim [Wed, 18 Jun 2025 14:31:15 +0000 (10:31 -0400)]
drm/amdkfd: fix suspend/resume all calls in mes based eviction path
Suspend/resume all gangs should be done with the device lock is held.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jonathan Kim [Thu, 5 Jun 2025 14:18:37 +0000 (10:18 -0400)]
drm/amdgpu: enable suspend/resume all for gfx 12
Suspend/resume all gangs has been available for GFX12 for a while now
so enable it.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jonathan Kim [Thu, 9 Oct 2025 15:28:19 +0000 (11:28 -0400)]
drm/amdgpu: fix hung reset queue array memory allocation
By design the MES will return an array result that is twice the number
of hung doorbells it can report.
i.e. if up k reported doorbells are supported, then the
second half of the array, also of length k, holds the HQD information
(type/queue/pipe) where queue 1 corresponds to index 0 and k,
queue 2 corresponds to index 1 and k + 1 etc ...
The driver will use the HDQ info to target queue/pipe reset for
hardware scheduled user compute queues.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jonathan Kim [Thu, 9 Oct 2025 14:48:09 +0000 (10:48 -0400)]
drm/amdgpu: fix initialization of doorbell array for detect and hang
Initialized doorbells should be set to invalid rather than 0 to prevent
driver from over counting hung doorbells since it checks against the
invalid value to begin with.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jonathan Kim [Thu, 9 Oct 2025 14:45:42 +0000 (10:45 -0400)]
drm/amdgpu: fix gfx12 mes packet status return check
GFX12 MES uses low 32 bits of status return for success (1 or 0)
and high bits for debug information if low bits are 0.
GFX11 MES doesn't do this so checking full 64-bit status return
for 1 or 0 is still valid.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Jesse.Zhang [Mon, 13 Oct 2025 05:46:12 +0000 (13:46 +0800)]
drm/amdgpu: Fix NULL pointer dereference in VRAM logic for APU devices
Previously, APU platforms (and other scenarios with uninitialized VRAM managers)
triggered a NULL pointer dereference in `ttm_resource_manager_usage()`. The root
cause is not that the `struct ttm_resource_manager *man` pointer itself is NULL,
but that `man->bdev` (the backing device pointer within the manager) remains
uninitialized (NULL) on APUs—since APUs lack dedicated VRAM and do not fully
set up VRAM manager structures. When `ttm_resource_manager_usage()` attempts to
acquire `man->bdev->lru_lock`, it dereferences the NULL `man->bdev`, leading to
a kernel OOPS.
1. **amdgpu_cs.c**: Extend the existing bandwidth control check in
`amdgpu_cs_get_threshold_for_moves()` to include a check for
`ttm_resource_manager_used()`. If the manager is not used (uninitialized
`bdev`), return 0 for migration thresholds immediately—skipping VRAM-specific
logic that would trigger the NULL dereference.
2. **amdgpu_kms.c**: Update the `AMDGPU_INFO_VRAM_USAGE` ioctl and memory info
reporting to use a conditional: if the manager is used, return the real VRAM
usage; otherwise, return 0. This avoids accessing `man->bdev` when it is
NULL.
3. **amdgpu_virt.c**: Modify the vf2pf (virtual function to physical function)
data write path. Use `ttm_resource_manager_used()` to check validity: if the
manager is usable, calculate `fb_usage` from VRAM usage; otherwise, set
`fb_usage` to 0 (APUs have no discrete framebuffer to report).
This approach is more robust than APU-specific checks because it:
- Works for all scenarios where the VRAM manager is uninitialized (not just APUs),
- Aligns with TTM's design by using its native helper function,
- Preserves correct behavior for discrete GPUs (which have fully initialized
`man->bdev` and pass the `ttm_resource_manager_used()` check).
v4: use ttm_resource_manager_used(&adev->mman.vram_mgr.manager) instead of checking the adev->gmc.is_app_apu flag (Christian)
Reviewed-by: Christian König <christian.koenig@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Christian König [Tue, 7 Oct 2025 08:10:52 +0000 (10:10 +0200)]
drm/amdgpu: hide VRAM sysfs attributes on GPUs without VRAM
Otherwise accessing them can cause a crash.
Signed-off-by: Christian König <christian.koenig@amd.com>
Tested-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sathishkumar S [Fri, 10 Oct 2025 18:02:40 +0000 (23:32 +0530)]
drm/amdgpu: fix bit shift logic
BIT_ULL(n) sets nth bit, remove explicit shift and set the position
Fixes:
a7a411e24626 ("drm/amdgpu: fix shift-out-of-bounds in amdgpu_debugfs_jpeg_sched_mask_set")
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Mon, 13 Oct 2025 06:06:42 +0000 (08:06 +0200)]
drm/amd/powerplay: Fix CIK shutdown temperature
Remove extra multiplication.
CIK GPUs such as Hawaii appear to use PP_TABLE_V0 in which case
the shutdown temperature is hardcoded in smu7_init_dpm_defaults
and is already multiplied by 1000. The value was mistakenly
multiplied another time by smu7_get_thermal_temperature_range.
Fixes:
4ba082572a42 ("drm/amd/powerplay: export the thermal ranges of VI asics (V2)")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1676
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 10 Oct 2025 20:40:57 +0000 (16:40 -0400)]
drm/amdgpu: drop unused structures in amdgpu_drm.h
These were never used and are duplicated with the
interface that is used. Maybe leftovers from a previous
revision of the patch that added them.
Fixes:
90c448fef312 ("drm/amdgpu: add new AMDGPU_INFO subquery for userq objects")
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Gui-Dong Han [Wed, 8 Oct 2025 03:43:27 +0000 (03:43 +0000)]
drm/amdgpu: use atomic functions with memory barriers for vm fault info
The atomic variable vm_fault_info_updated is used to synchronize access to
adev->gmc.vm_fault_info between the interrupt handler and
get_vm_fault_info().
The default atomic functions like atomic_set() and atomic_read() do not
provide memory barriers. This allows for CPU instruction reordering,
meaning the memory accesses to vm_fault_info and the vm_fault_info_updated
flag are not guaranteed to occur in the intended order. This creates a
race condition that can lead to inconsistent or stale data being used.
The previous implementation, which used an explicit mb(), was incomplete
and inefficient. It failed to account for all potential CPU reorderings,
such as the access of vm_fault_info being reordered before the atomic_read
of the flag. This approach is also more verbose and less performant than
using the proper atomic functions with acquire/release semantics.
Fix this by switching to atomic_set_release() and atomic_read_acquire().
These functions provide the necessary acquire and release semantics,
which act as memory barriers to ensure the correct order of operations.
It is also more efficient and idiomatic than using explicit full memory
barriers.
Fixes:
b97dfa27ef3a ("drm/amdgpu: save vm fault information for amdkfd")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 3 Sep 2025 17:48:23 +0000 (13:48 -0400)]
drm/amdgpu: set an error on all fences from a bad context
When we backup ring contents to reemit after a queue reset,
we don't backup ring contents from the bad context. When
we signal the fences, we should set an error on those
fences as well.
v2: misc cleanups
v3: add locking for fence error, fix comment (Christian)
v4: fix wrap around, locking (Christian)
Fixes:
77cc0da39c7c ("drm/amdgpu: track ring state associated with a fence")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Mon, 15 Sep 2025 16:37:32 +0000 (12:37 -0400)]
drm/amdgpu: handle wrap around in reemit handling
Compare the sequence numbers directly.
Fixes:
77cc0da39c7c ("drm/amdgpu: track ring state associated with a fence")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 26 Sep 2025 21:31:32 +0000 (17:31 -0400)]
drm/amdgpu: fix handling of harvesting for ip_discovery firmware
Chips which use the IP discovery firmware loaded by the driver
reported incorrect harvesting information in the ip discovery
table in sysfs because the driver only uses the ip discovery
firmware for populating sysfs and not for direct parsing for the
driver itself as such, the fields that are used to print the
harvesting info in sysfs report incorrect data for some IPs. Populate
the relevant fields for this case as well.
Fixes:
514678da56da ("drm/amdgpu/discovery: fix fw based ip discovery")
Acked-by: Tom St Denis <tom.stdenis@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Christian König [Mon, 22 Sep 2025 12:18:16 +0000 (14:18 +0200)]
drm/amdgpu: block CE CS if not explicitely allowed by module option
The Constant Engine found on gfx6-gfx10 HW has been a notorious source of
problems.
RADV never used it in the first place, radeonsi only used it for a few
releases around 2017 for gfx6-gfx9 before dropping support for it as
well.
While investigating another problem I just recently found that submitting
to the CE seems to be completely broken on gfx9 for quite a while.
Since nobody complained about that problem it most likely means that
nobody is using any of the affected radeonsi versions on current Linux
kernels any more.
So to potentially phase out the support for the CE and eliminate another
source of problems block submitting CE IBs unless it is enabled again
using a debug flag.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Christian König [Wed, 27 Aug 2025 12:47:23 +0000 (14:47 +0200)]
drm/amdgpu: remove two invalid BUG_ON()s
Those can be triggered trivially by userspace.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Fri, 26 Sep 2025 18:26:13 +0000 (20:26 +0200)]
drm/amd: Disable ASPM on SI
Enabling ASPM causes randoms hangs on Tahiti and Oland on Zen4.
It's unclear if this is a platform-specific or GPU-specific issue.
Disable ASPM on SI for the time being.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Fri, 26 Sep 2025 18:26:12 +0000 (20:26 +0200)]
drm/amd/pm: Disable MCLK switching on SI at high pixel clocks
On various SI GPUs, a flickering can be observed near the bottom
edge of the screen when using a single 4K 60Hz monitor over DP.
Disabling MCLK switching works around this problem.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Matthew Schwartz [Thu, 9 Oct 2025 12:19:00 +0000 (14:19 +0200)]
Revert "drm/amd/display: Only restore backlight after amdgpu_dm_init or dm_resume"
This fix regressed the original issue that commit
7875afafba84
("drm/amd/display: Fix brightness level not retained over reboot") solved,
so revert it until a different approach to solve the regression that
it caused with AMD_PRIVATE_COLOR is found.
Fixes:
a490c8d77d50 ("drm/amd/display: Only restore backlight after amdgpu_dm_init or dm_resume")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4620
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Schwartz <matthew.schwartz@linux.dev>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Linus Torvalds [Sun, 12 Oct 2025 20:42:36 +0000 (13:42 -0700)]
Linux 6.18-rc1
Linus Torvalds [Sun, 12 Oct 2025 20:27:56 +0000 (13:27 -0700)]
Merge tag 'i2c-for-6.18-rc1-hotfix' of git://git./linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
"One revert because of a regression in the I2C core which has sadly not
showed up during its time in -next"
* tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
Revert "i2c: boardinfo: Annotate code used in init phase only"
Linus Torvalds [Sun, 12 Oct 2025 15:45:52 +0000 (08:45 -0700)]
Merge tag 'irq_urgent_for_v6.18_rc1' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Skip interrupt ID 0 in sifive-plic during suspend/resume because
ID 0 is reserved and accessing reserved register space could result
in undefined behavior
- Fix a function's retval check in aspeed-scu-ic
* tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume
irqchip/aspeed-scu-ic: Fix an IS_ERR() vs NULL check
Linus Torvalds [Sat, 11 Oct 2025 23:06:04 +0000 (16:06 -0700)]
Merge tag 'trace-v6.18-3' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"The previous fix to trace_marker required updating trace_marker_raw as
well. The difference between trace_marker_raw from trace_marker is
that the raw version is for applications to write binary structures
directly into the ring buffer instead of writing ASCII strings. This
is for applications that will read the raw data from the ring buffer
and get the data structures directly. It's a bit quicker than using
the ASCII version.
Unfortunately, it appears that our test suite has several tests that
test writes to the trace_marker file, but lacks any tests to the
trace_marker_raw file (this needs to be remedied). Two issues came
about the update to the trace_marker_raw file that syzbot found:
- Fix tracing_mark_raw_write() to use per CPU buffer
The fix to use the per CPU buffer to copy from user space was
needed for both the trace_maker and trace_maker_raw file.
The fix for reading from user space into per CPU buffers properly
fixed the trace_marker write function, but the trace_marker_raw
file wasn't fixed properly. The user space data was correctly
written into the per CPU buffer, but the code that wrote into the
ring buffer still used the user space pointer and not the per CPU
buffer that had the user space data already written.
- Stop the fortify string warning from writing into trace_marker_raw
After converting the copy_from_user_nofault() into a memcpy(),
another issue appeared. As writes to the trace_marker_raw expects
binary data, the first entry is a 4 byte identifier. The entry
structure is defined as:
struct {
struct trace_entry ent;
int id;
char buf[];
};
The size of this structure is reserved on the ring buffer with:
size = sizeof(*entry) + cnt;
Then it is copied from the buffer into the ring buffer with:
memcpy(&entry->id, buf, cnt);
This use to be a copy_from_user_nofault(), but now converting it to
a memcpy() triggers the fortify-string code, and causes a warning.
The allocated space is actually more than what is copied, as the
cnt used also includes the entry->id portion. Allocating
sizeof(*entry) plus cnt is actually allocating 4 bytes more than
what is needed.
Change the size function to:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
And update the memcpy() to unsafe_memcpy()"
* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Stop fortify-string from warning in tracing_mark_raw_write()
tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
Linus Torvalds [Sat, 11 Oct 2025 22:47:12 +0000 (15:47 -0700)]
Merge tag 'kbuild-fixes-6.18-1' of git://git./linux/kernel/git/kbuild/linux
Pull Kbuild fixes from Nathan Chancellor:
- Fix UAPI types check in headers_check.pl
- Only enable -Werror for hostprogs with CONFIG_WERROR / W=e
- Ignore fsync() error when output of gen_init_cpio is a pipe
- Several little build fixes for recent modules.builtin.modinfo series
* tag 'kbuild-fixes-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kbuild: Use '--strip-unneeded-symbol' for removing module device table symbols
s390/vmlinux.lds.S: Move .vmlinux.info to end of allocatable sections
kbuild: Add '.rel.*' strip pattern for vmlinux
kbuild: Restore pattern to avoid stripping .rela.dyn from vmlinux
gen_init_cpio: Ignore fsync() returning EINVAL on pipes
scripts/Makefile.extrawarn: Respect CONFIG_WERROR / W=e for hostprogs
kbuild: uapi: Strip comments before size type check
Wolfram Sang [Sat, 11 Oct 2025 10:31:53 +0000 (12:31 +0200)]
Revert "i2c: boardinfo: Annotate code used in init phase only"
This reverts commit
1a2b423be6a89dd07d5fc27ea042be68697a6a49 because we
got a regression report and need time to find out the details.
Reported-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Closes: https://lore.kernel.org/r/
29ec0082-4dd4-4120-acd2-
44b35b4b9487@oss.qualcomm.com
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Linus Torvalds [Sat, 11 Oct 2025 18:56:47 +0000 (11:56 -0700)]
Merge tag 'rtc-6.18' of git://git./linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"This cycle, we have a new RTC driver, for the SpacemiT P1. The optee
driver gets alarm support. We also get a fix for a race condition that
was fairly rare unless while stress testing the alarms.
Subsystem:
- Fix race when setting alarm
- Ensure alarm irq is enabled when UIE is enabled
- remove unneeded 'fast_io' parameter in regmap_config
New driver:
- SpacemiT P1 RTC
Drivers:
- efi: Remove wakeup functionality
- optee: add alarms support
- s3c: Drop support for S3C2410
- zynqmp: Restore alarm functionality after kexec transition"
* tag 'rtc-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (29 commits)
rtc: interface: Ensure alarm irq is enabled when UIE is enabled
rtc: tps6586x: Fix initial enable_irq/disable_irq balance
rtc: cpcap: Fix initial enable_irq/disable_irq balance
rtc: isl12022: Fix initial enable_irq/disable_irq balance
rtc: interface: Fix long-standing race when setting alarm
rtc: pcf2127: fix watchdog interrupt mask on pcf2131
rtc: zynqmp: Restore alarm functionality after kexec transition
rtc: amlogic-a4: Optimize global variables
rtc: sd2405al: Add I2C address.
rtc: Kconfig: move symbols to proper section
rtc: optee: make optee_rtc_pm_ops static
rtc: optee: Fix error code in optee_rtc_read_alarm()
rtc: optee: fix error code in probe()
dt-bindings: rtc: Convert apm,xgene-rtc to DT schema
rtc: spacemit: support the SpacemiT P1 RTC
rtc: optee: add alarm related rtc ops to optee rtc driver
rtc: optee: remove unnecessary memory operations
rtc: optee: fix memory leak on driver removal
rtc: x1205: Fix Xicor X1205 vendor prefix
dt-bindings: rtc: Fix Xicor X1205 vendor prefix
...
Linus Torvalds [Sat, 11 Oct 2025 18:49:00 +0000 (11:49 -0700)]
Merge tag 'scsi-misc' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Fixes only in drivers (ufs, mvsas, qla2xxx, target) that came in just
before or during the merge window.
The most important one is the qla2xxx which reverts a conversion to
fix flexible array member warnings, that went up in this merge window
but which turned out on further testing to be causing data corruption"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Include UTP error in INT_FATAL_ERRORS
scsi: ufs: sysfs: Make HID attributes visible
scsi: mvsas: Fix use-after-free bugs in mvs_work_queue
scsi: ufs: core: Fix PM QoS mutex initialization
scsi: ufs: core: Fix runtime suspend error deadlock
Revert "scsi: qla2xxx: Fix memcpy() field-spanning write issue"
scsi: target: target_core_configfs: Add length check to avoid buffer overflow
Linus Torvalds [Sat, 11 Oct 2025 18:19:16 +0000 (11:19 -0700)]
Merge tag 'x86_core_for_v6.18_rc1' of git://git./linux/kernel/git/tip/tip
Pull more x86 updates from Borislav Petkov:
- Remove a bunch of asm implementing condition flags testing in KVM's
emulator in favor of int3_emulate_jcc() which is written in C
- Replace KVM fastops with C-based stubs which avoids problems with the
fastop infra related to latter not adhering to the C ABI due to their
special calling convention and, more importantly, bypassing compiler
control-flow integrity checking because they're written in asm
- Remove wrongly used static branches and other ugliness accumulated
over time in hyperv's hypercall implementation with a proper static
function call to the correct hypervisor call variant
- Add some fixes and modifications to allow running FRED-enabled
kernels in KVM even on non-FRED hardware
- Add kCFI improvements like validating indirect calls and prepare for
enabling kCFI with GCC. Add cmdline params documentation and other
code cleanups
- Use the single-byte 0xd6 insn as the official #UD single-byte
undefined opcode instruction as agreed upon by both x86 vendors
- Other smaller cleanups and touchups all over the place
* tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86,retpoline: Optimize patch_retpoline()
x86,ibt: Use UDB instead of 0xEA
x86/cfi: Remove __noinitretpoline and __noretpoline
x86/cfi: Add "debug" option to "cfi=" bootparam
x86/cfi: Standardize on common "CFI:" prefix for CFI reports
x86/cfi: Document the "cfi=" bootparam options
x86/traps: Clarify KCFI instruction layout
compiler_types.h: Move __nocfi out of compiler-specific header
objtool: Validate kCFI calls
x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y
x86/fred: Play nice with invoking asm_fred_entry_from_kvm() on non-FRED hardware
x86/fred: Install system vector handlers even if FRED isn't fully enabled
x86/hyperv: Use direct call to hypercall-page
x86/hyperv: Clean up hv_do_hypercall()
KVM: x86: Remove fastops
KVM: x86: Convert em_salc() to C
KVM: x86: Introduce EM_ASM_3WCL
KVM: x86: Introduce EM_ASM_1SRC2
KVM: x86: Introduce EM_ASM_2CL
KVM: x86: Introduce EM_ASM_2W
...
Linus Torvalds [Sat, 11 Oct 2025 17:51:14 +0000 (10:51 -0700)]
Merge tag 'x86_cleanups_for_v6.18_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- Simplify inline asm flag output operands now that the minimum
compiler version supports the =@ccCOND syntax
- Remove a bunch of AS_* Kconfig symbols which detect assembler support
for various instruction mnemonics now that the minimum assembler
version supports them all
- The usual cleanups all over the place
* tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Remove code depending on __GCC_ASM_FLAG_OUTPUTS__
x86/sgx: Use ENCLS mnemonic in <kernel/cpu/sgx/encls.h>
x86/mtrr: Remove license boilerplate text with bad FSF address
x86/asm: Use RDPKRU and WRPKRU mnemonics in <asm/special_insns.h>
x86/idle: Use MONITORX and MWAITX mnemonics in <asm/mwait.h>
x86/entry/fred: Push __KERNEL_CS directly
x86/kconfig: Remove CONFIG_AS_AVX512
crypto: x86 - Remove CONFIG_AS_VPCLMULQDQ
crypto: X86 - Remove CONFIG_AS_VAES
crypto: x86 - Remove CONFIG_AS_GFNI
x86/kconfig: Drop unused and needless config X86_64_SMP
Linus Torvalds [Sat, 11 Oct 2025 17:40:24 +0000 (10:40 -0700)]
Merge tag 'slab-for-6.18-rc1-hotfix' of git://git./linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
"A NULL pointer deref hotfix"
* tag 'slab-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: fix barn NULL pointer dereference on memoryless nodes
Linus Torvalds [Sat, 11 Oct 2025 17:31:38 +0000 (10:31 -0700)]
Merge tag 'bpf-fixes' of git://git./linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Finish constification of 1st parameter of bpf_d_path() (Rong Tao)
- Harden userspace-supplied xdp_desc validation (Alexander Lobakin)
- Fix metadata_dst leak in __bpf_redirect_neigh_v{4,6}() (Daniel
Borkmann)
- Fix undefined behavior in {get,put}_unaligned_be32() (Eric Biggers)
- Use correct context to unpin bpf hash map with special types (KaFai
Wan)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add test for unpinning htab with internal timer struct
bpf: Avoid RCU context warning when unpinning htab with internal structs
xsk: Harden userspace-supplied xdp_desc validation
bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
libbpf: Fix undefined behavior in {get,put}_unaligned_be32()
bpf: Finish constification of 1st parameter of bpf_d_path()
Linus Torvalds [Sat, 11 Oct 2025 17:27:52 +0000 (10:27 -0700)]
Merge tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git./linux/kernel/git/akpm/mm
Pull more updates from Andrew Morton:
"Just one series here - Mike Rappoport has taught KEXEC handover to
preserve vmalloc allocations across handover"
* tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt
kho: add support for preserving vmalloc allocations
kho: replace kho_preserve_phys() with kho_preserve_pages()
kho: check if kho is finalized in __kho_preserve_order()
MAINTAINERS, .mailmap: update Umang's email address
Linus Torvalds [Sat, 11 Oct 2025 17:14:55 +0000 (10:14 -0700)]
Merge tag 'mm-hotfixes-stable-2025-10-10-15-00' of git://git./linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"7 hotfixes. All 7 are cc:stable and all 7 are for MM.
All singletons, please see the changelogs for details"
* tag 'mm-hotfixes-stable-2025-10-10-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: hugetlb: avoid soft lockup when mprotect to large memory area
fsnotify: pass correct offset to fsnotify_mmap_perm()
mm/ksm: fix flag-dropping behavior in ksm_madvise
mm/damon/vaddr: do not repeat pte_offset_map_lock() until success
mm/rmap: fix soft-dirty and uffd-wp bit loss when remapping zero-filled mTHP subpage to shared zeropage
mm/thp: fix MTE tag mismatch when replacing zero-filled subpages
memcg: skip cgroup_file_notify if spinning is not allowed
Steven Rostedt [Sat, 11 Oct 2025 15:20:32 +0000 (11:20 -0400)]
tracing: Stop fortify-string from warning in tracing_mark_raw_write()
The way tracing_mark_raw_write() records its data is that it has the
following structure:
struct {
struct trace_entry;
int id;
char buf[];
};
But memcpy(&entry->id, buf, size) triggers the following warning when the
size is greater than the id:
------------[ cut here ]------------
memcpy: detected field-spanning write (size 6) of single field "&entry->id" at kernel/trace/trace.c:7458 (size 4)
WARNING: CPU: 7 PID: 995 at kernel/trace/trace.c:7458 write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
Modules linked in:
CPU: 7 UID: 0 PID: 995 Comm: bash Not tainted
6.17.0-test-00007-g60b82183e78a-dirty #211 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
Code: 04 00 75 a7 b9 04 00 00 00 48 89 de 48 89 04 24 48 c7 c2 e0 b1 d1 b2 48 c7 c7 40 b2 d1 b2 c6 05 2d 88 6a 04 01 e8 f7 e8 bd ff <0f> 0b 48 8b 04 24 e9 76 ff ff ff 49 8d 7c 24 04 49 8d 5c 24 08 48
RSP: 0018:
ffff888104c3fc78 EFLAGS:
00010292
RAX:
0000000000000000 RBX:
0000000000000006 RCX:
0000000000000000
RDX:
0000000000000000 RSI:
1ffffffff6b363b4 RDI:
0000000000000001
RBP:
ffff888100058a00 R08:
ffffffffb041d459 R09:
ffffed1020987f40
R10:
0000000000000007 R11:
0000000000000001 R12:
ffff888100bb9010
R13:
0000000000000000 R14:
00000000000003e3 R15:
ffff888134800000
FS:
00007fa61d286740(0000) GS:
ffff888286cad000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000560d28d509f1 CR3:
00000001047a4006 CR4:
0000000000172ef0
Call Trace:
<TASK>
tracing_mark_raw_write+0x1fe/0x290
? __pfx_tracing_mark_raw_write+0x10/0x10
? security_file_permission+0x50/0xf0
? rw_verify_area+0x6f/0x4b0
vfs_write+0x1d8/0xdd0
? __pfx_vfs_write+0x10/0x10
? __pfx_css_rstat_updated+0x10/0x10
? count_memcg_events+0xd9/0x410
? fdget_pos+0x53/0x5e0
ksys_write+0x182/0x200
? __pfx_ksys_write+0x10/0x10
? do_user_addr_fault+0x4af/0xa30
do_syscall_64+0x63/0x350
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fa61d318687
Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
RSP: 002b:
00007ffd87fe0120 EFLAGS:
00000202 ORIG_RAX:
0000000000000001
RAX:
ffffffffffffffda RBX:
00007fa61d286740 RCX:
00007fa61d318687
RDX:
0000000000000006 RSI:
0000560d28d509f0 RDI:
0000000000000001
RBP:
0000560d28d509f0 R08:
0000000000000000 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000202 R12:
0000000000000006
R13:
00007fa61d4715c0 R14:
00007fa61d46ee80 R15:
0000000000000000
</TASK>
---[ end trace
0000000000000000 ]---
This is because fortify string sees that the size of entry->id is only 4
bytes, but it is writing more than that. But this is OK as the
dynamic_array is allocated to handle that copy.
The size allocated on the ring buffer was actually a bit too big:
size = sizeof(*entry) + cnt;
But cnt includes the 'id' and the buffer data, so adding cnt to the size
of *entry actually allocates too much on the ring buffer.
Change the allocation to:
size = struct_size(entry, buf, cnt - sizeof(entry->id));
and the memcpy() to unsafe_memcpy() with an added justification.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011112032.77be18e4@gandalf.local.home
Fixes:
64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/
68e973f5.
050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Vlastimil Babka [Sat, 11 Oct 2025 08:45:41 +0000 (10:45 +0200)]
slab: fix barn NULL pointer dereference on memoryless nodes
Phil reported a boot failure once sheaves become used in commits
59faa4da7cd4 ("maple_tree: use percpu sheaves for maple_node_cache") and
3accabda4da1 ("mm, vma: use percpu sheaves for vm_area_struct cache"):
BUG: kernel NULL pointer dereference, address:
0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:
ffffd2d10950bdb0 EFLAGS:
00010246
RAX:
0000000000000000 RBX:
ffff8a775dab74b0 RCX:
00000000ffffffff
RDX:
0000000000000cc0 RSI:
ffff8a6800804000 RDI:
ffff8a680004e300
RBP:
ffffd2d10950be40 R08:
0000000000000060 R09:
ffffffffb9367388
R10:
00000000000149e8 R11:
ffff8a6f87a38000 R12:
0000000000000cc0
R13:
0000000000000cc0 R14:
ffff8a680004e300 R15:
00000000000000c0
FS:
0000000000000000(0000) GS:
ffff8a77a3541000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000040 CR3:
0000000e1aa24000 CR4:
00000000003506f0
Call Trace:
<TASK>
? srso_return_thunk+0x5/0x5f
? vm_area_alloc+0x1e/0x60
kmem_cache_alloc_noprof+0x4ec/0x5b0
vm_area_alloc+0x1e/0x60
create_init_stack_vma+0x26/0x210
alloc_bprm+0x139/0x200
kernel_execve+0x4a/0x140
call_usermodehelper_exec_async+0xd0/0x190
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork+0xf0/0x110
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in:
CR2:
0000000000000040
---[ end trace
0000000000000000 ]---
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:
ffffd2d10950bdb0 EFLAGS:
00010246
RAX:
0000000000000000 RBX:
ffff8a775dab74b0 RCX:
00000000ffffffff
RDX:
0000000000000cc0 RSI:
ffff8a6800804000 RDI:
ffff8a680004e300
RBP:
ffffd2d10950be40 R08:
0000000000000060 R09:
ffffffffb9367388
R10:
00000000000149e8 R11:
ffff8a6f87a38000 R12:
0000000000000cc0
R13:
0000000000000cc0 R14:
ffff8a680004e300 R15:
00000000000000c0
FS:
0000000000000000(0000) GS:
ffff8a77a3541000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000040 CR3:
0000000e1aa24000 CR4:
00000000003506f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---
And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such
that memory is only on 2 of them."
# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
node 1 size: 31584 MB
node 1 free: 30397 MB
node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
node 5 size: 32214 MB
node 5 free: 31625 MB
node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
node 7 size: 0 MB
node 7 free: 0 MB
Linus decoded the stacktrace to get_barn() and get_node() and determined
that kmem_cache->node[numa_mem_id()] is NULL.
The problem is due to a wrong assumption that memoryless nodes only
exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id()
points to the nearest node that has memory. SLUB has been allocating its
kmem_cache_node structures only on nodes with memory and so it does with
struct node_barn.
For kmem_cache_node, get_partial_node() checks if get_node() result is
not NULL, which I assumed was for protection from a bogus node id passed
to kmalloc_node() but apparently it's also for systems where
numa_mem_id() (used when no specific node is given) might return a
memoryless node.
Fix the sheaves code the same way by checking the result of get_node()
and bailing out if it's NULL. Note that cpus on such memoryless nodes
will have degraded sheaves performance, which can be improved later,
preferably by making numa_mem_id() work properly on such systems.
Fixes:
2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves")
Reported-and-tested-by: Phil Auld <pauld@redhat.com>
Closes: https://lore.kernel.org/all/
20251010151116.GA436967@pauld.westford.csb/
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Steven Rostedt [Sat, 11 Oct 2025 03:51:42 +0000 (23:51 -0400)]
tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
The fix to use a per CPU buffer to read user space tested only the writes
to trace_marker. But it appears that the selftests are missing tests to
the trace_maker_raw file. The trace_maker_raw file is used by applications
that writes data structures and not strings into the file, and the tools
read the raw ring buffer to process the structures it writes.
The fix that reads the per CPU buffers passes the new per CPU buffer to
the trace_marker file writes, but the update to the trace_marker_raw write
read the data from user space into the per CPU buffer, but then still used
then passed the user space address to the function that records the data.
Pass in the per CPU buffer and not the user space address.
TODO: Add a test to better test trace_marker_raw.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011035243.386098147@kernel.org
Fixes:
64cf7d058a00 ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/
68e973f5.
050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Nathan Chancellor [Fri, 10 Oct 2025 21:49:27 +0000 (14:49 -0700)]
kbuild: Use '--strip-unneeded-symbol' for removing module device table symbols
After commit
5ab23c7923a1 ("modpost: Create modalias for builtin
modules"), relocatable RISC-V kernels with CONFIG_KASAN=y start failing
when attempting to strip the module device table symbols:
riscv64-linux-objcopy: not stripping symbol `__mod_device_table__kmod_irq_starfive_jh8100_intc__of__starfive_intc_irqchip_match_table' because it is named in a relocation
make[4]: *** [scripts/Makefile.vmlinux:97: vmlinux] Error 1
The relocation appears to come from .LASANLOC5 in .data.rel.local:
$ llvm-objdump --disassemble-symbols=.LASANLOC5 --disassemble-all -r drivers/irqchip/irq-starfive-jh8100-intc.o
drivers/irqchip/irq-starfive-jh8100-intc.o: file format elf64-littleriscv
Disassembly of section .data.rel.local:
0000000000000180 <.LASANLOC5>:
...
1d0: 0000 unimp
00000000000001d0: R_RISCV_64 __mod_device_table__kmod_irq_starfive_jh8100_intc__of__starfive_intc_irqchip_match_table
...
This section appears to come from GCC for including additional
information about global variables that may be protected by KASAN.
There appears to be no way to opt out of the generation of these symbols
through either a flag or attribute. Attempting to remove '.LASANLOC*'
with '--strip-symbol' results in the same error as above because these
symbols may refer to (thus have relocation between) each other.
Avoid this build breakage by switching to '--strip-unneeded-symbol' for
removing __mod_device_table__ symbols, as it will only remove the symbol
when there is no relocation pointing to it. While this may result in a
little more bloat in the symbol table in certain configurations, it is
not as bad as outright build failures.
Fixes:
5ab23c7923a1 ("modpost: Create modalias for builtin modules")
Reported-by: Charles Mirabile <cmirabil@redhat.com>
Closes: https://lore.kernel.org/
20251007011637.
2512413-1-cmirabil@redhat.com/
Suggested-by: Alexey Gladkov <legion@kernel.org>
Tested-by: Nicolas Schier <nsc@kernel.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Linus Torvalds [Fri, 10 Oct 2025 21:06:02 +0000 (14:06 -0700)]
Merge tag 'for-6.18/hpfs-changes' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull hpfs updates from Mikulas Patocka:
- Avoid -Wflex-array-member-not-at-end warnings
- Replace simple_strtoul with kstrtoint
- Fix error code for new_inode() failure
* tag 'for-6.18/hpfs-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
fs/hpfs: Fix error code for new_inode() failure in mkdir/create/mknod/symlink
hpfs: Replace simple_strtoul with kstrtoint in hpfs_parse_param
fs: hpfs: Avoid multiple -Wflex-array-member-not-at-end warnings
Linus Torvalds [Fri, 10 Oct 2025 21:02:14 +0000 (14:02 -0700)]
Merge tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel
Pull more drm fixes from Dave Airlie:
"Just the follow up fixes for rc1 from the next branch, amdgpu and xe
mostly with a single v3d fix in there.
amdgpu:
- DC DCE6 fixes
- GPU reset fixes
- Secure diplay messaging cleanup
- MES fix
- GPUVM locking fixes
- PMFW messaging cleanup
- PCI US/DS switch handling fix
- VCN queue reset fix
- DC FPU handling fix
- DCN 3.5 fix
- DC mirroring fix
amdkfd:
- Fix kfd process ref leak
- mmap write lock handling fix
- Fix comments in IOCTL
xe:
- Fix build with clang 16
- Fix handling of invalid configfs syntax usage and spell out the
expected syntax in the documentation
- Do not try late bind firmware when running as VF since it shouldn't
handle firmware loading
- Fix idle assertion for local BOs
- Fix uninitialized variable for late binding
- Do not require perfmon_capable to expose free memory at page
granularity. Handle it like other drm drivers do
- Fix lock handling on suspend error path
- Fix I2C controller resume after S3
v3d:
- fix fence locking"
* tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel: (34 commits)
drm/amd/display: Incorrect Mirror Cositing
drm/amd/display: Enable Dynamic DTBCLK Switch
drm/amdgpu: Report individual reset error
drm/amdgpu: partially revert "revert to old status lock handling v3"
drm/amd/display: Fix unsafe uses of kernel mode FPU
drm/amd/pm: Disable VCN queue reset on SMU v13.0.6 due to regression
drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machine
drm/amdgpu: Check swus/ds for switch state save
drm/amdkfd: Fix two comments in kfd_ioctl.h
drm/amd/pm: Avoid interface mismatch messaging
drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_init
drm/amd/amdgpu: Fix the mes version that support inv_tlbs
drm/amd: Check whether secure display TA loaded successfully
drm/amdkfd: Fix mmap write lock not release
drm/amdkfd: Fix kfd process ref leaking when userptr unmapping
drm/amdgpu: Fix for GPU reset being blocked by KIQ I/O.
drm/amd/display: Disable scaling on DCE6 for now
drm/amd/display: Properly disable scaling on DCE6
drm/amd/display: Properly clear SCL_*_FILTER_CONTROL on DCE6
drm/amd/display: Add missing DCE6 SCL_HORZ_FILTER_INIT* SRIs
...