Paolo Bonzini [Fri, 30 May 2025 07:45:13 +0000 (03:45 -0400)]
rtmutex_api: provide correct extern functions
Commit fb49f07ba1d9 ("locking/mutex: implement mutex_lock_killable_nest_lock")
changed the set of functions that mutex.c defines when CONFIG_DEBUG_LOCK_ALLOC
is set.
- it removed the "extern" declaration of mutex_lock_killable_nested from
include/linux/mutex.h, and replaced it with a macro since it could be
treated as a special case of _mutex_lock_killable. It also removed a
definition of the function in kernel/locking/mutex.c.
- likewise, it replaced mutex_trylock() with the more generic
mutex_trylock_nest_lock(), and turned mutex_trylock() itself into a macro.
However, it left the old definitions in place in kernel/locking/rtmutex_api.c,
which causes failures when building with CONFIG_RT_MUTEXES=y. Bring over
the changes.
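For reference, the DEBUG_LOCK_ALLOC side of the API now has roughly the
following shape (a sketch based on the description above, not a verbatim
copy of include/linux/mutex.h):

  extern int __must_check _mutex_lock_killable(struct mutex *lock,
          unsigned int subclass, struct lockdep_map *nest_lock);
  extern int _mutex_trylock_nest_lock(struct mutex *lock,
          struct lockdep_map *nest_lock);

  /* mutex_lock_killable_nested() is now a special case of _mutex_lock_killable() */
  #define mutex_lock_killable_nested(lock, subclass) \
          _mutex_lock_killable(lock, subclass, NULL)

  /* ... and mutex_trylock() a special case of _mutex_trylock_nest_lock() */
  #define mutex_trylock(lock) _mutex_trylock_nest_lock(lock, NULL)

kernel/locking/rtmutex_api.c has to define the matching set of symbols,
otherwise the CONFIG_RT_MUTEXES=y build breaks.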
Fixes: fb49f07ba1d9 ("locking/mutex: implement mutex_lock_killable_nest_lock")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 28 May 2025 17:21:16 +0000 (13:21 -0400)]
Merge tag 'kvm-s390-next-6.16-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
* Fix interaction between some filesystems and Secure Execution
* Some cleanups and refactorings, preparing for an upcoming big series
Claudio Imbrenda [Wed, 28 May 2025 09:55:02 +0000 (11:55 +0200)]
KVM: s390: Simplify and move pv code
All functions in kvm/gmap.c fit better in kvm/pv.c. Move and rename them
appropriately, then delete the now-empty kvm/gmap.c and kvm/gmap.h.
Reviewed-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20250528095502.226213-5-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250528095502.226213-5-imbrenda@linux.ibm.com>
Claudio Imbrenda [Wed, 28 May 2025 09:55:01 +0000 (11:55 +0200)]
KVM: s390: Refactor and split some gmap helpers
Refactor some gmap functions; move the implementation into a separate
file with only helper functions. The new helper functions work on vm
addresses, leaving all gmap logic in the gmap functions, which mostly
become just wrappers.
The whole gmap handling is going to be moved inside KVM soon, but the
helper functions need to touch core mm functions, and thus need to
stay in the core kernel.
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20250528095502.226213-4-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250528095502.226213-4-imbrenda@linux.ibm.com>
Claudio Imbrenda [Wed, 28 May 2025 09:55:00 +0000 (11:55 +0200)]
KVM: s390: Remove unneeded srcu lock
All paths leading to handle_essa() already hold kvm->srcu, so remove
the unneeded srcu locking from handle_essa().
Add a lockdep assertion to make sure we will always be holding kvm->srcu
when entering handle_essa().
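The assertion itself is a one-liner; kvm->srcu is an srcu_struct, whose
dep_map lockdep can check directly:

  /* All callers enter handle_essa() with kvm->srcu held; assert it
   * instead of taking and releasing it again */
  lockdep_assert_held(&vcpu->kvm->srcu);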
Reviewed-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20250528095502.226213-3-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250528095502.226213-3-imbrenda@linux.ibm.com>
Claudio Imbrenda [Wed, 28 May 2025 09:54:59 +0000 (11:54 +0200)]
s390: Remove unneeded includes
Many files don't need to include asm/tlb.h or asm/gmap.h.
On the other hand, asm/tlb.h does need to include asm/gmap.h.
Remove all unneeded includes so that asm/tlb.h is not directly used by
s390 arch code anymore. Remove asm/gmap.h from a few other files as
well, so that now only KVM code, mm/gmap.c, and asm/tlb.h include it.
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20250528095502.226213-2-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250528095502.226213-2-imbrenda@linux.ibm.com>
David Hildenbrand [Fri, 16 May 2025 12:39:46 +0000 (14:39 +0200)]
s390/uv: Improve splitting of large folios that cannot be split while dirty
Currently, starting a PV VM on an iomap-based filesystem with large
folio support, such as XFS, will not work. We'll be stuck in
unpack_one()->gmap_make_secure(), because we can't seem to make progress
splitting the large folio.
The problem is that we require a writable PTE, but under such filesystems
a writable PTE implies a dirty folio.
So whenever we have a writable PTE, we'll have a dirty folio, and dirty
iomap folios cannot currently get split, because
split_folio()->split_huge_page_to_list_to_order()->filemap_release_folio()
will fail in iomap_release_folio().
So we will not make any progress splitting such large folios.
Until dirty folios can be split more reliably, let's manually trigger
writeback of the problematic folio using
filemap_write_and_wait_range(), and retry the split immediately
afterwards exactly once, before looking up the folio again.
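A minimal sketch of that retry logic, with a hypothetical helper name (the
real code is in arch/s390/kernel/uv.c and handles references and locking
more carefully):

  /* Write the dirty folio back so iomap_release_folio() can succeed,
   * then retry the split exactly once. Sketch only. */
  static int writeback_and_retry_split(struct address_space *mapping,
                                       struct folio *folio)
  {
          loff_t pos = folio_pos(folio);
          int rc;

          rc = filemap_write_and_wait_range(mapping, pos,
                                            pos + folio_size(folio) - 1);
          if (rc)
                  return rc;

          folio_lock(folio);
          rc = split_folio(folio);  /* now clean, so the split can proceed */
          folio_unlock(folio);
          return rc;
  }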
Should this logic be part of split_folio()? Likely not; most split users
don't have to split so eagerly to make any progress.
For now, this seems to affect xfs, zonefs and erofs, and this patch
makes it work again (tested on xfs only).
While this could be considered a fix for commit 6795801366da ("xfs:
Support large folios"), commit df2f9708ff1f ("zonefs: enable support for
large folios") and commit ce529cc25b18 ("erofs: enable large folios for
iomap mode"), before commit eef88fe45ac9 ("s390/uv: Split large folios in
gmap_make_secure()") we did not try splitting large folios at all. So it's
all rather part of making SE compatible with file systems that support
large folios. But to have some "Fixes:" tag, let's just use eef88fe45ac9.
Not CCing stable, because there are a lot of dependencies, and the feature
simply not working is not critical in stable kernels.
Reported-by: Sebastian Mitterle <smitterl@redhat.com>
Closes: https://issues.redhat.com/browse/RHEL-58218
Fixes: eef88fe45ac9 ("s390/uv: Split large folios in gmap_make_secure()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250516123946.1648026-4-david@redhat.com
Message-ID: <20250516123946.1648026-4-david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
David Hildenbrand [Fri, 16 May 2025 12:39:45 +0000 (14:39 +0200)]
s390/uv: Always return 0 from s390_wiggle_split_folio() if successful
Let's consistently return 0 if the operation was successful, and just
detect ourselves whether splitting is required -- folio_test_large() is
a cheap operation.
Update the documentation.
Should we simply always return -EAGAIN instead of 0, so we don't have
to handle it in the caller? Not sure, staring at the documentation, this
way looks a bit cleaner.
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250516123946.1648026-3-david@redhat.com
Message-ID: <20250516123946.1648026-3-david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
David Hildenbrand [Fri, 16 May 2025 12:39:44 +0000 (14:39 +0200)]
s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful
If s390_wiggle_split_folio() returns 0 because splitting a large folio
succeeded, we will return 0 from make_hva_secure() even though a retry
is required. Return -EAGAIN in that case.
Otherwise, we'll return 0 from gmap_make_secure(), and consequently from
unpack_one(). In kvm_s390_pv_unpack(), we assume that unpacking
succeeded and skip unpacking this page. Later on, we run into issues
and fail booting the VM.
So far, this issue was only observed with follow-up patches where we
split large pagecache XFS folios. Maybe it can also be triggered with
shmem?
We'll cleanup s390_wiggle_split_folio() a bit next, to also return 0
if no split was required.
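In other words, the caller's handling becomes something like this sketch
(using the function names from the commit message):

  rc = s390_wiggle_split_folio(mm, folio, split);
  if (!rc)
          rc = -EAGAIN;  /* split succeeded: caller must look up the page again */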
Fixes: d8dfda5af0be ("KVM: s390: pv: fix race when making a page secure")
Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250516123946.1648026-2-david@redhat.com
Message-ID: <20250516123946.1648026-2-david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Paolo Bonzini [Tue, 27 May 2025 16:17:06 +0000 (12:17 -0400)]
Merge branch 'kvm-lockdep-common' into HEAD
Introduce new mutex locking functions mutex_trylock_nest_lock() and
mutex_lock_killable_nest_lock() and use them to clean up locking
of all vCPUs for a VM.
For x86, this removes some complex code that was used instead
of lockdep's "nest_lock" feature.
For ARM and RISC-V, this removes a lockdep warning when the VM is
configured to have more than MAX_LOCK_DEPTH vCPUs, and removes a fair
amount of duplicate code by sharing the logic across all architectures.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 28 May 2025 08:34:30 +0000 (10:34 +0200)]
rust: add helper for mutex_trylock
After commit c5b6ababd21a ("locking/mutex: implement mutex_trylock_nested",
currently in the KVM tree), mutex_trylock() will be a macro when lockdep is
enabled. Rust therefore needs the corresponding helper. Just add it and
the rust/bindings/bindings_helpers_generated.rs Makefile rules will do
their thing.
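The helper follows the usual pattern of wrapping a macro in a real C symbol
that bindgen can bind to (a sketch; the exact helper file under
rust/helpers/ may differ):

  /* Give Rust a callable symbol for what is now a C macro */
  int rust_helper_mutex_trylock(struct mutex *lock)
  {
          return mutex_trylock(lock);
  }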
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250528083431.1875345-1-pbonzini@redhat.com>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 12 May 2025 18:04:07 +0000 (14:04 -0400)]
RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs
Use kvm_trylock_all_vcpus instead of a custom implementation when locking
all vCPUs of a VM.
Compile tested only.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Anup Patel <anup@brainfault.org>
Tested-by: Anup Patel <anup@brainfault.org>
Message-ID: <20250512180407.659015-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 12 May 2025 18:04:06 +0000 (14:04 -0400)]
KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs
Use kvm_trylock_all_vcpus instead of a custom implementation when locking
all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in
which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs.
This fixes the following false lockdep warning:
[ 328.171264] BUG: MAX_LOCK_DEPTH too low!
[ 328.175227] turning off the locking correctness validator.
[ 328.180726] Please attach the output of /proc/lock_stat to the bug report
[ 328.187531] depth: 48 max: 48!
[ 328.190678] 48 locks held by qemu-kvm/11664:
[ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 12 May 2025 18:04:05 +0000 (14:04 -0400)]
x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
Use kvm_lock_all_vcpus instead of sev's own implementation.
Because kvm_lock_all_vcpus uses the _nest_lock feature of lockdep, which
ignores subclasses, there is no longer a need to use separate subclasses
for source and target VMs.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 12 May 2025 18:04:04 +0000 (14:04 -0400)]
KVM: add kvm_lock_all_vcpus and kvm_trylock_all_vcpus
In a few cases, usually in the initialization code, KVM locks all vCPUs
of a VM to ensure that userspace doesn't do funny things while KVM performs
an operation that affects the whole VM.
Until now, all these operations were implemented using custom code,
and all of them share the same problem:
Lockdep can't cope with simultaneous locking of a large number of locks of
the same class.
However, if these locks are taken while another lock is already held,
which is luckily the case here, it is possible to take advantage of the
little-known _nest_lock feature of lockdep, which then allows an
unlimited number of locks of the same class to be taken.
To implement this, create two functions:
kvm_lock_all_vcpus() and kvm_trylock_all_vcpus()
Both functions are needed because some of the code that will be replaced
in subsequent patches uses mutex_trylock instead of regular mutex_lock.
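A sketch of the trylock variant, showing where nest_lock enters the
picture (simplified; the real implementation in virt/kvm/kvm_main.c may
differ in details):

  int kvm_trylock_all_vcpus(struct kvm *kvm)
  {
          struct kvm_vcpu *vcpu;
          unsigned long i, j;

          lockdep_assert_held(&kvm->lock);

          kvm_for_each_vcpu(i, vcpu, kvm)
                  /* nest_lock: lockdep counts these under kvm->lock
                   * instead of pushing one held_lock per vCPU */
                  if (!mutex_trylock_nest_lock(&vcpu->mutex, &kvm->lock))
                          goto out_unlock;
          return 0;

  out_unlock:
          kvm_for_each_vcpu(j, vcpu, kvm) {
                  if (i == j)
                          break;
                  mutex_unlock(&vcpu->mutex);
          }
          return -EINTR;
  }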
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 12 May 2025 18:04:03 +0000 (14:04 -0400)]
locking/mutex: implement mutex_lock_killable_nest_lock
KVM's SEV intra-host migration code needs to lock all vCPUs
of the source and the target VM, before it proceeds with the migration.
The number of vCPUs that belong to each VM is not bounded by anything
except a self-imposed KVM limit of CONFIG_KVM_MAX_NR_VCPUS vCPUs, which is
significantly larger than the depth of lockdep's lock stack.
Luckily, the locks in both of the cases mentioned above are held under
the 'kvm->lock' of each VM, which means that we can use the little-known
lockdep feature called a "nest_lock" to support this use case in a
cleaner way than it's currently done.
Implement and expose 'mutex_lock_killable_nest_lock' for this
purpose.
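Under CONFIG_DEBUG_LOCK_ALLOC the new API pairs the mutex with the
already-held nest lock, roughly like this sketch of the macro:

  #define mutex_lock_killable_nest_lock(lock, nest_lock)          \
  (                                                               \
          typecheck(struct mutex *, nest_lock),                   \
          _mutex_lock_killable(lock, 0, &(nest_lock)->dep_map)    \
  )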
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Maxim Levitsky [Mon, 12 May 2025 18:04:02 +0000 (14:04 -0400)]
locking/mutex: implement mutex_trylock_nested
Despite the fact that several lockdep-related checks are skipped when
calling trylock* versions of the locking primitives, for example
mutex_trylock, each time the mutex is acquired a held_lock is still
placed onto the lockdep stack by __lock_acquire(), which is called
regardless of whether the trylock* or regular locking API was used.
This means that if the caller successfully acquires more than
MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock,
lockdep will still complain that the maximum depth of the held lock stack
has been reached and disable itself.
For example, the following error currently occurs in the ARM version
of KVM, once the code tries to lock all vCPUs of a VM configured with more
than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern
systems, where having more than 48 CPUs is common, and it's also common to
run VMs that have vCPU counts approaching that number:
[ 328.171264] BUG: MAX_LOCK_DEPTH too low!
[ 328.175227] turning off the locking correctness validator.
[ 328.180726] Please attach the output of /proc/lock_stat to the bug report
[ 328.187531] depth: 48 max: 48!
[ 328.190678] 48 locks held by qemu-kvm/11664:
[ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
Luckily, in all instances that require locking all vCPUs, the
'kvm->lock' is taken a priori, and that fact makes it possible to use
the little-known lockdep feature called a 'nest_lock' to avoid this
warning and the subsequent lockdep self-disablement.
When a 'nest_lock' is provided to lockdep's lock_acquire(), lockdep
detects that the top of the held-lock stack contains a lock of the same
class, and then increments its reference counter instead of pushing a
new held_lock item onto that stack.
See __lock_acquire() for more information.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 27 May 2025 16:15:49 +0000 (12:15 -0400)]
Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
- Add support for ALLOWED_SEV_FEATURES.
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
- Don't account temporary allocations in sev_send_update_data().
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
Paolo Bonzini [Tue, 27 May 2025 16:15:38 +0000 (12:15 -0400)]
Merge tag 'kvm-x86-vmx-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.16:
- Explicitly check MSR load/store list counts to fix a potential overflow on
32-bit kernels.
- Flush shadow VMCSes on emergency reboot.
- Revert mem_enc_ioctl() back to an optional hook, as it's nullified when
SEV or TDX is disabled via Kconfig.
- Macrofy the handling of vt_x86_ops to eliminate a pile of boilerplate code
needed for TDX, and to optimize CONFIG_KVM_INTEL_TDX=n builds.
Paolo Bonzini [Tue, 27 May 2025 16:15:26 +0000 (12:15 -0400)]
Merge tag 'kvm-x86-selftests-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM selftests changes for 6.16:
- Add support for SNP to the various SEV selftests.
- Add a selftest to verify fastops instructions via forced emulation.
- Add MGLRU support to the access tracking perf test.
Paolo Bonzini [Tue, 27 May 2025 16:15:01 +0000 (12:15 -0400)]
Merge tag 'kvm-x86-pir-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM x86 posted interrupt changes for 6.16:
Refine and optimize KVM's software processing of the PIR, and ultimately share
PIR harvesting code between KVM and the kernel's Posted MSI handler
Paolo Bonzini [Tue, 27 May 2025 16:14:47 +0000 (12:14 -0400)]
Merge tag 'kvm-x86-mmu-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.16:
- Refine and harden handling of spurious faults.
- Use kvm_x86_call() instead of open coding static_call().
Paolo Bonzini [Tue, 27 May 2025 16:14:36 +0000 (12:14 -0400)]
Merge tag 'kvm-x86-misc-6.16' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.16:
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
- Advertise support to userspace for WRMSRNS and PREFETCHI.
- Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
- Add a module param to control and enumerate support for device posted
interrupts.
- Misc cleanups.
Edward Adam Davis [Tue, 27 May 2025 08:44:37 +0000 (16:44 +0800)]
KVM: VMX: use __always_inline for is_td_vcpu and is_td
is_td() and is_td_vcpu() are used in no-instrumentation sections; use
__always_inline instead of inline.
vmlinux.o: error: objtool: vmx_handle_nmi+0x47: call to is_td_vcpu.isra.0() leaves .noinstr.text section
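A sketch of the change (the real bodies live in KVM's TDX headers; only
the attribute swap matters here):

  /* inline is only a hint; __always_inline guarantees no out-of-line
   * copy, which is required for callers in .noinstr.text */
  static __always_inline bool is_td(struct kvm *kvm)
  {
          return kvm->arch.vm_type == KVM_X86_TDX_VM;
  }

  static __always_inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
  {
          return is_td(vcpu->kvm);
  }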
Fixes: 7172c753c26a ("KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct")
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Message-ID: <tencent_1A767567C83C1137829622362E4A72756F09@qq.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 26 May 2025 20:31:13 +0000 (16:31 -0400)]
x86/tdx: mark tdh_vp_enter() as __flatten
In some cases tdx_tdvpr_pa() is not fully inlined into tdh_vp_enter(), which
causes the following warning:
vmlinux.o: warning: objtool: tdh_vp_enter+0x8: call to tdx_tdvpr_pa() leaves .noinstr.text section
This happens if the compiler considers tdx_tdvpr_pa() to be "large", for example
because CONFIG_SPARSEMEM adds two function calls to page_to_section() and
__section_mem_map_addr():
  ({ const struct page *__pg = (pg);                                  \
     int __sec = page_to_section(__pg);                               \
     (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \
  })
Because exiting the noinstr section is a no-no, just mark tdh_vp_enter() for
full inlining.
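The fix then amounts to a single attribute (an approximation of the
function for illustration; the actual definition lives in the x86 TDX
host code):

  /* __flatten inlines the entire call tree, so tdx_tdvpr_pa() cannot be
   * left as an out-of-line call from .noinstr.text */
  noinstr __flatten u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *args)
  {
          args->rcx = tdx_tdvpr_pa(td);

          return __seamcall_saved_ret(TDH_VP_ENTER, args);
  }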
Reported-by: kernel test robot <lkp@intel.com>
Analyzed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505240530.5KktQ5mX-lkp@intel.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 26 May 2025 20:27:00 +0000 (16:27 -0400)]
Merge tag 'kvm-riscv-6.16-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv changes for 6.16
- Add vector registers to get-reg-list selftest
- VCPU reset related improvements
- Remove scounteren initialization from VCPU reset
- Support VCPU reset from userspace using set_mpstate() ioctl
Paolo Bonzini [Mon, 26 May 2025 20:19:46 +0000 (16:19 -0400)]
Merge tag 'kvmarm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.16
* New features:
- Add large stage-2 mapping support for non-protected pKVM guests,
clawing back some performance.
- Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
protected modes.
- Enable nested virtualisation support on systems that support it
(yes, it has been a long time coming), though it is disabled by
default.
* Improvements, fixes and cleanups:
- Large rework of the way KVM tracks architecture features and links
them with the effects of control bits. This ensures correctness of
emulation (the data is automatically extracted from the published
JSON files), and helps dealing with the evolution of the
architecture.
- Significant changes to the way pKVM tracks ownership of pages,
avoiding page table walks by storing the state in the hypervisor's
vmemmap. This in turn enables the THP support described above.
- New selftest checking the pKVM ownership transition rules
- Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
even if the host didn't have it.
- Fixes for the address translation emulation, which happened to be
rather buggy in some specific contexts.
- Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
from the number of counters exposed to a guest and addressing a
number of issues in the process.
- Add a new selftest for the SVE host state being corrupted by a
guest.
- Keep HCR_EL2.xMO set at all times for systems running with the
kernel at EL2, ensuring that the window for interrupts is slightly
bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
- Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
from a pretty bad case of TLB corruption unless accesses to HCR_EL2
are heavily synchronised.
- Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
tables in a human-friendly fashion.
- and the usual random cleanups.
Paolo Bonzini [Mon, 26 May 2025 20:12:13 +0000 (16:12 -0400)]
Merge tag 'loongarch-kvm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.16
1. Don't flush tlb if HW PTW supported.
2. Add LoongArch KVM selftests support.
Paolo Bonzini [Mon, 26 May 2025 20:12:01 +0000 (16:12 -0400)]
Documentation: virt/kvm: remove unreferenced footnote
Replace it with just the URL.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Radim Krčmář [Fri, 23 May 2025 10:47:28 +0000 (12:47 +0200)]
RISC-V: KVM: lock the correct mp_state during reset
Currently, the kvm_riscv_vcpu_sbi_system_reset() function locks
vcpu->arch.mp_state_lock when updating tmp->arch.mp_state.mp_state,
which is incorrect. Fix it by taking the mp_state_lock of the vCPU
whose state is actually being updated.
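A sketch of the corrected locking, taking the lock of the vCPU actually
being written (the STOPPED state is an assumption for illustration):

  spin_lock(&tmp->arch.mp_state_lock);
  WRITE_ONCE(tmp->arch.mp_state.mp_state, KVM_MP_STATE_STOPPED);
  spin_unlock(&tmp->arch.mp_state_lock);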
Fixes: 2121cadec45a ("RISCV: KVM: Introduce mp_state_lock to avoid lock inversion")
Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20250523104725.2894546-4-rkrcmar@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Marc Zyngier [Fri, 23 May 2025 09:59:43 +0000 (10:59 +0100)]
Merge branch kvm-arm64/misc-6.16 into kvmarm-master/next
* kvm-arm64/misc-6.16:
: .
: Misc changes and improvements for 6.16:
:
: - Add a new selftest for the SVE host state being corrupted by a guest
:
: - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2,
: ensuring that the window for interrupts is slightly bigger, and avoiding
: a pretty bad erratum on the AmpereOne HW
:
: - Replace a couple of open-coded on/off strings with str_on_off()
:
: - Get rid of the pKVM memblock sorting, which now appears to be superfluous
:
: - Drop superfluous clearing of ICH_LR_EOI in the LR when nesting
:
: - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from
: a pretty bad case of TLB corruption unless accesses to HCR_EL2 are
: heavily synchronised
:
: - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables
: in a human-friendly fashion
: .
KVM: arm64: Fix documentation for vgic_its_iter_next()
KVM: arm64: vgic-its: Add debugfs interface to expose ITS tables
arm64: errata: Work around AmpereOne's erratum AC04_CPU_23
KVM: arm64: nv: Remove clearing of ICH_LR<n>.EOI if ICH_LR<n>.HW == 1
KVM: arm64: Drop sort_memblock_regions()
KVM: arm64: selftests: Add test for SVE host corruption
KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode
KVM: arm64: Replace ternary flags with str_on_off() helper
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Fri, 23 May 2025 09:58:57 +0000 (10:58 +0100)]
Merge branch kvm-arm64/nv-nv into kvmarm-master/next
* kvm-arm64/nv-nv:
: .
: Flick the switch on the NV support by adding the missing piece
: in the form of the VNCR page management. From the cover letter:
:
: "This is probably the most interesting bit of the whole NV adventure.
: So far, everything else has been a walk in the park, but this one is
: where the real fun takes place.
:
: With FEAT_NV2, most of the NV support revolves around tricking a guest
: into accessing memory while it tries to access system registers. The
: hypervisor's job is to handle the context switch of the actual
: registers with the state in memory as needed."
: .
KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
KVM: arm64: Document NV caps and vcpu flags
KVM: arm64: Allow userspace to request KVM_ARM_VCPU_EL2*
KVM: arm64: nv: Remove dead code from ERET handling
KVM: arm64: nv: Plumb TLBI S1E2 into system instruction dispatch
KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2
KVM: arm64: nv: Program host's VNCR_EL2 to the fixmap address
KVM: arm64: nv: Handle VNCR_EL2 invalidation from MMU notifiers
KVM: arm64: nv: Handle mapping of VNCR_EL2 at EL2
KVM: arm64: nv: Handle VNCR_EL2-triggered faults
KVM: arm64: nv: Add userspace and guest handling of VNCR_EL2
KVM: arm64: nv: Add pseudo-TLB backing VNCR_EL2
KVM: arm64: nv: Don't adjust PSTATE.M when L2 is nesting
KVM: arm64: nv: Move TLBI range decoding to a helper
KVM: arm64: nv: Snapshot S1 ASID tagging information during walk
KVM: arm64: nv: Extract translation helper from the AT code
KVM: arm64: nv: Allocate VNCR page when required
arm64: sysreg: Add layout for VNCR_EL2
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Fri, 23 May 2025 09:58:34 +0000 (10:58 +0100)]
Merge branch kvm-arm64/at-fixes-6.16 into kvmarm-master/next
* kvm-arm64/at-fixes-6.16:
: .
: Set of fixes for Address Translation (AT) instruction emulation,
: which affect the (not yet upstream) NV support.
:
: From the cover letter:
:
: "Here's a small series of fixes for KVM's implementation of address
: translation (aka the AT S1* instructions), addressing a number of
: issues in increasing levels of severity:
:
: - We misreport PAR_EL1.PTW in a number of occasions, including state
: that is not possible as per the architecture definition
:
: - We don't handle access faults at all, and that doesn't play very
: well with the rest of the VNCR stuff
:
: - AT S1E{0,1} from EL2 with HCR_EL2.{E2H,TGE}={1,1} will absolutely
: take the host down, no questions asked"
: .
KVM: arm64: Don't feed uninitialised data to HCR_EL2
KVM: arm64: Teach address translation about access faults
KVM: arm64: Fix PAR_EL1.{PTW,S} reporting on AT S1E*
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Fri, 23 May 2025 09:58:15 +0000 (10:58 +0100)]
Merge branch kvm-arm64/fgt-masks into kvmarm-master/next
* kvm-arm64/fgt-masks: (43 commits)
: .
: Large rework of the way KVM deals with trap bits in conjunction with
: the CPU feature registers. It now draws a direct link between which
: the feature set, the system registers that need to UNDEF to match
: the configuration and bits that need to behave as RES0 or RES1 in
: the trap registers that are visible to the guest.
:
: Best of all, these definitions are mostly automatically generated
: from the JSON description published by ARM under a permissive
: license.
: .
KVM: arm64: Handle TSB CSYNC traps
KVM: arm64: Add FGT descriptors for FEAT_FGT2
KVM: arm64: Allow sysreg ranges for FGT descriptors
KVM: arm64: Add context-switch for FEAT_FGT2 registers
KVM: arm64: Add trap routing for FEAT_FGT2 registers
KVM: arm64: Add sanitisation for FEAT_FGT2 registers
KVM: arm64: Add FEAT_FGT2 registers to the VNCR page
KVM: arm64: Use HCR_EL2 feature map to drive fixed-value bits
KVM: arm64: Use HCRX_EL2 feature map to drive fixed-value bits
KVM: arm64: Allow kvm_has_feat() to take variable arguments
KVM: arm64: Use FGT feature maps to drive RES0 bits
KVM: arm64: Validate FGT register descriptions against RES0 masks
KVM: arm64: Switch to table-driven FGU configuration
KVM: arm64: Handle PSB CSYNC traps
KVM: arm64: Use KVM-specific HCRX_EL2 RES0 mask
KVM: arm64: Remove hand-crafted masks for FGT registers
KVM: arm64: Use computed FGT masks to setup FGT registers
KVM: arm64: Propagate FGT masks to the nVHE hypervisor
KVM: arm64: Unconditionally configure fine-grain traps
KVM: arm64: Use computed masks as sanitisers for FGT registers
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Fri, 23 May 2025 09:57:44 +0000 (10:57 +0100)]
Merge branch kvm-arm64/mte-frac into kvmarm-master/next
* kvm-arm64/mte-frac:
: .
: Prevent FEAT_MTE_ASYNC from being accidently exposed to a guest,
: courtesy of Ben Horgan. From the cover letter:
:
: "The ID_AA64PFR1_EL1.MTE_frac field is currently hidden from KVM.
: However, when ID_AA64PFR1_EL1.MTE==2, ID_AA64PFR1_EL1.MTE_frac==0
: indicates that MTE_ASYNC is supported. On a host with
: ID_AA64PFR1_EL1.MTE==2 but without MTE_ASYNC support a guest with the
: MTE capability enabled will incorrectly see MTE_ASYNC advertised as
: supported. This series fixes that."
: .
KVM: selftests: Confirm exposing MTE_frac does not break migration
KVM: arm64: Make MTE_frac masking conditional on MTE capability
arm64/sysreg: Expose MTE_frac so that it is visible to KVM
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Fri, 23 May 2025 09:57:32 +0000 (10:57 +0100)]
Merge branch kvm-arm64/ubsan-el2 into kvmarm-master/next
* kvm-arm64/ubsan-el2:
: .
: Add UBSAN support to the EL2 portion of KVM, reusing most of the
: existing logic provided by CONFIG_UBSAN_TRAP.
:
: Patches courtesy of Mostafa Saleh.
: .
KVM: arm64: Handle UBSAN faults
KVM: arm64: Introduce CONFIG_UBSAN_KVM_EL2
ubsan: Remove regs from report_ubsan_failure()
arm64: Introduce esr_is_ubsan_brk()
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Fri, 23 May 2025 09:56:25 +0000 (10:56 +0100)]
Merge branch kvm-arm64/pkvm-np-thp-6.16 into kvmarm-master/next
* kvm-arm64/pkvm-np-thp-6.16: (21 commits)
: .
: Large mapping support for non-protected pKVM guests, courtesy of
: Vincent Donnefort. From the cover letter:
:
: "This series adds support for stage-2 huge mappings (PMD_SIZE) to pKVM
: np-guests, that is installing PMD-level mappings in the stage-2,
: whenever the stage-1 is backed by either Hugetlbfs or THPs."
: .
KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
KVM: arm64: Stage-2 huge mappings for np-guests
KVM: arm64: Add a range to pkvm_mappings
KVM: arm64: Convert pkvm_mappings to interval tree
KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
KVM: arm64: Add a range to __pkvm_host_unshare_guest()
KVM: arm64: Add a range to __pkvm_host_share_guest()
KVM: arm64: Introduce for_each_hyp_page
KVM: arm64: Handle huge mappings for np-guest CMOs
KVM: arm64: Extend pKVM selftest for np-guests
KVM: arm64: Selftest for pKVM transitions
KVM: arm64: Don't WARN from __pkvm_host_share_guest()
KVM: arm64: Add .hyp.data section
KVM: arm64: Unconditionally cross check hyp state
KVM: arm64: Defer EL2 stage-1 mapping on share
KVM: arm64: Move hyp state to hyp_vmemmap
KVM: arm64: Introduce {get,set}_host_state() helpers
KVM: arm64: Use 0b11 for encoding PKVM_NOPAGE
KVM: arm64: Fix pKVM page-tracking comments
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Thu, 22 May 2025 08:15:02 +0000 (09:15 +0100)]
KVM: arm64: Fix documentation for vgic_its_iter_next()
As reported by the build robot, the documentation for vgic_its_iter_next()
contains a typo. Fix it.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505221421.KAuWlmSr-lkp@intel.com/
Signed-off-by: Marc Zyngier <maz@kernel.org>
Vincent Donnefort [Wed, 21 May 2025 12:48:34 +0000 (13:48 +0100)]
KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
With the introduction of stage-2 huge mappings in the pKVM hypervisor,
guest-page CMOs are needed at PMD_SIZE granularity. The fixmap only
supports PAGE_SIZE, and iterating over the huge page is time consuming
(mostly due to the TLBI in hyp_fixmap_unmap), which is a problem for EL2
latency.
Introduce a shared PMD_SIZE fixmap (hyp_fixblock_map/hyp_fixblock_unmap)
to improve guest page CMOs when stage-2 huge mappings are installed.
On a Pixel6, the iterative solution resulted in a latency of ~700us,
while the PMD_SIZE fixmap reduces it to ~100us.
Because of the horrendous private range allocation that would be
necessary, this is disabled on 64KiB-page systems.
Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-11-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Vincent Donnefort [Wed, 21 May 2025 12:48:33 +0000 (13:48 +0100)]
KVM: arm64: Stage-2 huge mappings for np-guests
Now that np-guest hypercalls with a range are supported, we can let the
hypervisor install block mappings whenever the stage-1 allows it, that
is, when it is backed by either Hugetlbfs or THPs. The size of those
block mappings is limited to PMD_SIZE.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-10-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Quentin Perret [Wed, 21 May 2025 12:48:32 +0000 (13:48 +0100)]
KVM: arm64: Add a range to pkvm_mappings
In preparation for supporting stage-2 huge mappings for np-guest, add a
nr_pages member for pkvm_mappings to allow EL1 to track the size of the
stage-2 mapping.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-9-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Quentin Perret [Wed, 21 May 2025 12:48:31 +0000 (13:48 +0100)]
KVM: arm64: Convert pkvm_mappings to interval tree
In preparation for supporting stage-2 huge mappings for np-guest, let's
convert pgt.pkvm_mappings to an interval tree.
No functional change intended.
Suggested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-8-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Vincent Donnefort [Wed, 21 May 2025 12:48:30 +0000 (13:48 +0100)]
KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
In preparation for supporting stage-2 huge mappings for np-guests, add a
nr_pages argument to the __pkvm_host_test_clear_young_guest hypercall.
This range supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is,
512 on a 4K-pages system).
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-7-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Vincent Donnefort [Wed, 21 May 2025 12:48:29 +0000 (13:48 +0100)]
KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
In preparation for supporting stage-2 huge mappings for np-guests, add a
nr_pages argument to the __pkvm_host_wrprotect_guest hypercall. This
range supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is, 512
on a 4K-pages system).
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-6-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Vincent Donnefort [Wed, 21 May 2025 12:48:28 +0000 (13:48 +0100)]
KVM: arm64: Add a range to __pkvm_host_unshare_guest()
In preparation for supporting stage-2 huge mappings for np-guests, add a
nr_pages argument to the __pkvm_host_unshare_guest hypercall. This range
supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is, 512 on a
4K-pages system).
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-5-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Vincent Donnefort [Wed, 21 May 2025 12:48:27 +0000 (13:48 +0100)]
KVM: arm64: Add a range to __pkvm_host_share_guest()
In preparation for supporting stage-2 huge mappings for np-guests, add a
nr_pages argument to the __pkvm_host_share_guest hypercall. This range
supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is, 512 on a
4K-pages system).
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-4-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Vincent Donnefort [Wed, 21 May 2025 12:48:26 +0000 (13:48 +0100)]
KVM: arm64: Introduce for_each_hyp_page
Add a helper to iterate over the hypervisor vmemmap. This will be
particularly handy with the introduction of huge mapping support
for the np-guest stage-2.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-3-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Vincent Donnefort [Wed, 21 May 2025 12:48:25 +0000 (13:48 +0100)]
KVM: arm64: Handle huge mappings for np-guest CMOs
clean_dcache_guest_page() and invalidate_icache_guest_page() accept a
size as an argument. But they also rely on fixmap, which can only map a
single PAGE_SIZE page.
With the upcoming stage-2 huge mappings for pKVM np-guests, those
callbacks will get size > PAGE_SIZE. Loop the CMOs on a PAGE_SIZE basis
until the whole range is done.
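A sketch of that loop, assuming the hyp fixmap helpers (the real
callbacks live in the pKVM hypervisor's mem_protect code):

  /* CMO a range one PAGE_SIZE chunk at a time, since the fixmap can
   * only map a single page */
  static void clean_dcache_guest_page(void *va, size_t size)
  {
          while (size) {
                  __clean_dcache_guest_page(hyp_fixmap_map(__hyp_pa(va)),
                                            PAGE_SIZE);
                  hyp_fixmap_unmap();
                  va += PAGE_SIZE;
                  size -= PAGE_SIZE;
          }
  }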
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-2-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 21 May 2025 13:33:43 +0000 (14:33 +0100)]
Merge branch kvm-arm64/pkvm-selftest-6.16 into kvm-arm64/pkvm-np-thp-6.16
* kvm-arm64/pkvm-selftest-6.16:
: .
: pKVM selftests covering the memory ownership transitions by
: Quentin Perret. From the initial cover letter:
:
: "We have recently found a bug [1] in the pKVM memory ownership
: transitions by code inspection, but it could have been caught with a
: test.
:
: Introduce a boot-time selftest exercising all the known pKVM memory
: transitions and importantly checks the rejection of illegal transitions.
:
: The new test is hidden behind a new Kconfig option separate from
: CONFIG_EL2_NVHE_DEBUG on purpose as that has side effects on the
: transition checks ([1] doesn't reproduce with EL2 debug enabled).
:
: [1] https://lore.kernel.org/kvmarm/20241128154406.602875-1-qperret@google.com/"
: .
KVM: arm64: Extend pKVM selftest for np-guests
KVM: arm64: Selftest for pKVM transitions
KVM: arm64: Don't WARN from __pkvm_host_share_guest()
KVM: arm64: Add .hyp.data section
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 21 May 2025 13:33:39 +0000 (14:33 +0100)]
Merge branch kvm-arm64/pkvm-6.16 into kvm-arm64/pkvm-np-thp-6.16
* kvm-arm64/pkvm-6.16:
: .
: pKVM memory management cleanups, courtesy of Quentin Perret.
: From the cover letter:
:
: "This series moves the hypervisor's ownership state to the hyp_vmemmap,
: as discussed in [1]. The two main benefits are:
:
: 1. much cheaper hyp state lookups, since we can avoid the hyp stage-1
: page-table walk;
:
: 2. de-correlates the hyp state from the presence of a mapping in the
: linear map range of the hypervisor; which enables a bunch of
: clean-ups in the existing code and will simplify the introduction of
: other features in the future (hyp tracing, ...)"
: .
KVM: arm64: Unconditionally cross check hyp state
KVM: arm64: Defer EL2 stage-1 mapping on share
KVM: arm64: Move hyp state to hyp_vmemmap
KVM: arm64: Introduce {get,set}_host_state() helpers
KVM: arm64: Use 0b11 for encoding PKVM_NOPAGE
KVM: arm64: Fix pKVM page-tracking comments
KVM: arm64: Track SVE state in the hypervisor vcpu structure
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 21 May 2025 10:04:11 +0000 (11:04 +0100)]
KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
The conversion to kvm_release_faultin_page() missed the requirement
for this to be called within a critical section with mmu_lock held
for write. Move this call up to satisfy this requirement.
Fixes: 069a05e535496 ("KVM: arm64: nv: Handle VNCR_EL2-triggered faults")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 21 May 2025 09:58:29 +0000 (10:58 +0100)]
KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
Calling invalidate_vncr_va() without the mmu_lock held for write
is a bad idea, and lockdep tells you about that.
Fixes: 4ffa72ad8f37e ("KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Tue, 20 May 2025 14:41:16 +0000 (15:41 +0100)]
KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
When translating a VNCR translation fault, we start by marking the
current SW-managed TLB as invalid, so that we can populate it
in place. This is, however, done without the mmu_lock held.
A consequence of this is that another CPU dealing with TLBI
emulation can observe a translation still flagged as valid, but
with invalid walk results (such as pgshift being 0). Bad things
can result from this, such as a BUG() in pgshift_level_to_ttl().
Fix it by taking the mmu_lock for write to perform this local
invalidation, and use invalidate_vncr() instead of open-coding
the write to the 'valid' flag.
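In sketch form, the local invalidation now happens under the lock
(invalidate_vncr() and the mmu_lock are the names used in the commit; the
surrounding fault-handling code is elided):

  write_lock(&vcpu->kvm->mmu_lock);
  invalidate_vncr(vt);  /* mark the SW-managed TLB entry invalid in place */
  write_unlock(&vcpu->kvm->mmu_lock);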
Fixes: 069a05e535496 ("KVM: arm64: nv: Handle VNCR_EL2-triggered faults")
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250520144116.3667978-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Radim Krčmář [Thu, 15 May 2025 14:37:25 +0000 (16:37 +0200)]
RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
Add a toggleable VM capability to reset the VCPU from userspace by
setting MP_STATE_INIT_RECEIVED through the IOCTL.
Reset through an mp_state to avoid adding a new IOCTL.
Do not reset on a transition from STOPPED to RUNNABLE, because it's
better to avoid side effects that would complicate userspace adoption.
MP_STATE_INIT_RECEIVED is not a permanent mp_state -- the IOCTL resets
the VCPU while preserving the original mp_state -- because we wouldn't
gain much from having a new state in the rest of KVM, and it would be a
very non-standard use of the IOCTL.
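From userspace, the flow looks roughly like this (illustrative only,
error handling omitted; the capability and mp_state names come from the
commit):

  struct kvm_enable_cap cap = {
          .cap = KVM_CAP_RISCV_MP_STATE_RESET,
          .args = { 1 },  /* enable reset-via-mp_state for this VM */
  };
  ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

  struct kvm_mp_state mp_state = { .mp_state = KVM_MP_STATE_INIT_RECEIVED };
  ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp_state);  /* resets the vCPU; its
                                                   original mp_state is kept */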
Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20250515143723.2450630-5-rkrcmar@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Atish Patra [Thu, 15 May 2025 23:11:18 +0000 (16:11 -0700)]
RISC-V: KVM: Remove scounteren initialization
Scounteren CSR controls the direct access the hpmcounters and cycle/
instret/time from the userspace. It's the supervisor's responsibility
to set it up correctly for it's user space. They hypervisor doesn't
need to decide the policy on behalf of the supervisor.
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20250515-fix_scounteren_vs-v3-1-729dc088943e@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Radim Krčmář [Thu, 3 Apr 2025 11:25:22 +0000 (13:25 +0200)]
KVM: RISC-V: remove unnecessary SBI reset state
The SBI reset state has only two variables -- pc and a1.
The rest is known, so keep only the necessary information.
The reset structures make sense if we want userspace to control the
reset state (which we do), but I'd still remove them now and reintroduce
them with the userspace interface later -- we could probably have just a
single reset state per VM, instead of a reset state for each VCPU.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
Link: https://lore.kernel.org/r/20250403112522.1566629-6-rkrcmar@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Radim Krčmář [Thu, 3 Apr 2025 11:25:21 +0000 (13:25 +0200)]
KVM: RISC-V: refactor sbi reset request
The same code is used twice and SBI reset sets only two variables.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
Link: https://lore.kernel.org/r/20250403112522.1566629-5-rkrcmar@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Radim Krčmář [Thu, 3 Apr 2025 11:25:20 +0000 (13:25 +0200)]
KVM: RISC-V: refactor vector state reset
Do not depend on the reset structures.
vector.datap is a kernel memory pointer that needs to be preserved as it
is not a part of the guest vector data.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
Link: https://lore.kernel.org/r/20250403112522.1566629-4-rkrcmar@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Atish Patra [Mon, 5 May 2025 19:46:53 +0000 (12:46 -0700)]
RISC-V: KVM: Remove experimental tag for RISC-V
The RISC-V KVM port is no longer experimental, so remove the
experimental tag to avoid confusion.
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20250505-kvm_tag_change-v1-1-6dbf6af240af@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Atish Patra [Wed, 30 Apr 2025 08:16:30 +0000 (01:16 -0700)]
KVM: riscv: selftests: Add vector extension tests
Add vector-related tests using the standard ISA extension template.
However, the vector registers are a bit tricky, as the register length
varies based on the system's vlenb value. That's why the macros are
defined with a default and overridden with the actual value at runtime.
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20250430-kvm_selftest_improve-v3-3-eea270ff080b@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Atish Patra [Wed, 30 Apr 2025 08:16:29 +0000 (01:16 -0700)]
KVM: riscv: selftests: Decode stval to identify exact exception type
Currently, sbi_pmu_test continues if the exception type is an illegal
instruction, because accessing an hpmcounter will generate one. However,
an illegal instruction exception may occur for other reasons, which
should result in a test assertion.
Use stval to decode the exact type of instruction and, for CSR access
instructions, which CSRs are being accessed. Assert in all cases except
for CSR access instructions that access valid PMU-related registers.
Take this opportunity to remove the CSR_CYCLEH reference as the test is
compiled for RV64 only.
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20250430-kvm_selftest_improve-v3-2-eea270ff080b@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Atish Patra [Wed, 30 Apr 2025 08:16:28 +0000 (01:16 -0700)]
KVM: riscv: selftests: Align the trap information with pt_regs
The current exception register structure in the selftests is missing a
few registers (e.g. stval). Instead of adding them manually, change
ex_regs to align with pt_regs to make it future proof.
Suggested-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20250430-kvm_selftest_improve-v3-1-eea270ff080b@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Bibo Mao [Tue, 20 May 2025 12:20:26 +0000 (20:20 +0800)]
KVM: selftests: Add supported test cases for LoongArch
Some common KVM test cases are supported on LoongArch now as following:
coalesced_io_test
demand_paging_test
dirty_log_perf_test
dirty_log_test
guest_print_test
hardware_disable_test
kvm_binary_stats_test
kvm_create_max_vcpus
kvm_page_table_test
memslot_modification_stress_test
memslot_perf_test
set_memory_region_test
Other test cases, such as rseq_test, are not supported on LoongArch,
since they are not supported on LoongArch physical machines either.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Tue, 20 May 2025 12:20:26 +0000 (20:20 +0800)]
KVM: selftests: Add ucall test support for LoongArch
Add ucall test support for LoongArch. The ucall method on LoongArch uses
an undefined MMIO area, which causes the vCPU to exit to the hypervisor
so that the hypervisor can communicate with the vCPU.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Tue, 20 May 2025 12:20:26 +0000 (20:20 +0800)]
KVM: selftests: Add core KVM selftests support for LoongArch
Add core KVM selftests support for LoongArch, including the exception
handler, MMU page table setup, and vCPU startup entry support.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Tue, 20 May 2025 12:20:23 +0000 (20:20 +0800)]
KVM: selftests: Add KVM selftests header files for LoongArch
Add KVM selftests header files for LoongArch, including processor.h and
kvm_util_arch.h. They mainly contain the LoongArch CSR register and page
table entry definitions.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Tue, 20 May 2025 12:20:23 +0000 (20:20 +0800)]
KVM: selftests: Add VM_MODE_P47V47_16K VM mode
On LoongArch systems, 16K pages are generally used, and both the GVA and
GPA widths are 47 bits; add the new VM mode VM_MODE_P47V47_16K
accordingly.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Tue, 20 May 2025 12:20:18 +0000 (20:20 +0800)]
LoongArch: KVM: Do not flush tlb if HW PTW supported
With HW PTW supported, an invalid TLB entry is not added when a page
fault happens. But for the EXCCODE_TLBM exception, a stale TLB entry may
exist because of the last read access. Thus a TLB flush operation is
necessary for the EXCCODE_TLBM exception, but not for other types of
page fault exceptions.
With SW PTW supported, an invalid TLB entry is added in the TLB refill
exception, so a TLB flush operation is necessary for all types of page
fault exceptions.
Remove the unnecessary TLB flush operation when HW PTW is supported.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Tue, 20 May 2025 12:20:18 +0000 (20:20 +0800)]
LoongArch: KVM: Add ecode parameter for exception handlers
Some KVM exception types share the same exception handler. To
distinguish them, add ecode (the exception code) as a new parameter to
the exception handlers.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Nikunj A Dadhania [Fri, 2 May 2025 05:03:46 +0000 (05:03 +0000)]
KVM: selftests: Add test to verify KVM_CAP_X86_BUS_LOCK_EXIT
Add a test case to verify x86's bus lock exit functionality, which is now
supported on both Intel and AMD. Trigger bus lock exits by performing a
split-lock access, i.e. an atomic access that splits two cache lines.
Verify that the correct number of bus lock exits are generated, and that
the counter is incremented correctly and at the appropriate time based on
the underlying architecture.
Generate bus locks in both L1 and L2 (if nested virtualization is enabled),
as SVM's functionality in particular requires non-trivial logic to do the
right thing when running nested VMs.
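The core of such a test is the split-lock access itself, which can be
generated with a locked RMW that straddles a cache-line boundary (a
sketch; the selftest's actual guest code may differ):

  /* A locked RMW that crosses a 64-byte cache-line boundary triggers a
   * bus lock on both Intel and AMD */
  static char buf[128] __attribute__((aligned(64)));

  static void guest_generate_bus_lock(void)
  {
          long *val = (long *)(buf + 60);  /* 8-byte access crossing offset 64 */

          __atomic_fetch_add(val, 1, __ATOMIC_SEQ_CST);
  }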
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Co-developed-by: Manali Shukla <manali.shukla@amd.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-6-manali.shukla@amd.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Manali Shukla [Fri, 2 May 2025 05:03:45 +0000 (05:03 +0000)]
KVM: SVM: Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs
Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.
The biggest difference between the two features is that Threshold is
fault-like, whereas Detection is trap-like. To allow the guest to make
forward progress, Threshold provides a per-VMCB counter which is
decremented every time a bus lock occurs, and a VM-Exit is triggered if
and only if the counter is '0'.
To provide Detection-like semantics, initialize the counter to '0', i.e.
exit on every bus lock, and when re-executing the guilty instruction, set
the counter to '1' to effectively step past the instruction.
Note, in the unlikely scenario that re-executing the instruction doesn't
trigger a bus lock, e.g. because the guest has changed memory types or
patched the guilty instruction, the bus lock counter will be left at '1',
i.e. the guest will be able to do a bus lock on a different instruction.
In a perfect world, KVM would ensure the counter is '0' if the guest has
made forward progress, e.g. if RIP has changed. But trying to close that
hole would incur non-trivial complexity, for marginal benefit; the intent
of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace rate-limit bus locks,
not to allow for precise detection of problematic guest code. And, it's
simply not feasible to fully close the hole, e.g. if an interrupt arrives
before the original instruction can re-execute, the guest could step past
a different bus lock.
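A hedged sketch of the counter dance described above (bus_lock_counter
follows the APM's Bus Lock Threshold description; the helpers and
accessors here are illustrative):

  /* A threshold of zero means the very next bus lock takes a VM-Exit. */
  static void svm_arm_bus_lock_exit(struct vcpu_svm *svm)
  {
          svm->vmcb->control.bus_lock_counter = 0;
  }

  static int bus_lock_exit_handler(struct kvm_vcpu *vcpu)
  {
          /* One "credit" lets the guilty instruction retire on re-entry;
           * the counter then hits zero again and exits resume. */
          to_svm(vcpu)->vmcb->control.bus_lock_counter = 1;

          vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
          return 0;       /* head out to userspace */
  }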
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
Link: https://lore.kernel.org/r/20250502050346.14274-5-manali.shukla@amd.com
[sean: fix typo in comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Jing Zhang [Thu, 20 Feb 2025 22:42:46 +0000 (14:42 -0800)]
KVM: arm64: vgic-its: Add debugfs interface to expose ITS tables
This commit introduces a debugfs interface to display the contents of the
VGIC Interrupt Translation Service (ITS) tables.
The ITS tables map Device/Event IDs to Interrupt IDs and target processors.
Exposing this information through debugfs allows for easier inspection and
debugging of the interrupt routing configuration.
The debugfs interface presents the ITS table data in a tabular format:
Device ID: 0x0, Event ID Range: [0 - 31]
EVENT_ID INTID HWINTID TARGET COL_ID HW
-----------------------------------------------
0 8192 0 0 0 0
1 8193 0 0 0 0
2 8194 0 2 2 0
Device ID: 0x18, Event ID Range: [0 - 3]
EVENT_ID INTID HWINTID TARGET COL_ID HW
-----------------------------------------------
0 8225 0 0 0 0
1 8226 0 1 1 0
2 8227 0 3 3 0
Device ID: 0x10, Event ID Range: [0 - 7]
EVENT_ID INTID HWINTID TARGET COL_ID HW
-----------------------------------------------
0 8229 0 3 3 1
1 8230 0 0 0 1
2 8231 0 1 1 1
3 8232 0 2 2 1
4 8233 0 3 3 1
The output is generated using the seq_file interface, allowing for efficient
handling of potentially large ITS tables.
This interface is read-only and does not allow modification of the ITS
tables. It is intended for debugging and informational purposes only.
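For illustration, the debugfs plumbing is roughly of the following shape
(a minimal single_open()-style sketch; the real code iterates the tables
with seq_operations, and the names here are illustrative):

  #include <linux/debugfs.h>
  #include <linux/seq_file.h>

  static int vgic_its_tables_show(struct seq_file *s, void *unused)
  {
          seq_puts(s, "EVENT_ID INTID  HWINTID   TARGET   COL_ID HW\n");
          /* ... walk the ITS tables, one seq_printf() per ITE ... */
          return 0;
  }
  DEFINE_SHOW_ATTRIBUTE(vgic_its_tables);

  /* Registered with something like:
   * debugfs_create_file("its-tables", 0444, parent, its,
   *                     &vgic_its_tables_fops);
   */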
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20250220224247.2017205-1-jingzhangos@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
D Scott Phillips [Tue, 13 May 2025 18:45:14 +0000 (11:45 -0700)]
arm64: errata: Work around AmpereOne's erratum AC04_CPU_23
On AmpereOne AC04, updates to HCR_EL2 can rarely corrupt simultaneous
translations for data addresses initiated by load/store instructions.
Only instruction-initiated translations are vulnerable, not translations
from prefetches, for example. A DSB before the store to HCR_EL2 is
sufficient to prevent older instructions from hitting the window for
corruption, and an ISB after is sufficient to prevent younger
instructions from hitting the window for corruption.
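In other words, every HCR_EL2 write gets bracketed along these lines (a
sketch; the kernel applies the workaround through its errata machinery
rather than an open-coded helper, and the nsh domain choice is an
assumption):

  static inline void write_hcr_el2_ac04_cpu_23(u64 val)
  {
          asm volatile("dsb nsh\n\t"      /* older insns clear the window */
                       "msr hcr_el2, %0\n\t"
                       "isb"              /* younger insns wait it out */
                       : : "r" (val));
  }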
Signed-off-by: D Scott Phillips <scott@os.amperecomputing.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250513184514.2678288-1-scott@os.amperecomputing.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Mon, 27 Jan 2025 11:58:38 +0000 (11:58 +0000)]
KVM: arm64: Handle TSB CSYNC traps
The architecture introduces a trap for TSB CSYNC that fits in
the same EC as LS64 and PSB CSYNC. Let's deal with it in a similar
way.
It's not that we expect this to be useful any time soon anyway.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Fri, 25 Apr 2025 13:00:01 +0000 (14:00 +0100)]
KVM: arm64: Add FGT descriptors for FEAT_FGT2
Bulk addition of all the FGT2 traps reported with EC == 0x18,
as described in the 2025-03 JSON drop.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Fri, 25 Apr 2025 12:53:18 +0000 (13:53 +0100)]
KVM: arm64: Allow sysreg ranges for FGT descriptors
Just like we allow sysreg ranges for Coarse Grained Trap descriptors,
allow them for Fine Grain Traps as well.
This comes with a warning that not all ranges are suitable for this
particular definition of ranges.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Tue, 22 Apr 2025 20:20:18 +0000 (21:20 +0100)]
KVM: arm64: Add context-switch for FEAT_FGT2 registers
Just like the rest of the FGT registers, perform a switch of the
FGT2 equivalent. This avoids the host configuration leaking into
the guest...
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Fri, 25 Apr 2025 16:42:49 +0000 (17:42 +0100)]
KVM: arm64: Add trap routing for FEAT_FGT2 registers
Similarly to the FEAT_FGT registers, pick the correct FEAT_FGT2
register when a sysreg trap indicates they could be responsible
for the exception.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Tue, 22 Apr 2025 20:16:34 +0000 (21:16 +0100)]
KVM: arm64: Add sanitisation for FEAT_FGT2 registers
Just like the FEAT_FGT registers, treat the FGT2 variant the same
way. This is a large update, but a fairly mechanical one.
The config dependencies are extracted from the 2025-03 JSON drop.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Tue, 22 Apr 2025 18:21:46 +0000 (19:21 +0100)]
KVM: arm64: Add FEAT_FGT2 registers to the VNCR page
The FEAT_FGT2 registers are part of the VNCR page. Describe the
corresponding offsets and add them to the vcpu sysreg enumeration.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Tue, 4 Feb 2025 10:46:41 +0000 (10:46 +0000)]
KVM: arm64: Use HCR_EL2 feature map to drive fixed-value bits
Similarly to other registers, describe which HCR_EL2 bit depends
on which feature, and use this to compute the RES0 status of these
bits.
An additional complexity stems from the status of some bits, such
as E2H and RW, which do not have a RESx status but still take
a fixed value due to implementation choices in KVM.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Sun, 9 Feb 2025 14:51:23 +0000 (14:51 +0000)]
KVM: arm64: Use HCRX_EL2 feature map to drive fixed-value bits
Similarly to other registers, describe which HCRX_EL2 bit depends
on which feature, and use this to compute the RES0 status of these
bits.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Sun, 9 Feb 2025 13:38:35 +0000 (13:38 +0000)]
KVM: arm64: Allow kvm_has_feat() to take variable arguments
In order to be able to write more compact (and easier to read) code,
let kvm_has_feat() and co take variable arguments. This enables
constructs such as:
  #define FEAT_SME	ID_AA64PFR1_EL1, SME, IMP

  if (kvm_has_feat(kvm, FEAT_SME))
          [...]

which is admittedly more readable.
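Under the hood this is plain variadic forwarding, roughly (a sketch; the
underlying checker name is assumed):

  /* FEAT_SME expands to three comma-separated tokens, so both the
   * packed form and the spelled-out form reach the same checker. */
  #define __kvm_has_feat(kvm, reg, field, min) \
          kvm_check_feat(kvm, reg, field, min)  /* assumed checker */
  #define kvm_has_feat(kvm, ...) __kvm_has_feat(kvm, __VA_ARGS__)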
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Sun, 9 Feb 2025 14:45:29 +0000 (14:45 +0000)]
KVM: arm64: Use FGT feature maps to drive RES0 bits
Another benefit of mapping bits to features is that it becomes trivial
to define which bits should be handled as RES0.
Let's apply this principle to the guest's view of the FGT registers.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:35:00 +0000 (11:35 +0100)]
KVM: arm64: Document NV caps and vcpu flags
Describe the two new vcpu flags that control NV, together with
the capabilities that advertise them.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250514103501.2225951-18-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:59 +0000 (11:34 +0100)]
KVM: arm64: Allow userspace to request KVM_ARM_VCPU_EL2*
Since we're (almost) feature complete, let's allow userspace to
request KVM_ARM_VCPU_EL2* by bumping KVM_VCPU_MAX_FEATURES up.
We also now advertise the features to userspace with new capabilities.
It's going to be great...
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20250514103501.2225951-17-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:58 +0000 (11:34 +0100)]
KVM: arm64: nv: Remove dead code from ERET handling
Clearly, this code cannot trigger, since we filter this from the
caller. Drop it.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-16-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:57 +0000 (11:34 +0100)]
KVM: arm64: nv: Plumb TLBI S1E2 into system instruction dispatch
Now that we have to handle TLBI S1E2 in the core code, plumb the
sysinsn dispatch code into it, so that these instructions don't
just UNDEF anymore.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-15-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:56 +0000 (11:34 +0100)]
KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2
A TLBI by VA for S1 must take effect on our pseudo-TLB for VNCR
and potentially knock the fixmap mapping. Even worse, that TLBI
must be able to work cross-vcpu.
For that, we track on a per-VM basis if any VNCR is mapped, using
an atomic counter. Whenever a TLBI S1E2 occurs and this counter
is non-zero, we take the long road all the way back to the core code.
There, we iterate over all vcpus and check whether this particular
invalidation has any damaging effect. If it does, we nuke the pseudo
TLB and the corresponding fixmap.
Yes, this is costly.
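Schematically, the slow path looks like the sketch below (the counter
field and the per-vcpu helpers are illustrative names, not the actual
KVM symbols):

  static void handle_tlbi_s1e2_cross_vcpu(struct kvm *kvm, u64 va, u16 asid)
  {
          struct kvm_vcpu *vcpu;
          unsigned long i;

          /* Fast path: no VNCR page is mapped anywhere in this VM. */
          if (!atomic_read(&kvm->arch.vncr_map_count))
                  return;

          kvm_for_each_vcpu(i, vcpu, kvm) {
                  /* Only nuke state this invalidation can actually reach. */
                  if (vncr_tlb_matches(vcpu, va, asid)) {
                          invalidate_vncr_tlb(vcpu);
                          unmap_vncr_fixmap(vcpu);
                  }
          }
  }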
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-14-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:55 +0000 (11:34 +0100)]
KVM: arm64: nv: Program host's VNCR_EL2 to the fixmap address
Since we now have a way to map the guest's VNCR_EL2 on the host,
we can point the host's VNCR_EL2 to it and go full circle!
Note that we unconditionally assign the fixmap to VNCR_EL2,
irrespective of the guest's version being mapped or not. We want
to take a fault on first access, so the fixmap contains something
guaranteed to be either invalid or a guest mapping.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-13-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:54 +0000 (11:34 +0100)]
KVM: arm64: nv: Handle VNCR_EL2 invalidation from MMU notifiers
During an invalidation triggered by an MMU notifier, we need to
make sure we can drop the *host* mapping that would have been
translated by the stage-2 mapping being invalidated.
For the moment, the invalidation is pretty brutal, as we nuke
the full IPA range, and therefore any VNCR_EL2 mapping.
At some point, we'll be more lightweight, once the code is able
to deal with something more targeted.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-12-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:53 +0000 (11:34 +0100)]
KVM: arm64: nv: Handle mapping of VNCR_EL2 at EL2
Now that we can handle faults triggered through VNCR_EL2, we need
to map the corresponding page at EL2. But where, you'll ask?
Since each CPU in the system can run a vcpu, we need a per-CPU
mapping. For that, we carve a NR_CPUS range in the fixmap, giving
us a per-CPU VA at which to map the guest's VNCR page.
The mapping occurs both on vcpu load and on the back of a fault,
both generating a request that will take care of the mapping.
That mapping will also get dropped on vcpu put.
Yes, this is a bit heavy handed, but it is simple. Eventually,
we may want to have a per-VM, per-CPU mapping, which would avoid
all the TLBI overhead.
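The addressing side of this is simple enough to sketch (FIX_VNCR as a
fixmap base index is an assumption):

  /* One fixmap slot per possible CPU: each physical CPU gets its own
   * VA at which the running vcpu's VNCR page is (un)mapped. */
  static void *vncr_fixmap_addr(int cpu)
  {
          return (void *)__fix_to_virt(FIX_VNCR + cpu);
  }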
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-11-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:52 +0000 (11:34 +0100)]
KVM: arm64: nv: Handle VNCR_EL2-triggered faults
As VNCR_EL2.BADDR contains a VA, it is bound to trigger faults.
These faults can have multiple sources:
- We haven't mapped anything on the host: we need to compute the
resulting translation, populate a TLB, and eventually map
the corresponding page
- The permissions are out of whack: we need to tell the guest about
this state of affairs
Note that the kernel doesn't support S1POE for itself yet, so
the particular case of a VNCR page mapped with no permissions
or with write-only permissions is not correctly handled yet.
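Putting it together, the handler is roughly shaped like the following
sketch (all names illustrative):

  static int handle_vncr_abort(struct kvm_vcpu *vcpu)
  {
          struct s1_walk_result wr;

          /* Walk the guest's S1 tables for VNCR_EL2.BADDR... */
          if (guest_s1_walk(vcpu, vncr_va(vcpu), &wr))
                  return inject_vncr_fault(vcpu, &wr); /* bad permissions */

          populate_vncr_tlb(vcpu, &wr);   /* ...cache the translation... */
          return map_vncr_fixmap(vcpu);   /* ...and map the page at EL2. */
  }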
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-10-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:51 +0000 (11:34 +0100)]
KVM: arm64: nv: Add userspace and guest handling of VNCR_EL2
Plug VNCR_EL2 in the vcpu_sysreg enum, define its RES0/RES1 bits,
and make it accessible to userspace when the VM is configured to
support FEAT_NV2.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-9-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:50 +0000 (11:34 +0100)]
KVM: arm64: nv: Add pseudo-TLB backing VNCR_EL2
FEAT_NV2 introduces an interesting problem for NV, as VNCR_EL2.BADDR
is a virtual address in the EL2&0 (or EL2, but we thankfully ignore
this) translation regime.
As we need to replicate such a mapping in the real EL2, we need
to remember that there is such a translation, and that any
TLBI affecting EL2 can possibly affect it.
It also means that any invalidation driven by an MMU notifier must
be able to shoot down any such mapping.
All in all, we need a data structure that represents this mapping,
and that is extremely close to a TLB. Given that we can only use
one of those per vcpu at any given time, we only allocate one.
No effort is made to keep that structure small. If we need to
start caching multiple of them, we may want to revisit that design
point. But for now, it is kept simple so that we can reason about it.
Oh, and add a braindump of how things are supposed to work, because
I will definitely page this out at some point. Yes, pun intended.
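For orientation, the structure amounts to little more than a one-entry
TLB (a sketch with illustrative field names):

  struct vncr_tlb_entry {
          u64             gva;    /* VA from VNCR_EL2.BADDR */
          phys_addr_t     hpa;    /* host PA backing the translation */
          u16             asid;   /* S1 ASID tag, if non-global (nG) */
          bool            valid;  /* is the cached walk still usable? */
  };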
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-8-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:49 +0000 (11:34 +0100)]
KVM: arm64: nv: Don't adjust PSTATE.M when L2 is nesting
We currently check for HCR_EL2.NV being set to decide whether we
need to repaint PSTATE.M to say EL2 instead of EL1 on exit.
However, this isn't correct when L2 is itself a hypervisor and
L1 has set its own HCR_EL2.NV. That's because we "flatten"
the state and inherit parts of the guest's own setup. In that case,
we shouldn't adjust PSTATE.M, as this is really EL1 for both us
and the guest.
Instead of trying to work out how we ended up with HCR_EL2.NV
being set by introspecting both the host and guest states, use
a per-CPU flag to remember the context (HYP or not), and use that
information to decide whether PSTATE needs tweaking.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-7-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:48 +0000 (11:34 +0100)]
KVM: arm64: nv: Move TLBI range decoding to a helper
As we are about to expand our TLB invalidation capabilities to support
recursive virtualisation, move the decoding of a TLBI by range into
a helper that returns the base, the range and the ASID.
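The operand layout makes such a helper straightforward (a sketch
assuming a 4KiB granule and ignoring the TG and TTL fields; bit
positions follow the Arm ARM's range-TLBI encoding):

  static void decode_range_tlbi(u64 val, u64 *base, u64 *range, u16 *asid)
  {
          u64 num   = (val >> 39) & 0x1f;             /* NUM,   [43:39] */
          u64 scale = (val >> 44) & 0x3;              /* SCALE, [45:44] */

          *asid  = val >> 48;                         /* ASID,  [63:48] */
          *base  = (val & GENMASK_ULL(36, 0)) << 12;  /* BaseADDR, 4K   */
          *range = ((num + 1) << (5 * scale + 1)) * SZ_4K;
  }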
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:47 +0000 (11:34 +0100)]
KVM: arm64: nv: Snapshot S1 ASID tagging information during walk
We currently completely ignore any sort of ASID tagging during a S1
walk, as AT doesn't care about it.
However, such information is required if we are going to create
anything that looks like a TLB from this walk.
Let's capture both the nG and ASID information while walking
the page tables.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:46 +0000 (11:34 +0100)]
KVM: arm64: nv: Extract translation helper from the AT code
The address translation infrastructure is currently pretty tied to
the AT emulation.
However, we also need it for features that require the use of VAs, such
as VNCR_EL2 (and maybe one of these days SPE), meaning that we need
a slightly more generic infrastructure.
Start this by introducing a new helper (__kvm_translate_va()) that
performs a S1 walk for a given translation regime, EL and PAN
settings.
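The shape of the new entry point is roughly the following (a sketch; the
exact upstream signature and the carrier of the regime/EL/PAN settings
may differ):

  /* Software S1 walk of 'va' for a given translation regime, EL and
   * PAN state (carried in wi), depositing the outcome in wr. */
  int __kvm_translate_va(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
                         struct s1_walk_result *wr, u64 va);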
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 14 May 2025 10:34:45 +0000 (11:34 +0100)]
KVM: arm64: nv: Allocate VNCR page when required
If running an NV guest on an ARMv8.4-NV capable system, let's
allocate an additional page that will be used by the hypervisor
to fulfill system register accesses.
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>