Paolo Bonzini [Wed, 12 Mar 2025 11:59:07 +0000 (07:59 -0400)]
Merge branch 'kvm-tdx-finish-initial' into HEAD
This patch ties the remaining loose ends and finally enables TDX guests to
run inside KVM. It implements handling of EPT violation/misconfig and of
several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR,
GetTdVmCallInfo); it also adds a bunch of wrappers in vmx/main.c to
ignore operations not supported by TDX guests(*)
Finally, it introduces documentation for the new APIs that have been
added along the way.
(*) access to CPU state, VMX preemption timer, accesses to TSC offset or
multiplier, LMCE enable/disable, hypercall patching.
Paolo Bonzini [Wed, 12 Mar 2025 11:41:52 +0000 (07:41 -0400)]
Merge branch 'kvm-tdx-interrupts' into HEAD
Introduces support for interrupt handling for TDX guests, including
virtual interrupt injection and VM-Exits caused by vectored events.
Injection
=========
TDX supports non-NMI interrupt injection only by posted interrupt. Posted
interrupt descriptors (PIDs) are allocated in shared memory, KVM
can update them directly. To post pending interrupts in the PID, KVM
can generate a self-IPI with notification vector prior to TD entry.
TDX guest status is protected, KVM can't get the interrupt status of
TDX guest. For now, assume the interrupt is always allowed. A later
patch set will let TDX guests to call TDVMCALL with HLT, which passes
the interrupt block flag, so that whether interrupt is allowed in HLT
will checked against the interrupt block flag.
For NMIs, KVM can request the TDX module to inject a NMI into a TDX vCPU
by setting the PEND_NMI TDVPS field to 1. Following that, KVM can call
TDH.VP.ENTER to run the vCPU and the TDX module will attempt to inject
the NMI as soon as possible. PEND_NMI TDVPS field is a 1-bit filed,
i.e. KVM can only pend one NMI in the TDX module. Also, TDX doesn't
allow KVM to request NMI-window exit directly. When there is already
one NMI pending in the TDX module, i.e. it has not been delivered to
TDX guest yet, if there is NMI pending in KVM, collapse the pending
NMI in KVM into the one pending in the TDX module. Such collapse is OK
considering on X86 bare metal, multiple NMIs could collapse into one NMI,
e.g. when NMI is blocked by SMI. It's OS's responsibility to poll all
NMI sources in the NMI handler to avoid missing handling of some NMI
events. More details can be found in the changelog of the patch "KVM:
TDX: Implement methods to inject NMI".
TDX doesn't support system-management mode (SMM) and system-management
interrupt (SMI) in guest TDs because TDX module doesn't provide a way for
VMM to inject SMI into guest TD or switch guest vCPU mode into SMM.
SMI requests return -ENOTTY similar to CONFIG_KVM_SMM=n. Likewise,
INIT and SIPI events are not used and are blocked for TDX guests;
TDX defines its own vCPU creation and initialization sequence, which
is done on the host via SEAMCALLs at TD build time.
VM-exit for external events
===========================
Similar to the VMX case, external interrupts are with interrupts off:
in the .handle_exit_irqoff() callback for external interrupts and in
the noinstr region for NMIs. Just like VMX, NMI remains blocked after
exiting from TDX guest for NMI-induced exits.
Machine check, which is handled in the .handle_exit_irqoff() callback, is
the only exception type KVM handles for TDX guests. For other exceptions,
because TDX guest state is protected, exceptions in TDX guests can't be
intercepted. TDX VMM isn't supposed to handle these exceptions. Exit to
userspace with KVM_EXIT_EXCEPTION If unexpected exception occurs.
Host SMIs also cause an exit to KVM. This is needed because in SEAM
root mode (TDX module) all interrupts are blocked. An SMI can be "I/O
SMI" or "other SMI". For TDX, there will be no I/O SMI because I/O
instructions inside TDX guest trigger #VE and TDX guest needs to use
TDVMCALL to request VMM to do I/O emulation. The only case of interest
for "other SMI" is an #MC occurring in the guest when MCE-SMI morphing
is enabled in the host firmware. Such "MSMI" is marked by having bit 0
set in the exit qualification; MSMI exits are fatal for the TD and
are eventually handled by the kernel machine check handler (
7911f14
x86/mce: Implement recovery for errors in TDX/SEAM non-root mode),
which marks the page as poisoned. It is not possible right now to
pass machine check exceptions to the guest.
SMIs other than machine check SMIs are handled just by leaving SEAM
root mode and KVM doesn't need to do anything.
Paolo Bonzini [Wed, 12 Mar 2025 11:41:52 +0000 (07:41 -0400)]
Merge branch 'kvm-tdx-userspace-exit' into HEAD
Introduces support for VM exits that are forwarded the host VMM in
userspace. These are initiated from the TDCALL exit code; although
these userspace exits have the same TDX exit code, they result in several
different types of exits to userspace.
When a guest TD issues a TDVMCALL, it
exits to VMM with a new exit reason. The arguments from the guest TD and
return values from the VMM are passed through the guest registers. The
ABI details for the guest TD hypercalls are specified in the TDX GHCI
specification.
There are two types of hypercalls defined in the GHCI specification:
- Standard TDVMCALLs: When input of R10 from guest TD is set to 0, it
indicates that the TDVMCALL sub-function used in R11 is defined in GHCI
specification.
- Vendor-Specific TDVMCALLs: When input of R10 from guest TD is non-zero,
it indicates a vendor-specific TDVMCALL. KVM hypercalls from the guest
follow this interface, using R10 as KVM hypercall number and R11-R14 as
4 arguments. The error code returned in R10.
This series includes basic standard TDVMCALLs that map to existing eixt
reasons:
- TDG.VP.VMCALL<MapGPA> reuses exit reason KVM_EXIT_HYPERCALL with the
hypercall number KVM_HC_MAP_GPA_RANGE.
- TDG.VP.VMCALL<ReportFatalError> reuses exit reason KVM_EXIT_SYSTEM_EVENT
with a new event type KVM_SYSTEM_EVENT_TDX_FATAL.
- TDG.VP.VMCALL<Instruction.IO> reuses exit reason KVM_EXIT_IO.
- TDG.VP.VMCALL<#VE.RequestMMIO> reuses exit reason KVM_EXIT_MMIO.
Notably, handling for TDG.VP.VMCALL<SetupEventNotifyInterrupt> and
TDG.VP.VMCALL<GetQuote> is not included yet.
Paolo Bonzini [Wed, 12 Mar 2025 11:38:46 +0000 (07:38 -0400)]
Merge branch 'kvm-tdx-enter-exit' into HEAD
This series introduces callbacks to facilitate the entry of a TD VCPU
and the corresponding save/restore of host state.
A TD VCPU is entered via the SEAMCALL TDH.VP.ENTER. The TDX Module manages
the save/restore of guest state and, in conjunction with the SEAMCALL
interface, handles certain aspects of host state. However, there are
specific elements of the host state that require additional attention, as
detailed in the Intel TDX ABI documentation for TDH.VP.ENTER.
TDX is quite different from VMX in this regard. For VMX, the host VMM is
heavily involved in restoring, managing and saving guest CPU state, whereas
for TDX this is handled by the TDX Module. In that way, the TDX Module can
protect the confidentiality and integrity of TD CPU state.
The TDX Module does not save/restore all host CPU state because the host
VMM can do it more efficiently and selectively. CPU state referred to
below is host CPU state. Often values are already held in memory so no
explicit save is needed, and restoration may not be needed if the kernel
is not using a feature.
TDX does not support PAUSE-loop exiting. According to the TDX Module
Base arch. spec., hypercalls are expected to be used instead. Note that
the Linux TDX guest supports existing hypercalls via TDG.VP.VMCALL.
This series requires TDX module 1.5.06.00.0744, or later, due to removal
of the workarounds for the lack of the NO_RBP_MOD feature required by the
kernel. NO_RBP_MOD is now required.
Paolo Bonzini [Thu, 6 Mar 2025 16:21:46 +0000 (11:21 -0500)]
Merge branch 'kvm-tdx-mmu' into HEAD
This series picks up from commit
86eb1aef7279 ("Merge branch
'kvm-mirror-page-tables' into HEAD", 2025-01-20), which focused on
changes to the generic x86 parts of the KVM MMU code, and adds support
for TDX's secure page tables to the Intel side of KVM.
Confidential computing solutions have concepts of private and shared
memory. Often the guest accesses either private or shared memory via a bit
in the guest PTE. Solutions like SEV treat this bit more like a permission
bit, where solutions like TDX and ARM CCA treat it more like a GPA bit. In
the latter case, the host maps private memory in one half of the address
space and shared in another. For TDX these two halves are mapped by
different EPT roots. The private half (also called Secure EPT in Intel
documentation) gets managed by the privileged TDX Module. The shared half
is managed by the untrusted part of the VMM (KVM).
In addition to the separate roots for private and shared, there are
limitations on what operations can be done on the private side. Like SNP,
TDX wants to protect against protected memory being reset or otherwise
scrambled by the host. In order to prevent this, the guest has to take
specific action to “accept” memory after changes are made by the VMM to
the private EPT. This prevents the VMM from performing many of the usual
memory management operations that involve zapping and refaulting memory.
The private memory also is always RWX and cannot have VMM specified cache
attribute attributes applied.
TDX memory implementation
=========================
Creating shared EPT
-------------------
Shared EPT handling is relatively simple compared to private memory. It is
managed from within KVM. The main differences between shared EPT and EPT
in a normal VM are that the root is set with a TDVMCS field (via SEAMCALL),
and that the GFN specified in the memslot perspective needs to be mapped
at an offset in the EPT. For the former, this series plumbs in the
load_mmu_pgd() operation to the correct field for the shared EPT. For the
latter, previous patches have laid the groundwork for mapping so called
“direct roots” roots at an offset specified in kvm->arch.gfn_direct_bits.
Creating private EPT
--------------------
In previous patches, the concept of “mirrored roots” were introduced. Such
roots maintain a KVM side “mirror” of the “external” EPT by keeping an
unmapped EPT tree within the KVM MMU code. When changing these mirror
EPTs, the KVM MMU code calls out via x86_ops to update the external EPT.
This series adds implementations for these “external” ops for TDX to
create and manage “private” memory via TDX module APIs.
Managing S-EPT with the TDX Module
----------------------------------
The TDX module allows the TD’s private memory to be managed via SEAMCALLs.
This management consists of operating on two internal elements:
1. The private EPT, which the TDX module calls the S-EPT. It maps the
actual mapped, private half of the GPA space using an EPT tree.
2. The HKID, which represents private encryption keys used for encrypting
TD memory. The CPU doesn’t guarantee cache coherency between these
encryption keys, so memory that is encrypted with one of these keys
needs to be reclaimed for use on the host in special ways.
This series will primarily focus on the SEAMCALLs for managing the private
EPT. Consideration of the HKID is needed for when the TD is torn down.
Populating TDX Private memory
-----------------------------
TDX allows the EPT mapping the TD's private memory to be modified in
limited ways. There are SEAMCALLs for building and tearing down the EPT
tree, as well as mapping pages into the private EPT.
As for building and tearing down the EPT page tables, it is relatively
simple. There are SEAMCALLs for installing and removing them. However, the
current implementation only supports adding private EPT page tables, and
leaves them installed for the lifetime of the TD. For teardown, the
details are discussed in a later section.
As for populating and zapping private SPTE, there are SEAMCALLs for this
as well. The zapping case will be described in detail later. As for the
populating case, there are two categories: before TD is finalized and
after TD is finalized. Both of these scenarios go through the TDP MMU map
path. The changes done previously to introduce “mirror” and “external”
page tables handle directing SPTE installation operations through the
set_external_spte() op.
In the “after” case, the TDX set_external_spte() handler simply calls a
SEAMCALL (TDX.MEM.PAGE.AUG).
For the before case, it is a bit more complicated as it requires both
setting the private SPTE *and* copying in the initial contents of the page
at the same time. For TDX this is done via the KVM_TDX_INIT_MEM_REGION
ioctl, which is effectively the kvm_gmem_populate() operation.
For SNP, the private memory can be pre-populated first, and faulted in
later like normal. But for TDX these need to both happen both at the same
time and the setting of the private SPTE needs to happen in a different
way than the “after” case described above. It needs to use the
TDH.MEM.SEPT.ADD SEAMCALL which does both the copying in of the data and
setting the SPTE.
Without extensive modification to the fault path, it’s not possible
utilize this callback from the set_external_spte() handler because it the
source page for the data to be copied in is not known deep down in this
callchain. So instead the post-populate callback does a three step
process.
1. Pre-fault the memory into the mirror EPT, but have the
set_external_spte() not make any SEAMCALLs.
2. Check that the page is still faulted into the mirror EPT under read
mmu_lock that is held over this and the following step.
3. Call TDH.MEM.SEPT.ADD with the HPA of the page to copy data from, and
the private page installed in the mirror EPT to use for the private
mapping.
The scheme involves some assumptions about the operations that might
operate on the mirrored EPT before the VM is finalized. It assumes that no
other memory will be faulted into the mirror EPT, that is not also added
via TDH.MEM.SEPT.ADD). If this is violated the KVM MMU may not see private
memory faulted in there later and so not make the proper external spte
callbacks. To check this, KVM enforces that the number of
pre-faulted pages is the same as the number of pages added via
KVM_TDX_INIT_MEM_REGION.
TDX TLB flushing
----------------
For TDX, TLB flushing needs to happen in different ways depending on
whether private and/or shared EPT needs to be flushed. Shared EPT can be
flushed like normal EPT with INVEPT. To avoid reading TD's EPTP out from
TDX module, this series flushes shared EPT with type 2 INVEPT. Private TLB
entries can be flushed this way too (via type 2). However, since the TDX
module needs to enforce some guarantees around which private memory is
mapped in the TD, it requires these operations to be done in special ways
for private memory.
For flushing private memory, two methods are possible. The simple one
is the TDH.VP.FLUSH SEAMCALL; this flush is of the INVEPT type 1 variety
(i.e. mappings associated with the TD).
The second method is part of a sequence of SEAMCALLs for removing a guest
page. The sequence looks like:
1. TDH.MEM.RANGE.BLOCK - Remove RWX bits from entry (similar to KVM’s zap).
2. TDH.MEM.TRACK - Increment the TD TLB epoch, which is a per-TD counter
3. Kick off all vCPUs - In order to force them to have to re-enter.
4. TDH.MEM.PAGE.REMOVE - Actually remove the page and make it available for
other use.
5. TDH.VP.ENTER - On re-entering TDX module will see the epoch is
incremented and flush the TLB.
On top of this, during TDX module init TDH.SYS.LP.INIT (which is used
to online a CPU for TDX usage) invokes INVEPT to flush all mappings in
the TLB.
During runtime, for normal (TDP MMU, non-nested) guests, KVM will do a TLB
flushes in 4 scenarios:
(1) kvm_mmu_load()
After EPT is loaded, call kvm_x86_flush_tlb_current() to invalidate
TLBs for current vCPU loaded EPT on current pCPU.
(2) Loading vCPU to a new pCPU
Send request KVM_REQ_TLB_FLUSH to current vCPU, the request handler
will call kvm_x86_flush_tlb_all() to flush all EPTs assocated with the
new pCPU.
(3) When EPT mapping has changed (after removing or permission reduction)
(e.g. in kvm_flush_remote_tlbs())
Send request KVM_REQ_TLB_FLUSH to all vCPUs by kicking all them off,
the request handler on each vCPU will call kvm_x86_flush_tlb_all() to
invalidate TLBs for all EPTs associated with the pCPU.
(4) When EPT changes only affects current vCPU, e.g. virtual apic mode
changed.
Send request KVM_REQ_TLB_FLUSH_CURRENT, the request handler will call
kvm_x86_flush_tlb_current() to invalidate TLBs for current vCPU loaded
EPT on current pCPU.
Only the first 3 are relevant to TDX. They are implemented as follows.
(1) kvm_mmu_load()
Only the shared EPT root is loaded in this path. The TDX module does
not require any assurances about the operation, so the
flush_tlb_current()->ept_sync_global() can be called as normal.
(2) vCPU load
When a vCPU migrates to a new logical processor, it has to be flushed
on the *old* pCPU, unlike normal VMs where the INVEPT is executed on
the new pCPU to remove stale mappings from previous usage of the same
EPTP on the new pCPU. The TDX behavior comes from a requirement
that a vCPU can only be associated with one pCPU at at time. This
flush happens via an IPI that invokes TDH.VP.FLUSH SEAMCALL, during
the vcpu_load callback.
(3) Removing a private SPTE
This is the more complicated flow. It is done in a simple way for now
and is especially inefficient during VM teardown. The plan is to get a
basic functional version working and optimize some of these flows
later.
When a private page mapping is removed, the core MMU code calls the
newly remove_external_spte() op, and flushes the TLB on all vCPUs. But
TDX can’t rely on doing that for private memory, so it has it’s own
process for making sure the private page is removed. This flow
(TDH.MEM.RANGE.BLOCK, TDH.MEM.TRACK, TDH.MEM.PAGE.REMOVE) is done
withing the remove_external_spte() implementation as described in the
“TDX TLB flushing” section above.
After that, back in the core MMU code, KVM will call
kvm_flush_remote_tlbs*() resulting in an INVEPT. Despite that, when
the vCPUs re-enter (TDH.VP.ENTER) the TD, the TDX module will do
another INVEPT for its own reassurance.
Private memory teardown
-----------------------
Tearing down private memory involves reclaiming three types of resources
from the TDX module:
1. TD’s HKID
To reclaim the TD’s HKID, no mappings may be mapped with it.
2. Private guest pages (mapped with HKID)
3. Private page tables that map private pages (mapped with HKID)
From the TDX module’s perspective, to reclaim guest private pages they
need to be prevented from be accessed via the HKID (unmapped and TLB
flushed), their HKID associated cachelines need to be flushed, and
they need to be marked as no longer use by the TD in the TDX modules
internal tracking (PAMT)
During runtime private PTEs can be zapped as part of memslot deletion or
when memory coverts from shared to private, but private page tables and
HKIDs are not torn down until the TD is being destructed. The means the
operation to zap private guest mapped pages needs to do the required cache
writeback under the assumption that other vCPU’s may be active, but the
PTs do not.
TD teardown resource reclamation
--------------------------------
The code that does the TD teardown is organized such that when an HKID is
reclaimed:
1. vCPUs will no longer enter the TD
2. The TLB is flushed on all CPUs
3. The HKID associated cachelines have been flushed.
So at that point most of the steps needed to reclaim TD private pages and
page tables have already been done and the reclaim operation only needs to
update the TDX module’s tracking of page ownership. For simplicity each
operation only supports one scenario: before or after HKID reclaim. Since
zapping and reclaiming private pages has to function during runtime for
memslot deletion and converting from shared to private, the TD teardown is
arranged so this happens before HKID reclaim. Since private page tables
are never torn down during TD runtime, they can happen in a simpler and
more efficient way after HKID reclaim. The private page reclaim is
initiated from the kvm fd release. The callchain looks like this:
do_exit
|->exit_mm --> tdx_mmu_release_hkid() was called here previously in v19
|->exit_files
|->1.release vcpu fd
|->2.kvm_gmem_release
| |->kvm_gmem_invalidate_begin --> unmap all leaf entries, causing
| zapping of private guest pages
|->3.release kvmfd
|->kvm_destroy_vm
|->kvm_arch_pre_destroy_vm
| | kvm_x86_call(vm_pre_destroy)(kvm) -->tdx_mmu_release_hkid()
|->kvm_arch_destroy_vm
|->kvm_unload_vcpu_mmus
| kvm_destroy_vcpus(kvm)
| |->kvm_arch_vcpu_destroy
| |->kvm_x86_call(vcpu_free)(vcpu)
| | kvm_mmu_destroy(vcpu) -->unref mirror root
| kvm_mmu_uninit_vm(kvm) --> mirror root ref is 1 here,
| zap private page tables
| static_call_cond(kvm_x86_vm_destroy)(kvm);
Paolo Bonzini [Thu, 6 Mar 2025 16:06:04 +0000 (11:06 -0500)]
Merge branch 'kvm-tdx-initialization' into HEAD
This series kicks off the actual interaction of KVM with the TDX module.
This series encompasses the basic setup for using the TDX module from KVM,
and the creation of TD VMs and vCPUs.
The TDX Module is a software component that runs in a special CPU mode
called SEAM (Secure Arbitration Mode). Loading it is mostly handled
outside of KVM by the core kernel. Once it’s loaded KVM can interact with
the TDX Module via a new instruction called SEAMCALL to virtualize a TD
guests. This instruction can be used to make various types of seamcalls,
with names organized into a hierarchy. The format is TDH.[AREA].[ACTION],
where “TDH” stands for “Trust Domain Host”, and differentiates from
another set of calls that can be done by the guest “TDG”. The KVM relevant
areas of SEAMCALLs are:
SYS – TDX module management, static metadata reading.
MNG – TD management. VM scoped things that operate on a TDX module
controlled structure called the TDCS.
VP – vCPU management. vCPU scoped things that operate on TDX module
controlled structures called the TDVPS.
PHYMEM - Operations related to physical memory management (page
reclaiming, cache operations, etc).
This series introduces some TDX specific KVM APIs and stops short of
fully “finalizing” the creation of a TD VM. The part of initializing
a guest where initial private memory is loaded is left to a separate
MMU related series.
Isaku Yamahata [Thu, 27 Feb 2025 01:20:21 +0000 (09:20 +0800)]
Documentation/virt/kvm: Document on Trust Domain Extensions (TDX)
Add documentation to Intel Trusted Domain Extensions (TDX) support.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-21-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 10 Dec 2024 00:49:43 +0000 (08:49 +0800)]
KVM: TDX: Make TDX VM type supported
Now all the necessary code for TDX is in place, it's ready to run TDX
guest. Advertise the VM type of KVM_X86_TDX_VM so that the user space
VMM like QEMU can start to use it.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" v2:
- No change.
TDX "the rest" v1:
- Move down to the end of patch series.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Mon, 24 Feb 2025 07:10:39 +0000 (15:10 +0800)]
KVM: TDX: KVM: TDX: Always honor guest PAT on TDX enabled guests
Always honor guest PAT in KVM-managed EPTs on TDX enabled guests by
making self-snoop feature a hard dependency for TDX and making quirk
KVM_X86_QUIRK_IGNORE_GUEST_PAT not a valid quirk once TDX is enabled.
The quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT only affects memory type of
KVM-managed EPTs. For the TDX-module-managed private EPT, memory type is
always forced to WB now.
Honoring guest PAT in KVM-managed EPTs ensures KVM does not invoke
kvm_zap_gfn_range() when attaching/detaching non-coherent DMA devices,
which would cause mirrored EPTs for TDs to be zapped, leading to the
TDX-module-managed private EPT being incorrectly zapped.
As a new feature, TDX always comes with support for self-snoop, and does
not have to worry about unmodifiable but buggy guests. So, simply ignore
KVM_X86_QUIRK_IGNORE_GUEST_PAT on TDX guests just like kvm-amd.ko already
does.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20250224071039.31511-1-yan.y.zhao@intel.com>
[Only apply to TDX guests. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 3 Mar 2025 16:31:50 +0000 (11:31 -0500)]
KVM: x86: remove shadow_memtype_mask
The IGNORE_GUEST_PAT quirk is inapplicable, and thus always-disabled,
if shadow_memtype_mask is zero. As long as vmx_get_mt_mask is not
called for the shadow paging case, there is no need to consult
shadow_memtype_mask and it can be removed altogether.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Mon, 24 Feb 2025 07:09:45 +0000 (15:09 +0800)]
KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.
On AMD platforms, KVM always honors guest PAT. On Intel however there are
two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop
is not supported. Second, UC access on certain Intel platforms can be very
slow[1] and honoring guest PAT on those platforms may break some old
guests that accidentally specify video RAM as UC. Those old guests may
never expect the slowness since KVM always forces WB previously. See [2].
So, introduce a quirk that KVM can enable by default on all Intel platforms
to avoid breaking old unmodifiable guests. Newer userspace can disable this
quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail
if self-snoop is not supported, i.e. if KVM cannot obey the wish.
The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is
another example of a quirk that is sometimes automatically disabled.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com
Message-ID: <
20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
no-self-snoop cases, as well as to remove the shadow_memtype_mask check
from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Mon, 24 Feb 2025 07:08:32 +0000 (15:08 +0800)]
KVM: x86: Introduce supported_quirks to block disabling quirks
Introduce supported_quirks in kvm_caps to store platform-specific force-enabled
quirks.
No functional changes intended.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20250224070832.31394-1-yan.y.zhao@intel.com>
[Remove unsupported quirks at KVM_ENABLE_CAP time. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 3 Mar 2025 16:18:38 +0000 (11:18 -0500)]
KVM: x86: Allow vendor code to disable quirks
In some cases, the handling of quirks is split between platform-specific
code and generic code, or it is done entirely in generic code, but the
relevant bug does not trigger on some platforms; for example,
this will be the case for "ignore guest PAT". Allow unaffected vendor
modules to disable handling of a quirk for all VMs via a new entry in
kvm_caps.
Such quirks remain available in KVM_CAP_DISABLE_QUIRKS2, because that API
tells userspace that KVM *knows* that some of its past behavior was bogus
or just undesirable. In other words, it's plausible for userspace to
refuse to run if a quirk is not listed by KVM_CAP_DISABLE_QUIRKS2, so
preserve that and make it part of the API.
As an example, mark KVM_X86_QUIRK_CD_NW_CLEARED as auto-disabled on
Intel systems.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 3 Mar 2025 14:09:37 +0000 (09:09 -0500)]
KVM: x86: do not allow re-enabling quirks
Allowing arbitrary re-enabling of quirks puts a limit on what the
quirks themselves can do, since you cannot assume that the quirk
prevents a particular state. More important, it also prevents
KVM from disabling a quirk at VM creation time, because userspace
can always go back and re-enable that.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Binbin Wu [Thu, 27 Feb 2025 01:20:19 +0000 (09:20 +0800)]
KVM: TDX: Enable guest access to MTRR MSRs
Allow TDX guests to access MTRR MSRs as what KVM does for normal VMs, i.e.,
KVM emulates accesses to MTRR MSRs, but doesn't virtualize guest MTRR
memory types.
TDX module exposes MTRR feature to TDX guests unconditionally. KVM needs
to support MTRR MSRs accesses for TDX guests to match the architectural
behavior.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-19-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:18 +0000 (09:20 +0800)]
KVM: TDX: Add a method to ignore hypercall patching
Because guest TD memory is protected, VMM patching guest binary for
hypercall instruction isn't possible. Add a method to ignore hypercall
patching. Note: guest TD kernel needs to be modified to use
TDG.VP.VMCALL for hypercall.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-18-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:17 +0000 (09:20 +0800)]
KVM: TDX: Ignore setting up mce
Because vmx_set_mce function is VMX specific and it cannot be used for TDX.
Add vt stub to ignore setting up mce for TDX.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-17-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:16 +0000 (09:20 +0800)]
KVM: TDX: Add methods to ignore accesses to TSC
TDX protects TDX guest TSC state from VMM. Implement access methods to
ignore guest TSC.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-16-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:15 +0000 (09:20 +0800)]
KVM: TDX: Add methods to ignore VMX preemption timer
TDX doesn't support VMX preemption timer. Implement access methods for VMM
to ignore VMX preemption timer.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-15-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:14 +0000 (09:20 +0800)]
KVM: TDX: Add method to ignore guest instruction emulation
Skip instruction emulation and let the TDX guest retry for MMIO emulation
after installing the MMIO SPTE with suppress #VE bit cleared.
TDX protects TDX guest state from VMM, instructions in guest memory cannot
be emulated. MMIO emulation is the only case that triggers the instruction
emulation code path for TDX guest.
The MMIO emulation handling flow as following:
- The TDX guest issues a vMMIO instruction. (The GPA must be shared and is
not covered by KVM memory slot.)
- The default SPTE entry for shared-EPT by KVM has suppress #VE bit set. So
EPT violation causes TD exit to KVM.
- Trigger KVM page fault handler and install a new SPTE with suppress #VE
bit cleared.
- Skip instruction emulation and return X86EMU_RETRY_INSTR to let the vCPU
retry.
- TDX guest re-executes the vMMIO instruction.
- TDX guest gets #VE because KVM has cleared #VE suppress bit.
- TDX guest #VE handler converts MMIO into TDG.VP.VMCALL<MMIO>
Return X86EMU_RETRY_INSTR in the callback check_emulate_instruction() for
TDX guests to retry the MMIO instruction. Also, the instruction emulation
handling will be skipped, so that the callback check_intercept() will never
be called for TDX guest.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:13 +0000 (09:20 +0800)]
KVM: TDX: Add methods to ignore accesses to CPU state
TDX protects TDX guest state from VMM. Implement access methods for TDX
guest state to ignore them or return zero. Because those methods can be
called by kvm ioctls to set/get cpu registers, they don't have KVM_BUG_ON.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:12 +0000 (09:20 +0800)]
KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
Implement TDG.VP.VMCALL<GetTdVmCallInfo> hypercall. If the input value is
zero, return success code and zero in output registers.
TDG.VP.VMCALL<GetTdVmCallInfo> hypercall is a subleaf of TDG.VP.VMCALL to
enumerate which TDG.VP.VMCALL sub leaves are supported. This hypercall is
for future enhancement of the Guest-Host-Communication Interface (GHCI)
specification. The GHCI version of 344426-001US defines it to require
input R12 to be zero and to return zero in output registers, R11, R12, R13,
and R14 so that guest TD enumerates no enhancement.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:11 +0000 (09:20 +0800)]
KVM: TDX: Enable guest access to LMCE related MSRs
Allow TDX guest to configure LMCE (Local Machine Check Event) by handling
MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL.
MCE and MCA are advertised via cpuid based on the TDX module spec. Guest
kernel can access IA32_FEAT_CTL to check whether LMCE is opted-in by the
platform or not. If LMCE is opted-in by the platform, guest kernel can
access IA32_MCG_EXT_CTL to enable/disable LMCE.
Handle MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL for TDX guests to avoid
failure when a guest accesses them with TDG.VP.VMCALL<MSR> on #VE. E.g.,
Linux guest will treat the failure as a #GP(0).
Userspace VMM may not opt-in LMCE by default, e.g., QEMU disables it by
default, "-cpu lmce=on" is needed in QEMU command line to opt-in it.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:10 +0000 (09:20 +0800)]
KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and
wire up KVM backend functions.
For complete_emulated_msr() callback, instead of injecting #GP on error,
implement tdx_complete_emulated_msr() to set return code on error. Also
set return value on MSR read according to the values from kvm x86
registers.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250227012021.
1778144-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:09 +0000 (09:20 +0800)]
KVM: TDX: Implement callbacks for MSR operations
Add functions to implement MSR related callbacks, .set_msr(), .get_msr(),
and .has_emulated_msr(), for preparation of handling hypercalls from TDX
guest for PV RDMSR and WRMSR. Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX.
There are three classes of MSR virtualization for TDX.
- Non-configurable: TDX module directly virtualizes it. VMM can't configure
it, the value set by KVM_SET_MSRS is ignored.
- Configurable: TDX module directly virtualizes it. VMM can configure it at
VM creation time. The value set by KVM_SET_MSRS is used.
- #VE case: TDX guest would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>
and VMM handles the MSR hypercall. The value set by KVM_SET_MSRS is used.
For the MSRs belonging to the #VE case, the TDX module injects #VE to the
TDX guest upon RDMSR or WRMSR. The exact list of such MSRs is defined in
TDX Module ABI Spec.
Upon #VE, the TDX guest may call TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>,
which are defined in GHCI (Guest-Host Communication Interface) so that the
host VMM (e.g. KVM) can virtualize the MSRs.
TDX doesn't allow VMM to configure interception of MSR accesses. Ignore
KVM_REQ_MSR_FILTER_CHANGED for TDX guest. If the userspace has set any
MSR filters, it will be applied when handling
TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> in a later patch.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250227012021.
1778144-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:08 +0000 (09:20 +0800)]
KVM: x86: Move KVM_MAX_MCE_BANKS to header file
Move KVM_MAX_MCE_BANKS to header file so that it can be used for TDX in
a future patch.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:07 +0000 (09:20 +0800)]
KVM: TDX: Handle TDX PV HLT hypercall
Handle TDX PV HLT hypercall and the interrupt status due to it.
TDX guest status is protected, KVM can't get the interrupt status
of TDX guest and it assumes interrupt is always allowed unless TDX
guest calls TDVMCALL with HLT, which passes the interrupt blocked flag.
If the guest halted with interrupt enabled, also query pending RVI by
checking bit 0 of TD_VCPU_STATE_DETAILS_NON_ARCH field via a seamcall.
Update vt_interrupt_allowed() for TDX based on interrupt blocked flag
passed by HLT TDVMCALL. Do not wakeup TD vCPU if interrupt is blocked
for VT-d PI.
For NMIs, KVM cannot determine the NMI blocking status for TDX guests,
so KVM always assumes NMIs are not blocked. In the unlikely scenario
where a guest invokes the PV HLT hypercall within an NMI handler, this
could result in a spurious wakeup. The guest should implement the PV
HLT hypercall within a loop if it truly requires no interruptions, since
NMI could be unblocked by an IRET due to an exception occurring before
the PV HLT is executed in the NMI handler.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:06 +0000 (09:20 +0800)]
KVM: TDX: Handle TDX PV CPUID hypercall
Handle TDX PV CPUID hypercall for the CPUIDs virtualized by VMM
according to TDX Guest Host Communication Interface (GHCI).
For TDX, most CPUID leaf/sub-leaf combinations are virtualized by
the TDX module while some trigger #VE. On #VE, TDX guest can issue
TDG.VP.VMCALL<INSTRUCTION.CPUID> (same value as EXIT_REASON_CPUID)
to request VMM to emulate CPUID operation.
Morph PV CPUID hypercall to EXIT_REASON_CPUID and wire up to the KVM
backend function.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rewrite changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Thu, 27 Feb 2025 01:20:05 +0000 (09:20 +0800)]
KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever
tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove() encounters
contention, since the page removal path does not expect error and is less
sensitive to the performance penalty caused by kicking off vCPUs.
Although KVM has protected SEPT zap-related SEAMCALLs with kvm->mmu_lock,
KVM may still encounter TDX_OPERAND_BUSY due to the contention in the TDX
module.
- tdh_mem_track() may contend with tdh_vp_enter().
- tdh_mem_range_block()/tdh_mem_page_remove() may contend with
tdh_vp_enter() and TDCALLs.
Resources SHARED users EXCLUSIVE users
------------------------------------------------------------
TDCS epoch tdh_vp_enter tdh_mem_track
------------------------------------------------------------
SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation)
tdh_mem_range_block
------------------------------------------------------------
SEPT entry tdh_mem_range_block (Host lock)
tdh_mem_page_remove (Host lock)
tdg_mem_page_accept (Guest lock)
tdg_mem_page_attr_rd (Guest lock)
tdg_mem_page_attr_wr (Guest lock)
Use a TDX specific per-VM flag wait_for_sept_zap along with
KVM_REQ_OUTSIDE_GUEST_MODE to kick off vCPUs and prevent them from entering
TD, thereby avoiding the potential contention. Apply the kick-off and no
vCPU entering only after each SEAMCALL busy error to minimize the window of
no TD entry, as the contention due to 0-step mitigation or TDCALLs is
expected to be rare.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20250227012021.
1778144-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:57 +0000 (09:47 +0800)]
KVM: TDX: Handle EXIT_REASON_OTHER_SMI
Handle VM exit caused by "other SMI" for TDX, by returning back to
userspace for Machine Check System Management Interrupt (MSMI) case or
ignoring it and resume vCPU for non-MSMI case.
For VMX, SMM transition can happen in both VMX non-root mode and VMX
root mode. Unlike VMX, in SEAM root mode (TDX module), all interrupts
are blocked. If an SMI occurs in SEAM non-root mode (TD guest), the SMI
causes VM exit to TDX module, then SEAMRET to KVM. Once it exits to KVM,
SMI is delivered and handled by kernel handler right away.
An SMI can be "I/O SMI" or "other SMI". For TDX, there will be no I/O SMI
because I/O instructions inside TDX guest trigger #VE and TDX guest needs
to use TDVMCALL to request VMM to do I/O emulation.
For "other SMI", there are two cases:
- MSMI case. When BIOS eMCA MCE-SMI morphing is enabled, the #MC occurs in
TDX guest will be delivered as an MSMI. It causes an
EXIT_REASON_OTHER_SMI VM exit with MSMI (bit 0) set in the exit
qualification. On VM exit, TDX module checks whether the "other SMI" is
caused by an MSMI or not. If so, TDX module marks TD as fatal,
preventing further TD entries, and then completes the TD exit flow to KVM
with the TDH.VP.ENTER outputs indicating TDX_NON_RECOVERABLE_TD. After
TD exit, the MSMI is delivered and eventually handled by the kernel
machine check handler (
7911f145de5f x86/mce: Implement recovery for
errors in TDX/SEAM non-root mode), i.e., the memory page is marked as
poisoned and it won't be freed to the free list when the TDX guest is
terminated. Since the TDX guest is dead, follow other non-recoverable
cases, exit to userspace.
- For non-MSMI case, KVM doesn't need to do anything, just continue TDX
vCPU execution.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250222014757.897978-17-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Thu, 27 Feb 2025 01:20:04 +0000 (09:20 +0800)]
KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY
Retry locally in the TDX EPT violation handler for private memory to reduce
the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with
tdh_vp_enter().
TDX EPT violation installs private pages via tdh_mem_sept_add() and
tdh_mem_page_aug(). The two may have contention with tdh_vp_enter() or
TDCALLs.
Resources SHARED users EXCLUSIVE users
------------------------------------------------------------
SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation)
tdh_mem_page_aug
------------------------------------------------------------
SEPT entry tdh_mem_sept_add (Host lock)
tdh_mem_page_aug (Host lock)
tdg_mem_page_accept (Guest lock)
tdg_mem_page_attr_rd (Guest lock)
tdg_mem_page_attr_wr (Guest lock)
Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and
TDCALLs may be removed in future TDX module, their contention with
tdh_vp_enter() due to 0-step mitigation still persists.
The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER,
which works as follows:
0. Each TDH.VP.ENTER records the guest RIP on TD entry.
1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it
checks if the guest RIP is the same as last guest RIP on TD entry.
-if yes, it means the EPT violation is caused by the same instruction
that caused the last VM exit.
Then, the TDX module increases the guest RIP no-progress count.
When the count increases from 0 to the threshold (currently 6),
the TDX module records the faulting GPA into a
last_epf_gpa_list.
-if no, it means the guest RIP has made progress.
So, the TDX module resets the RIP no-progress count and the
last_epf_gpa_list.
2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on
TD entry) checks if the last_epf_gpa_list is empty.
-if yes, TD entry continues without acquiring the lock on the SEPT tree.
-if no, it triggers the 0-step mitigation by acquiring the exclusive
lock on SEPT tree, walking the EPT tree to check if all page
faults caused by the GPAs in the last_epf_gpa_list have been
resolved before continuing TD entry.
Since KVM TDP MMU usually re-enters guest whenever it exits to userspace
(e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY, it is possible for a
tdh_vp_enter() to be called more than the threshold count before a page
fault is addressed, triggering contention when tdh_vp_enter() attempts to
acquire exclusive lock on SEPT tree.
Retry locally in TDX EPT violation handler to reduce the count of invoking
tdh_vp_enter(), hence reducing the possibility of its contention with
tdh_mem_sept_add()/tdh_mem_page_aug(). However, the 0-step mitigation and
the contention are still not eliminated due to KVM_EXIT_MEMORY_FAULT,
signals/interrupts, and cases when one instruction faults more GFNs than
the threshold count.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20250227012021.
1778144-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:56 +0000 (09:47 +0800)]
KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT exits for TDX.
NMI Handling: Just like the VMX case, NMI remains blocked after exiting
from TDX guest for NMI-induced exits [*]. Handle NMI-induced exits for
TDX guests in the same way as they are handled for VMX guests, i.e.,
handle NMI in tdx_vcpu_enter_exit() by calling the vmx_handle_nmi()
helper.
Interrupt and Exception Handling: Similar to the VMX case, external
interrupts and exceptions (machine check is the only exception type
KVM handles for TDX guests) are handled in the .handle_exit_irqoff()
callback.
For other exceptions, because TDX guest state is protected, exceptions in
TDX guests can't be intercepted. TDX VMM isn't supposed to handle these
exceptions. If unexpected exception occurs, exit to userspace with
KVM_EXIT_EXCEPTION.
For external interrupt, increase the statistics, same as the VMX case.
[*]: Some old TDX modules have a bug which makes NMI unblocked after
exiting from TDX guest for NMI-induced exits. This could potentially
lead to nested NMIs: a new NMI arrives when KVM is manually calling the
host NMI handler. This is an architectural violation, but it doesn't
have real harm until FRED is enabled together with TDX (for non-FRED,
the host NMI handler can handle nested NMIs). Given this is rare to
happen and has no real harm, ignore this for the initial TDX support.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250222014757.897978-16-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Thu, 27 Feb 2025 01:20:03 +0000 (09:20 +0800)]
KVM: TDX: Detect unexpected SEPT violations due to pending SPTEs
Detect SEPT violations that occur when an SEPT entry is in PENDING state
while the TD is configured not to receive #VE on SEPT violations.
A TD guest can be configured not to receive #VE by setting SEPT_VE_DISABLE
to 1 in tdh_mng_init() or modifying pending_ve_disable to 1 in TDCS when
flexible_pending_ve is permitted. In such cases, the TDX module will not
inject #VE into the TD upon encountering an EPT violation caused by an SEPT
entry in the PENDING state. Instead, TDX module will exit to VMM and set
extended exit qualification type to PENDING_EPT_VIOLATION and exit
qualification bit 6:3 to 0.
Since #VE will not be injected to such TDs, they are not able to be
notified to accept a GPA. TD accessing before accepting a private GPA
is regarded as an error within the guest.
Detect such guest error by inspecting the (extended) exit qualification
bits and make such VM dead.
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Sat, 22 Feb 2025 01:47:55 +0000 (09:47 +0800)]
KVM: VMX: Add a helper for NMI handling
Add a helper to handles NMI exit.
TDX handles the NMI exit the same as VMX case. Add a helper to share the
code with TDX, expose the helper in common.h.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-15-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 27 Feb 2025 01:20:02 +0000 (09:20 +0800)]
KVM: TDX: Handle EPT violation/misconfig exit
For TDX, on EPT violation, call common __vmx_handle_ept_violation() to
trigger x86 MMU code; on EPT misconfiguration, bug the VM since it
shouldn't happen.
EPT violation due to instruction fetch should never be triggered from
shared memory in TDX guest. If such EPT violation occurs, treat it as
broken hardware.
EPT misconfiguration shouldn't happen on neither shared nor secure EPT for
TDX guests.
- TDX module guarantees no EPT misconfiguration on secure EPT. Per TDX
module v1.5 spec section 9.4 "Secure EPT Induced TD Exits":
"By design, since secure EPT is fully controlled by the TDX module, an
EPT misconfiguration on a private GPA indicates a TDX module bug and is
handled as a fatal error."
- For shared EPT, the MMIO caching optimization, which is the only case
where current KVM configures EPT entries to generate EPT
misconfiguration, is implemented in a different way for TDX guests. KVM
configures EPT entries to non-present value without suppressing #VE bit.
It causes #VE in the TDX guest and the guest will call TDG.VP.VMCALL to
request MMIO emulation.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
[binbin: rework changelog]
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250227012021.
1778144-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Binbin Wu [Sat, 22 Feb 2025 01:47:54 +0000 (09:47 +0800)]
KVM: VMX: Move emulation_required to struct vcpu_vt
Move emulation_required from struct vcpu_vmx to struct vcpu_vt so that
vmx_handle_exit_irqoff() can be reused by TDX code.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:53 +0000 (09:47 +0800)]
KVM: TDX: Add methods to ignore virtual apic related operation
TDX protects TDX guest APIC state from VMM. Implement access methods of
TDX guest vAPIC state to ignore them or return zero.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:52 +0000 (09:47 +0800)]
KVM: TDX: Force APICv active for TDX guest
Force APICv active for TDX guests in KVM because APICv is always enabled
by TDX module.
From the view of KVM, whether APICv state is active or not is decided by:
1. APIC is hw enabled
2. VM and vCPU have no inhibit reasons set.
After TDX vCPU init, APIC is set to x2APIC mode. KVM_SET_{SREGS,SREGS2} are
rejected due to has_protected_state for TDs and guest_state_protected
for TDX vCPUs are set. Reject KVM_{GET,SET}_LAPIC from userspace since
migration is not supported yet, so that userspace cannot disable APIC.
For various APICv inhibit reasons:
- APICV_INHIBIT_REASON_DISABLED is impossible after checking enable_apicv
in tdx_bringup(). If !enable_apicv, TDX support will be disabled.
- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED is impossible since x2APIC is
mandatory, KVM emulates APIC_ID as read-only for x2APIC mode. (Note:
APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED could be set if the memory
allocation fails for KVM apic_map.)
- APICV_INHIBIT_REASON_HYPERV is impossible since TDX doesn't support
HyperV guest yet.
- APICV_INHIBIT_REASON_ABSENT is impossible since in-kernel LAPIC is
checked in tdx_vcpu_create().
- APICV_INHIBIT_REASON_BLOCKIRQ is impossible since TDX doesn't support
KVM_SET_GUEST_DEBUG.
- APICV_INHIBIT_REASON_APIC_ID_MODIFIED is impossible since x2APIC is
mandatory.
- APICV_INHIBIT_REASON_APIC_BASE_MODIFIED is impossible since KVM rejects
userspace to set APIC base.
- The rest inhibit reasons are relevant only to AMD's AVIC, including
APICV_INHIBIT_REASON_NESTED, APICV_INHIBIT_REASON_IRQWIN,
APICV_INHIBIT_REASON_PIT_REINJ, APICV_INHIBIT_REASON_SEV, and
APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED.
(For APICV_INHIBIT_REASON_PIT_REINJ, similar to AVIC, KVM can't intercept
EOI for TDX guests neither, but KVM enforces KVM_IRQCHIP_SPLIT for TDX
guests, which eliminates the in-kernel PIT.)
Implement vt_refresh_apicv_exec_ctrl() to call KVM_BUG_ON() if APICv is
disabled for TDX guests.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Binbin Wu [Sat, 22 Feb 2025 01:47:51 +0000 (09:47 +0800)]
KVM: TDX: Enforce KVM_IRQCHIP_SPLIT for TDX guests
Enforce KVM_IRQCHIP_SPLIT for TDX guests to disallow in-kernel I/O APIC
while in-kernel local APIC is needed.
APICv is always enabled by TDX module and TDX Module doesn't allow the
hypervisor to modify the EOI-bitmap, i.e. all EOIs are accelerated and
never trigger exits. Level-triggered interrupts and other things depending
on EOI VM-Exit can't be faithfully emulated in KVM. Also, the lazy check
of pending APIC EOI for RTC edge-triggered interrupts, which was introduced
as a workaround when EOI cannot be intercepted, doesn't work for TDX either
because kvm_apic_pending_eoi() checks vIRR and vISR, but both values are
invisible in KVM.
If the guest induces generation of a level-triggered interrupt, the VMM is
left with the choice of dropping the interrupt, sending it as-is, or
converting it to an edge-triggered interrupt. Ditto for KVM. All of those
options will make the guest unhappy. There's no architectural behavior KVM
can provide that's better than sending the interrupt and hoping for the
best.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:50 +0000 (09:47 +0800)]
KVM: TDX: Always block INIT/SIPI
Always block INIT and SIPI events for the TDX guest because the TDX module
doesn't provide API for VMM to inject INIT IPI or SIPI.
TDX defines its own vCPU creation and initialization sequence including
multiple seamcalls. Also, it's only allowed during TD build time.
Given that TDX guest is para-virtualized to boot BSP/APs, normally there
shouldn't be any INIT/SIPI event for TDX guest. If any, three options to
handle them:
1. Always block INIT/SIPI request.
2. (Silently) ignore INIT/SIPI request during delivery.
3. Return error to guest TDs somehow.
Choose option 1 for simplicity. Since INIT and SIPI are always blocked,
INIT handling and the OP vcpu_deliver_sipi_vector() won't be called, no
need to add new interface or helper function for INIT/SIPI delivery.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:49 +0000 (09:47 +0800)]
KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM
Handle SMI request as what KVM does for CONFIG_KVM_SMM=n, i.e. return
-ENOTTY, and add KVM_BUG_ON() to SMI related OPs for TD.
TDX doesn't support system-management mode (SMM) and system-management
interrupt (SMI) in guest TDs. Because guest state (vCPU state, memory
state) is protected, it must go through the TDX module APIs to change
guest state. However, the TDX module doesn't provide a way for VMM to
inject SMI into guest TD or a way for VMM to switch guest vCPU mode into
SMM.
MSR_IA32_SMBASE will not be emulated for TDX guest, -ENOTTY will be
returned when SMI is requested.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:48 +0000 (09:47 +0800)]
KVM: TDX: Implement methods to inject NMI
Inject NMI to TDX guest by setting the PEND_NMI TDVPS field to 1, i.e. make
the NMI pending in the TDX module. If there is a further pending NMI in
KVM, collapse it to the one pending in the TDX module.
VMM can request the TDX module to inject a NMI into a TDX vCPU by setting
the PEND_NMI TDVPS field to 1. Following that, VMM can call TDH.VP.ENTER
to run the vCPU and the TDX module will attempt to inject the NMI as soon
as possible.
KVM has the following 3 cases to inject two NMIs when handling simultaneous
NMIs and they need to be injected in a back-to-back way. Otherwise, OS
kernel may fire a warning about the unknown NMI [1]:
K1. One NMI is being handled in the guest and one NMI pending in KVM.
KVM requests NMI window exit to inject the pending NMI.
K2. Two NMIs are pending in KVM.
KVM injects the first NMI and requests NMI window exit to inject the
second NMI.
K3. A previous NMI needs to be rejected and one NMI pending in KVM.
KVM first requests force immediate exit followed by a VM entry to
complete the NMI rejection. Then, during the force immediate exit, KVM
requests NMI window exit to inject the pending NMI.
For TDX, PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one
NMI in the TDX module. Also, the vCPU state is protected, KVM doesn't know
the NMI blocking states of TDX vCPU, KVM has to assume NMI is always
unmasked and allowed. When KVM sees PEND_NMI is 1 after a TD exit, it
means the previous NMI needs to be re-injected.
Based on KVM's NMI handling flow, there are following 6 cases:
In NMI handler TDX module KVM
T1. No PEND_NMI=0 1 pending NMI
T2. No PEND_NMI=0 2 pending NMIs
T3. No PEND_NMI=1 1 pending NMI
T4. Yes PEND_NMI=0 1 pending NMI
T5. Yes PEND_NMI=0 2 pending NMIs
T6. Yes PEND_NMI=1 1 pending NMI
K1 is mapped to T4.
K2 is mapped to T2 or T5.
K3 is mapped to T3 or T6.
Note: KVM doesn't know whether NMI is blocked by a NMI or not, case T5 and
T6 can happen.
When handling pending NMI in KVM for TDX guest, what KVM can do is to add a
pending NMI in TDX module when PEND_NMI is 0. T1 and T4 can be handled by
this way. However, TDX doesn't allow KVM to request NMI window exit
directly, if PEND_NMI is already set and there is still pending NMI in KVM,
the only way KVM could try is to request a force immediate exit. But for
case T5 and T6, force immediate exit will result in infinite loop because
force immediate exit makes it no progress in the NMI handler, so that the
pending NMI in the TDX module can never be injected.
Considering on X86 bare metal, multiple NMIs could collapse into one NMI,
e.g. when NMI is blocked by SMI. It's OS's responsibility to poll all NMI
sources in the NMI handler to avoid missing handling of some NMI events.
Based on that, for the above 3 cases (K1-K3), only case K1 must inject the
second NMI because the guest NMI handler may have already polled some of
the NMI sources, which could include the source of the pending NMI, the
pending NMI must be injected to avoid the lost of NMI. For case K2 and K3,
the guest OS will poll all NMI sources (including the sources caused by the
second NMI and further NMI collapsed) when the delivery of the first NMI,
KVM doesn't have the necessity to inject the second NMI.
To handle the NMI injection properly for TDX, there are two options:
- Option 1: Modify the KVM's NMI handling common code, to collapse the
second pending NMI for K2 and K3.
- Option 2: Do it in TDX specific way. When the previous NMI is still
pending in the TDX module, i.e. it has not been delivered to TDX guest
yet, collapse the pending NMI in KVM into the previous one.
This patch goes with option 2 because it is simple and doesn't impact other
VM types. Option 1 may need more discussions.
This is the first need to access vCPU scope metadata in the "management"
class. Make needed accessors available.
[1] https://lore.kernel.org/all/
1317409584-23662-5-git-send-email-dzickus@redhat.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250222014757.897978-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Sat, 22 Feb 2025 01:42:25 +0000 (09:42 +0800)]
KVM: TDX: Handle TDX PV MMIO hypercall
Handle TDX PV MMIO hypercall when TDX guest calls TDVMCALL with the
leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION) according
to TDX Guest Host Communication Interface (GHCI) spec.
For TDX guests, VMM is not allowed to access vCPU registers and the private
memory, and the code instructions must be fetched from the private memory.
So MMIO emulation implemented for non-TDX VMs is not possible for TDX
guests.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest. The #VE handling is supposed to emulate the MMIO
instruction inside the guest and convert it into a TDVMCALL with the
leaf #VE.RequestMMIO, which equals to EXIT_REASON_EPT_VIOLATION.
The requested MMIO address must be in shared GPA space. The shared bit
is stripped after check because the existing code for MMIO emulation is
not aware of the shared bit.
The MMIO GPA shouldn't have a valid memslot, also the attribute of the GPA
should be shared. KVM could do the checks before exiting to userspace,
however, even if KVM does the check, there still will be race conditions
between the check in KVM and the emulation of MMIO access in userspace due
to a memslot hotplug, or a memory attribute conversion. If userspace
doesn't check the attribute of the GPA and the attribute happens to be
private, it will not pose a security risk or cause an MCE, but it can lead
to another issue. E.g., in QEMU, treating a GPA with private attribute as
shared when it falls within RAM's range can result in extra memory
consumption during the emulation to the access to the HVA of the GPA.
There are two options: 1) Do the check both in KVM and userspace. 2) Do
the check only in QEMU. This patch chooses option 2, i.e. KVM omits the
memslot and attribute checks, and expects userspace to do the checks.
Similar to normal MMIO emulation, try to handle the MMIO in kernel first,
if kernel can't support it, forward the request to userspace. Export
needed symbols used for MMIO handling.
Fragments handling is not needed for TDX PV MMIO because GPA is provided,
if a MMIO access crosses page boundary, it should be continuous in GPA.
Also, the size is limited to 1, 2, 4, 8 bytes. No further split needed.
Allow cross page access because no extra handling needed after checking
both start and end GPA are shared GPAs.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250222014225.897298-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:47 +0000 (09:47 +0800)]
KVM: TDX: Wait lapic expire when timer IRQ was injected
Call kvm_wait_lapic_expire() when POSTED_INTR_ON is set and the vector
for LVTT is set in PIR before TD entry.
KVM always assumes a timer IRQ was injected if APIC state is protected.
For TDX guest, APIC state is protected and KVM injects timer IRQ via posted
interrupt. To avoid unnecessary wait calls, only call
kvm_wait_lapic_expire() when a timer IRQ was injected, i.e., POSTED_INTR_ON
is set and the vector for LVTT is set in PIR.
Add a helper to test PIR.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:42:24 +0000 (09:42 +0800)]
KVM: TDX: Handle TDX PV port I/O hypercall
Emulate port I/O requested by TDX guest via TDVMCALL with leaf
Instruction.IO (same value as EXIT_REASON_IO_INSTRUCTION) according to
TDX Guest Host Communication Interface (GHCI).
All port I/O instructions inside the TDX guest trigger the #VE exception.
On #VE triggered by I/O instructions, TDX guest can call TDVMCALL with
leaf Instruction.IO to request VMM to emulate I/O instructions.
Similar to normal port I/O emulation, try to handle the port I/O in kernel
first, if kernel can't support it, forward the request to userspace.
Note string I/O operations are not supported in TDX. Guest should unroll
them before calling the TDVMCALL.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250222014225.897298-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Sat, 22 Feb 2025 01:47:46 +0000 (09:47 +0800)]
KVM: x86: Assume timer IRQ was injected if APIC state is protected
If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path. The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.
Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Binbin Wu [Sat, 22 Feb 2025 01:42:23 +0000 (09:42 +0800)]
KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>
Convert TDG.VP.VMCALL<ReportFatalError> to KVM_EXIT_SYSTEM_EVENT with
a new type KVM_SYSTEM_EVENT_TDX_FATAL and forward it to userspace for
handling.
TD guest can use TDG.VP.VMCALL<ReportFatalError> to report the fatal
error it has experienced. This hypercall is special because TD guest
is requesting a termination with the error information, KVM needs to
forward the hypercall to userspace anyway, KVM doesn't do parsing or
conversion, it just dumps the 16 general-purpose registers to userspace
and let userspace decide what to do.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014225.897298-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:45 +0000 (09:47 +0800)]
KVM: TDX: Implement non-NMI interrupt injection
Implement non-NMI interrupt injection for TDX via posted interrupt.
As CPU state is protected and APICv is enabled for the TDX guest, TDX
supports non-NMI interrupt injection only by posted interrupt. Posted
interrupt descriptors (PIDs) are allocated in shared memory, KVM can
update them directly. If target vCPU is in non-root mode, send posted
interrupt notification to the vCPU and hardware will sync PIR to vIRR
atomically. Otherwise, kick it to pick up the interrupt from PID. To
post pending interrupts in the PID, KVM can generate a self-IPI with
notification vector prior to TD entry.
Since the guest status of TD vCPU is protected, assume interrupt is
always allowed. Ignore the code path for event injection mechanism or
LAPIC emulation for TDX.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250222014757.897978-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Binbin Wu [Sat, 22 Feb 2025 01:42:22 +0000 (09:42 +0800)]
KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>
Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling.
MapGPA is used by TDX guest to request to map a GPA range as private
or shared memory. It needs to exit to userspace for handling. KVM has
already implemented a similar hypercall KVM_HC_MAP_GPA_RANGE, which will
exit to userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks,
convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request
to userspace.
To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA
range is split into 2MB chunks and check interrupt pending between chunks.
This allows for timely injection of interrupts and prevents issues with
guest lockup detection. TDX guest should retry the operation for the
GPA starting at the address specified in R11 when the TDVMCALL return
TDVMCALL_RETRY as status code.
Note userspace needs to enable KVM_CAP_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE bit set for TD VM.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014225.897298-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:44 +0000 (09:47 +0800)]
KVM: VMX: Move posted interrupt delivery code to common header
Move posted interrupt delivery code to common header so that TDX can
leverage it.
No functional change intended.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <
20250222014757.897978-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:42:21 +0000 (09:42 +0800)]
KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL
Handle KVM hypercall for TDX according to TDX Guest-Host Communication
Interface (GHCI) specification.
The TDX GHCI specification defines the ABI for the guest TD to issue
hypercalls. When R10 is non-zero, it indicates the TDG.VP.VMCALL is
vendor-specific. KVM uses R10 as KVM hypercall number and R11-R14
as 4 arguments, while the error code is returned in R10.
Morph the TDG.VP.VMCALL with KVM hypercall to EXIT_REASON_VMCALL and
marshall r10~r14 from vp_enter_args to the appropriate x86 registers for
KVM hypercall handling.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014225.897298-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:47:43 +0000 (09:47 +0800)]
KVM: TDX: Disable PI wakeup for IPIv
Disable PI wakeup for IPI virtualization (IPIv) case for TDX.
When a vCPU is being scheduled out, notification vector is switched and
pi_wakeup_handler() is enabled when the vCPU has interrupt enabled and
posted interrupt is used to wake up the vCPU.
For VMX, a blocked vCPU can be the target of posted interrupts when using
IPIv or VT-d PI. TDX doesn't support IPIv, disable PI wakeup for IPIv.
Also, since the guest status of TD vCPU is protected, assume interrupt is
always enabled for TD. (PV HLT hypercall is not support yet, TDX guest
tells VMM whether HLT is called with interrupt disabled or not.)
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Sat, 22 Feb 2025 01:42:20 +0000 (09:42 +0800)]
KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)
Add a place holder and related helper functions for preparation of
TDG.VP.VMCALL handling.
The TDX module specification defines TDG.VP.VMCALL API (TDVMCALL for short)
for the guest TD to call hypercall to VMM. When the guest TD issues a
TDVMCALL, the guest TD exits to VMM with a new exit reason. The arguments
from the guest TD and returned values from the VMM are passed in the guest
registers. The guest RCX register indicates which registers are used.
Define helper functions to access those registers.
A new VMX exit reason TDCALL is added to indicate the exit is due to
TDVMCALL from the guest TD. Define the TDCALL exit reason and add a place
holder to handle such exit.
Some leafs of TDCALL will be morphed to another VMX exit reason instead of
EXIT_REASON_TDCALL, add a helper tdcall_to_vmx_exit_reason() as a place
holder to do the conversion.
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <
20250222014225.897298-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Sat, 22 Feb 2025 01:47:42 +0000 (09:47 +0800)]
KVM: TDX: Add support for find pending IRQ in a protected local APIC
Add flag and hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM. As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM. To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector. And
to determine if TDX vCPU has a pending interrupt, KVM must check if there
is an outstanding notification.
Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is
pending, KVM can't do anything based on _which_ IRQ is pending.
Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.
For the TD exits not due to HLT TDCALL, skip checking RVI pending in
tdx_protected_apic_has_interrupt(). Except for the guest being stupid
(e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to
have an interrupt in RVI that is fully unmasked. There is no any CPU flows
that modify RVI in the middle of instruction execution. I.e. if RVI is
non-zero, then either the interrupt has been pending since before the TD
exit, or the instruction caused the TD exit is in an STI/SS shadow. KVM
doesn't care about STI/SS shadows outside of the HALTED case. And if the
interrupt was pending before TD exit, then it _must_ be blocked, otherwise
the interrupt would have been serviced at the instruction boundary.
For the HLT TDCALL case, it will be handled in a future patch when HLT
TDCALL is supported.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014757.897978-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Wed, 29 Jan 2025 09:59:01 +0000 (11:59 +0200)]
KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest
DRs.
TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter,
and resets DRs to architectural INIT state on TD exit. Use the new
flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to
save/restore guest DRs. KVM still needs to restore host DRs after TD
exit if there are active breakpoints in the host, which is covered by
the existing code.
MOV-DR exiting is always cleared for TDX guests, so the handler for DR
access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add
a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH
are set.
Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20241210004946.
3718496-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250129095902.16391-13-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Thu, 6 Mar 2025 18:27:04 +0000 (13:27 -0500)]
KVM: TDX: Add a place holder to handle TDX VM exit
Introduce the wiring for handling TDX VM exits by implementing the
callbacks .get_exit_info(), .get_entry_info(), and .handle_exit().
Additionally, add error handling during the TDX VM exit flow, and add a
place holder to handle various exit reasons.
Store VMX exit reason and exit qualification in struct vcpu_vt for TDX,
so that TDX/VMX can use the same helpers to get exit reason and exit
qualification. Store extended exit qualification and exit GPA info in
struct vcpu_tdx because they are used by TDX code only.
Contention Handling: The TDH.VP.ENTER operation may contend with TDH.MEM.*
operations due to secure EPT or TD EPOCH. If the contention occurs,
the return value will have TDX_OPERAND_BUSY set, prompting the vCPU to
attempt re-entry into the guest with EXIT_FASTPATH_EXIT_HANDLED,
not EXIT_FASTPATH_REENTER_GUEST, so that the interrupts pending during
IN_GUEST_MODE can be delivered for sure. Otherwise, the requester of
KVM_REQ_OUTSIDE_GUEST_MODE may be blocked endlessly.
Error Handling:
- TDX_SW_ERROR: This includes #UD caused by SEAMCALL instruction if the
CPU isn't in VMX operation, #GP caused by SEAMCALL instruction when TDX
isn't enabled by the BIOS, and TDX_SEAMCALL_VMFAILINVALID when SEAM
firmware is not loaded or disabled.
- TDX_ERROR: This indicates some check failed in the TDX module, preventing
the vCPU from running.
- Failed VM Entry: Exit to userspace with KVM_EXIT_FAIL_ENTRY. Handle it
separately before handling TDX_NON_RECOVERABLE because when off-TD debug
is not enabled, TDX_NON_RECOVERABLE is set.
- TDX_NON_RECOVERABLE: Set by the TDX module when the error is
non-recoverable, indicating that the TDX guest is dead or the vCPU is
disabled.
A special case is triple fault, which also sets TDX_NON_RECOVERABLE but
exits to userspace with KVM_EXIT_SHUTDOWN, aligning with the VMX case.
- Any unhandled VM exit reason will also return to userspace with
KVM_EXIT_INTERNAL_ERROR.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <
20250222014225.897298-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adrian Hunter [Wed, 29 Jan 2025 09:59:00 +0000 (11:59 +0200)]
KVM: TDX: Save and restore IA32_DEBUGCTL
Save the IA32_DEBUGCTL MSR before entering a TDX VCPU and restore it
afterwards. The TDX Module preserves bits 1, 12, and 14, so if no
other bits are set, no restore is done.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <
20250129095902.16391-12-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Binbin Wu [Sat, 22 Feb 2025 01:42:18 +0000 (09:42 +0800)]
KVM: x86: Move pv_unhalted check out of kvm_vcpu_has_events()
Move pv_unhalted check out of kvm_vcpu_has_events(), check pv_unhalted
explicitly when handling PV unhalt and expose kvm_vcpu_has_events().
kvm_vcpu_has_events() returns true if pv_unhalted is set, and pv_unhalted
is only cleared on transitions to KVM_MP_STATE_RUNNABLE. If the guest
initiates a spurious wakeup, pv_unhalted could be left set in perpetuity.
Currently, this is not problematic because kvm_vcpu_has_events() is only
called when handling PV unhalt. However, if kvm_vcpu_has_events() is used
for other purposes in the future, it could return the unexpected results.
Export kvm_vcpu_has_events() for its usage in broader contexts.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20250222014225.897298-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adrian Hunter [Wed, 29 Jan 2025 09:58:59 +0000 (11:58 +0200)]
KVM: TDX: Disable support for TSX and WAITPKG
Support for restoring IA32_TSX_CTRL MSR and IA32_UMWAIT_CONTROL MSR is not
yet implemented, so disable support for TSX and WAITPKG for now. Clear the
associated CPUID bits returned by KVM_TDX_CAPABILITIES, and return an error
if those bits are set in KVM_TDX_INIT_VM.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <
20250129095902.16391-11-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Binbin Wu [Sat, 22 Feb 2025 01:42:17 +0000 (09:42 +0800)]
KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs
Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
in via the macro.
When emulating KVM hypercalls via TDVMCALL, TDX will marshall registers of
TDVMCALL ABI into KVM's x86 registers to match the definition of KVM
hypercall ABI _before_ ____kvm_emulate_hypercall() gets called. Therefore,
____kvm_emulate_hypercall() can just read registers internally based on KVM
hypercall ABI, and those registers can be removed from the
__kvm_emulate_hypercall() macro.
Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(),
remove it from the __kvm_emulate_hypercall() macro as well.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <
20250222014225.897298-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Wed, 29 Jan 2025 09:58:58 +0000 (11:58 +0200)]
KVM: TDX: restore user ret MSRs
Several MSRs are clobbered on TD exit that are not used by Linux while
in ring 0. Ensure the cached value of the MSR is updated on vcpu_put,
and the MSRs themselves before returning to ring 3.
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250129095902.16391-10-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Chao Gao [Wed, 29 Jan 2025 09:58:57 +0000 (11:58 +0200)]
KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
Several MSRs are constant and only used in userspace(ring 3). But VMs may
have different values. KVM uses kvm_set_user_return_msr() to switch to
guest's values and leverages user return notifier to restore them when the
kernel is to return to userspace. To eliminate unnecessary wrmsr, KVM also
caches the value it wrote to an MSR last time.
TDX module unconditionally resets some of these MSRs to architectural INIT
state on TD exit. It makes the cached values in kvm_user_return_msrs are
inconsistent with values in hardware. This inconsistency needs to be
fixed. Otherwise, it may mislead kvm_on_user_return() to skip restoring
some MSRs to the host's values. kvm_set_user_return_msr() can help correct
this case, but it is not optimal as it always does a wrmsr. So, introduce
a variation of kvm_set_user_return_msr() to update cached values and skip
that wrmsr.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250129095902.16391-9-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Wed, 29 Jan 2025 09:58:56 +0000 (11:58 +0200)]
KVM: TDX: restore host xsave state when exit from the guest TD
On exiting from the guest TD, xsave state is clobbered; restore it.
Do not use kvm_load_host_xsave_state(), as it relies on vcpu->arch
to find out whether other KVM_RUN code has loaded guest state into
XCR0/PKRU/XSS or not. In the case of TDX, the exit values are known
independent of the guest CR0 and CR4, and in fact the latter are not
available.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <
20250129095902.16391-8-adrian.hunter@intel.com>
[Rewrite to not use kvm_load_host_xsave_state. - Paolo]
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Wed, 29 Jan 2025 09:58:55 +0000 (11:58 +0200)]
KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
On entering/exiting TDX vcpu, preserved or clobbered CPU state is different
from the VMX case. Add TDX hooks to save/restore host/guest CPU state.
Save/restore kernel GS base MSR.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20250129095902.16391-7-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Wed, 29 Jan 2025 09:58:54 +0000 (11:58 +0200)]
KVM: TDX: Implement TDX vcpu enter/exit path
Implement callbacks to enter/exit a TDX VCPU by calling tdh_vp_enter().
Ensure the TDX VCPU is in a correct state to run.
Do not pass arguments from/to vcpu->arch.regs[] unconditionally. Instead,
marshall state to/from the appropriate x86 registers only when needed,
i.e., to handle some TDVMCALL sub-leaves following KVM's ABI to leverage
the existing code.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <
20250129095902.16391-6-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Binbin Wu [Fri, 14 Mar 2025 18:06:48 +0000 (14:06 -0400)]
KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct
Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct
vcpu_vt, to share the code between VMX/TDX as much as possible and to make
TDX exit handling more VMX like.
No functional change intended.
[Adrian: move code that depends on struct vcpu_vmx back to vmx.h]
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <
20250129095902.16391-5-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Mon, 17 Feb 2025 08:56:42 +0000 (16:56 +0800)]
KVM: TDX: Handle SEPT zap error due to page add error in premap
Move the handling of SEPT zap errors caused by unsuccessful execution of
tdh_mem_page_add() in KVM_TDX_INIT_MEM_REGION from
tdx_sept_drop_private_spte() to tdx_sept_zap_private_spte(). Introduce a
new helper function tdx_is_sept_zap_err_due_to_premap() to detect this
specific error.
During the IOCTL KVM_TDX_INIT_MEM_REGION, KVM premaps leaf SPTEs in the
mirror page table before the corresponding entry in the private page table
is successfully installed by tdh_mem_page_add(). If an error occurs during
the invocation of tdh_mem_page_add(), a mismatch between the mirror and
private page tables results in SEAMCALLs for SEPT zap returning the error
code TDX_EPT_ENTRY_STATE_INCORRECT.
The error TDX_EPT_WALK_FAILED is not possible because, during
KVM_TDX_INIT_MEM_REGION, KVM only premaps leaf SPTEs after successfully
mapping non-leaf SPTEs. Unlike leaf SPTEs, there is no mismatch in non-leaf
PTEs between the mirror and private page tables. Therefore, during zap,
SEAMCALLs should find an empty leaf entry in the private EPT, leading to
the error TDX_EPT_ENTRY_STATE_INCORRECT instead of TDX_EPT_WALK_FAILED.
Since tdh_mem_range_block() is always invoked before tdh_mem_page_remove(),
move the handling of SEPT zap errors from tdx_sept_drop_private_spte() to
tdx_sept_zap_private_spte(). In tdx_sept_zap_private_spte(), return 0 for
errors due to premap to skip executing other SEAMCALLs for zap, which are
unnecessary. Return 1 to indicate no other errors, allowing the execution
of other zap SEAMCALLs to continue.
The failure of tdh_mem_page_add() is uncommon and has not been observed in
real workloads. Currently, this failure is only hypothetically triggered by
skipping the real SEAMCALL and faking the add error in the SEAMCALL
wrapper. Additionally, without this fix, there will be no host crashes or
other severe issues.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20250217085642.19696-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Kai Huang [Thu, 21 Nov 2024 20:14:40 +0000 (22:14 +0200)]
x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest
Intel TDX protects guest VM's from malicious host and certain physical
attacks. TDX introduces a new operation mode, Secure Arbitration Mode
(SEAM) to isolate and protect guest VM's. A TDX guest VM runs in SEAM and,
unlike VMX, direct control and interaction with the guest by the host VMM
is not possible. Instead, Intel TDX Module, which also runs in SEAM,
provides a SEAMCALL API.
The SEAMCALL that provides the ability to enter a guest is TDH.VP.ENTER.
The TDX Module processes TDH.VP.ENTER, and enters the guest via VMX
VMLAUNCH/VMRESUME instructions. When a guest VM-exit requires host VMM
interaction, the TDH.VP.ENTER SEAMCALL returns to the host VMM (KVM).
Add tdh_vp_enter() to wrap the SEAMCALL invocation of TDH.VP.ENTER;
tdh_vp_enter() needs to be noinstr because VM entry in KVM is noinstr
as well, which is for two reasons:
* marking the area as CT_STATE_GUEST via guest_state_enter_irqoff() and
guest_state_exit_irqoff()
* IRET must be avoided between VM-exit and NMI handling, in order to
avoid prematurely releasing the NMI inhibit.
TDH.VP.ENTER is different from other SEAMCALLs in several ways: it
uses more arguments, and after it returns some host state may need to be
restored. Therefore tdh_vp_enter() uses __seamcall_saved_ret() instead of
__seamcall_ret(); since it is the only caller of __seamcall_saved_ret(),
it can be made noinstr also.
TDH.VP.ENTER arguments are passed through General Purpose Registers (GPRs).
For the special case of the TD guest invoking TDG.VP.VMCALL, nearly any GPR
can be used, as well as XMM0 to XMM15. Notably, RBP is not used, and Linux
mandates the TDX Module feature NO_RBP_MOD, which is enforced elsewhere.
Additionally, XMM registers are not required for the existing Guest
Hypervisor Communication Interface and are handled by existing KVM code
should they be modified by the guest.
There are 2 input formats and 5 output formats for TDH.VP.ENTER arguments.
Input #1 : Initial entry or following a previous async. TD Exit
Input #2 : Following a previous TDCALL(TDG.VP.VMCALL)
Output #1 : On Error (No TD Entry)
Output #2 : Async. Exits with a VMX Architectural Exit Reason
Output #3 : Async. Exits with a non-VMX TD Exit Status
Output #4 : Async. Exits with Cross-TD Exit Details
Output #5 : On TDCALL(TDG.VP.VMCALL)
Currently, to keep things simple, the wrapper function does not attempt
to support different formats, and just passes all the GPRs that could be
used. The GPR values are held by KVM in the area set aside for guest
GPRs. KVM code uses the guest GPR area (vcpu->arch.regs[]) to set up for
or process results of tdh_vp_enter().
Therefore changing tdh_vp_enter() to use more complex argument formats
would also alter the way KVM code interacts with tdh_vp_enter().
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <
20241121201448.36170-2-adrian.hunter@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 19 Feb 2025 12:43:51 +0000 (07:43 -0500)]
KVM: TDX: Skip updating CPU dirty logging request for TDs
Wrap vmx_update_cpu_dirty_logging so as to ignore requests to update
CPU dirty logging for TDs, as basic TDX does not support the PML feature.
Invoking vmx_update_cpu_dirty_logging() for TDs would cause an incorrect
access to a kvm_vmx struct for a TDX VM, so block that before it happens.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Fri, 24 Jan 2025 01:58:26 +0000 (17:58 -0800)]
KVM: x86: Make cpu_dirty_log_size a per-VM value
Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and
set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled.
Do not set it for TDs.
Until now, cpu_dirty_log_size was a system-wide value that is used for
all VMs and is set to the PML buffer size when PML was enabled in VMX.
However, PML is not currently supported for TDs, though PML remains
available for normal VMs as long as the feature is supported by hardware
and enabled in VMX.
Making cpu_dirty_log_size a per-VM value allows it to be ther PML buffer
size for normal VMs and 0 for TDs. This allows functions like
kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to
determine if PML is supported, in order to kick off vCPUs or request them
to update CPU dirty logging status (turn on/off PML in VMCS).
This fixes an issue first reported in [1], where QEMU attaches an
emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES
still works if the corresponding has no flag KVM_MEM_GUEST_MEMFD.
KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there
vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx
struct for a TDX VM.
Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com>
Reported-by: Pedro Principeza <pedro.principeza@canonical.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Closes: https://github.com/canonical/tdx/issues/202
Link: https://github.com/canonical/tdx/issues/202
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Mon, 13 Jan 2025 03:08:59 +0000 (11:08 +0800)]
KVM: x86/mmu: Add parameter "kvm" to kvm_mmu_page_ad_need_write_protect()
Add a parameter "kvm" to kvm_mmu_page_ad_need_write_protect() and its
caller tdp_mmu_need_write_protect().
This is a preparation to make cpu_dirty_log_size a per-VM value rather than
a system-wide value.
No function changes expected.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Mon, 13 Jan 2025 03:08:41 +0000 (11:08 +0800)]
KVM: Add parameter "kvm" to kvm_cpu_dirty_log_size() and its callers
Add a parameter "kvm" to kvm_cpu_dirty_log_size() and down to its callers:
kvm_dirty_ring_get_rsvd_entries(), kvm_dirty_ring_alloc().
This is a preparation to make cpu_dirty_log_size a per-VM value rather than
a system-wide value.
No function changes expected.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:38:58 +0000 (15:38 +0800)]
KVM: TDX: Handle vCPU dissociation
Handle vCPUs dissociations by invoking SEAMCALL TDH.VP.FLUSH which flushes
the address translation caches and cached TD VMCS of a TD vCPU in its
associated pCPU.
In TDX, a vCPUs can only be associated with one pCPU at a time, which is
done by invoking SEAMCALL TDH.VP.ENTER. For a successful association, the
vCPU must be dissociated from its previous associated pCPU.
To facilitate vCPU dissociation, introduce a per-pCPU list
associated_tdvcpus. Add a vCPU into this list when it's loaded into a new
pCPU (i.e. when a vCPU is loaded for the first time or migrated to a new
pCPU).
vCPU dissociations can happen under below conditions:
- On the op hardware_disable is called.
This op is called when virtualization is disabled on a given pCPU, e.g.
when hot-unplug a pCPU or machine shutdown/suspend.
In this case, dissociate all vCPUs from the pCPU by iterating its
per-pCPU list associated_tdvcpus.
- On vCPU migration to a new pCPU.
Before adding a vCPU into associated_tdvcpus list of the new pCPU,
dissociation from its old pCPU is required, which is performed by issuing
an IPI and executing SEAMCALL TDH.VP.FLUSH on the old pCPU.
On a successful dissociation, the vCPU will be removed from the
associated_tdvcpus list of its previously associated pCPU.
- On tdx_mmu_release_hkid() is called.
TDX mandates that all vCPUs must be disassociated prior to the release of
an hkid. Therefore, dissociation of all vCPUs is a must before executing
the SEAMCALL TDH.MNG.VPFLUSHDONE and subsequently freeing the hkid.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073858.22312-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Wed, 4 Sep 2024 03:07:50 +0000 (20:07 -0700)]
KVM: TDX: Finalize VM initialization
Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
Documentation for the API is added in another patch:
"Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"
For the purpose of attestation, a measurement must be made of the TDX VM
initial state. This is referred to as TD Measurement Finalization, and
uses SEAMCALL TDH.MR.FINALIZE, after which:
1. The VMM adding TD private pages with arbitrary content is no longer
allowed
2. The TDX VM is runnable
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <
20240904030751.117579-21-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:38:37 +0000 (15:38 +0800)]
KVM: TDX: Add an ioctl to create initial guest memory
Add a new ioctl for the user space VMM to initialize guest memory with the
specified memory contents.
Because TDX protects the guest's memory, the creation of the initial guest
memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
directly copying the memory contents into the guest's memory in the case of
the default VM type.
Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
KVM_MEMORY_ENCRYPT_OP. Check if the GFN is already pre-allocated, assign
the guest page in Secure-EPT, copy the initial memory contents into the
guest memory, and encrypt the guest memory. Optionally, extend the memory
measurement of the TDX guest.
The ioctl uses the vCPU file descriptor because of the TDX module's
requirement that the memory is added to the S-EPT (via TDH.MEM.SEPT.ADD)
prior to initialization (TDH.MEM.PAGE.ADD). Accessing the MMU in turn
requires a vCPU file descriptor, just like for KVM_PRE_FAULT_MEMORY. In
fact, the post-populate callback is able to reuse the same logic used by
KVM_PRE_FAULT_MEMORY, so that userspace can do everything with a single
ioctl.
Note that this is the only way to invoke TDH.MEM.SEPT.ADD before the TD
in finalized, as userspace cannot use KVM_PRE_FAULT_MEMORY at that
point. This ensures that there cannot be pages in the S-EPT awaiting
TDH.MEM.PAGE.ADD, which would be treated incorrectly as spurious by
tdp_mmu_map_handle_target_level() (KVM would see the SPTE as PRESENT,
but the corresponding S-EPT entry will be !PRESENT).
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
- KVM_BUG_ON() for kvm_tdx->nr_premapped (Paolo)
- Use tdx_operand_busy()
- Merge first patch in SEPT SEAMCALL retry series in to this base
(Paolo)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rick Edgecombe [Tue, 12 Nov 2024 07:38:27 +0000 (15:38 +0800)]
KVM: x86/mmu: Export kvm_tdp_map_page()
In future changes coco specific code will need to call kvm_tdp_map_page()
from within their respective gmem_post_populate() callbacks. Export it
so this can be done from vendor specific code. Since kvm_mmu_reload()
will be needed for this operation, export its callee kvm_mmu_load() as
well.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073827.22270-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Thu, 20 Feb 2025 10:27:27 +0000 (18:27 +0800)]
KVM: x86/mmu: Bail out kvm_tdp_map_page() when VM dead
Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise,
kvm_tdp_map_page() may get stuck in the kernel loop when there's only one
vCPU in the VM (or if the other vCPUs are not executing ioctls), even if
fatal errors have occurred.
kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX
ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY
is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY,
regardless of the specific error code from tdp_mmu_set_spte_atomic(),
tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable
in general cases where the only possible error code from these functions is
-EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL
errors.
Since this -EIO error is also a fatal error, check for VM dead in the
kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending.
The error -EIO is uncommon and has not been observed in real workloads.
Currently, it is only hypothetically triggered by bypassing the real
SEAMCALL and faking an error in the SEAMCALL wrapper.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20250220102728.24546-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:38:16 +0000 (15:38 +0800)]
KVM: TDX: Implement hook to get max mapping level of private pages
Implement hook private_max_mapping_level for TDX to let TDP MMU core get
max mapping level of private pages.
The value is hard coded to 4K for no huge page support for now.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20241112073816.22256-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:38:04 +0000 (15:38 +0800)]
KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
Implement hooks in TDX to propagate changes of mirror page table to private
EPT, including changes for page table page adding/removing, guest page
adding/removing.
TDX invokes corresponding SEAMCALLs in the hooks.
- Hook link_external_spt
propagates adding page table page into private EPT.
- Hook set_external_spte
tdx_sept_set_private_spte() in this patch only handles adding of guest
private page when TD is finalized.
Later patches will handle the case of adding guest private pages before
TD finalization.
- Hook free_external_spt
It is invoked when page table page is removed in mirror page table, which
currently must occur at TD tear down phase, after hkid is freed.
- Hook remove_external_spte
It is invoked when guest private page is removed in mirror page table,
which can occur when TD is active, e.g. during shared <-> private
conversion and slot move/deletion.
This hook is ensured to be triggered before hkid is freed, because
gmem fd is released along with all private leaf mappings zapped before
freeing hkid at VM destroy.
TDX invokes below SEAMCALLs sequentially:
1) TDH.MEM.RANGE.BLOCK (remove RWX bits from a private EPT entry),
2) TDH.MEM.TRACK (increases TD epoch)
3) TDH.MEM.PAGE.REMOVE (remove the private EPT entry and untrack the
guest page).
TDH.MEM.PAGE.REMOVE can't succeed without TDH.MEM.RANGE.BLOCK and
TDH.MEM.TRACK being called successfully.
SEAMCALL TDH.MEM.TRACK is called in function tdx_track() to enforce that
TLB tracking will be performed by TDX module for private EPT.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
- Remove TDX_ERROR_SEPT_BUSY and Add tdx_operand_busy() helper (Binbin)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:37:53 +0000 (15:37 +0800)]
KVM: TDX: Handle TLB tracking for TDX
Handle TLB tracking for TDX by introducing function tdx_track() for private
memory TLB tracking and implementing flush_tlb* hooks to flush TLBs for
shared memory.
Introduce function tdx_track() to do TLB tracking on private memory, which
basically does two things: calling TDH.MEM.TRACK to increase TD epoch and
kicking off all vCPUs. The private EPT will then be flushed when each vCPU
re-enters the TD. This function is unused temporarily in this patch and
will be called on a page-by-page basis on removal of private guest page in
a later patch.
In earlier revisions, tdx_track() relied on an atomic counter to coordinate
the synchronization between the actions of kicking off vCPUs, incrementing
the TD epoch, and the vCPUs waiting for the incremented TD epoch after
being kicked off.
However, the core MMU only actually needs to call tdx_track() while
aleady under a write mmu_lock. So this sychnonization can be made to be
unneeded. vCPUs are kicked off only after the successful execution of
TDH.MEM.TRACK, eliminating the need for vCPUs to wait for TDH.MEM.TRACK
completion after being kicked off. tdx_track() is therefore able to send
requests KVM_REQ_OUTSIDE_GUEST_MODE rather than KVM_REQ_TLB_FLUSH.
Hooks for flush_remote_tlb and flush_remote_tlbs_range are not necessary
for TDX, as tdx_track() will handle TLB tracking of private memory on
page-by-page basis when private guest pages are removed. There is no need
to invoke tdx_track() again in kvm_flush_remote_tlbs() even after changes
to the mirrored page table.
For hooks flush_tlb_current and flush_tlb_all, which are invoked during
kvm_mmu_load() and vcpu load for normal VMs, let VMM to flush all EPTs in
the two hooks for simplicity, since TDX does not depend on the two
hooks to notify TDX module to flush private EPT in those cases.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073753.22228-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:37:43 +0000 (15:37 +0800)]
KVM: TDX: Set per-VM shadow_mmio_value to 0
Set per-VM shadow_mmio_value to 0 for TDX.
With enable_mmio_caching on, KVM installs MMIO SPTEs for TDs. To correctly
configure MMIO SPTEs, TDX requires the per-VM shadow_mmio_value to be set
to 0. This is necessary to override the default value of the suppress VE
bit in the SPTE, which is 1, and to ensure value 0 in RWX bits.
For MMIO SPTE, the spte value changes as follows:
1. initial value (suppress VE bit is set)
2. Guest issues MMIO and triggers EPT violation
3. KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
4. Guest MMIO resumes. It triggers VE exception in guest TD
5. Guest VE handler issues TDG.VP.VMCALL<MMIO>
6. KVM handles MMIO
7. Guest VE handler resumes its execution after MMIO instruction
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20241112073743.22214-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:37:30 +0000 (15:37 +0800)]
KVM: x86/mmu: Add setter for shadow_mmio_value
Future changes will want to set shadow_mmio_value from TDX code. Add a
helper to setter with a name that makes more sense from that context.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[split into new patch]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20241112073730.22200-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:37:20 +0000 (15:37 +0800)]
KVM: TDX: Require TDP MMU, mmio caching and EPT A/D bits for TDX
Disable TDX support when TDP MMU or mmio caching or EPT A/D bits aren't
supported.
As TDP MMU is becoming main stream than the legacy MMU, the legacy MMU
support for TDX isn't implemented.
TDX requires KVM mmio caching. Without mmio caching, KVM will go to MMIO
emulation without installing SPTEs for MMIOs. However, TDX guest is
protected and KVM would meet errors when trying to emulate MMIOs for TDX
guest during instruction decoding. So, TDX guest relies on SPTEs being
installed for MMIOs, which are with no RWX bits and with VE suppress bit
unset, to inject VE to TDX guest. The TDX guest would then issue TDVMCALL
in the VE handler to perform instruction decoding and have host do MMIO
emulation.
TDX also relies on EPT A/D bits as EPT A/D bits have been supported in all
CPUs since Haswell. Relying on it can avoid RWX bits being masked out in
the mirror page table for prefaulted entries.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
Requested by Sean at [1].
[1] https://lore.kernel.org/kvm/Zva4aORxE9ljlMNe@google.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:36:13 +0000 (15:36 +0800)]
KVM: TDX: Set gfn_direct_bits to shared bit
Make the direct root handle memslot GFNs at an alias with the TDX shared
bit set.
For TDX shared memory, the memslot GFNs need to be mapped at an alias with
the shared bit set. These shared mappings will be mapped on the KVM MMU's
"direct" root. The direct root has it's mappings shifted by applying
"gfn_direct_bits" as a mask. The concept of "GPAW" (guest physical address
width) determines the location of the shared bit. So set gfn_direct_bits
based on this, to map shared memory at the proper GPA.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20241112073613.22100-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Tue, 12 Nov 2024 07:36:01 +0000 (15:36 +0800)]
KVM: TDX: Add load_mmu_pgd method for TDX
TDX uses two EPT pointers, one for the private half of the GPA space and
one for the shared half. The private half uses the normal EPT_POINTER vmcs
field, which is managed in a special way by the TDX module. For TDX, KVM is
not allowed to operate on it directly. The shared half uses a new
SHARED_EPT_POINTER field and will be managed by the conventional MMU
management operations that operate directly on the EPT root. This means for
TDX the .load_mmu_pgd() operation will need to know to use the
SHARED_EPT_POINTER field instead of the normal one. Add a new wrapper in
x86 ops for load_mmu_pgd() that either directs the write to the existing
vmx implementation or a TDX one.
tdx_load_mmu_pgd() is so much simpler than vmx_load_mmu_pgd() since for the
TDX mode of operation, EPT will always be used and KVM does not need to be
involved in virtualization of CR3 behavior. So tdx_load_mmu_pgd() can
simply write to SHARED_EPT_POINTER.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20241112073601.22084-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:35:50 +0000 (15:35 +0800)]
KVM: TDX: Add accessors VMX VMCS helpers
TDX defines SEAMCALL APIs to access TDX control structures corresponding to
the VMX VMCS. Introduce helper accessors to hide its SEAMCALL ABI details.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073551.22070-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rick Edgecombe [Tue, 12 Nov 2024 07:35:39 +0000 (15:35 +0800)]
KVM: VMX: Teach EPT violation helper about private mem
Teach EPT violation helper to check shared mask of a GPA to find out
whether the GPA is for private memory.
When EPT violation is triggered after TD accessing a private GPA, KVM will
exit to user space if the corresponding GFN's attribute is not private.
User space will then update GFN's attribute during its memory conversion
process. After that, TD will re-access the private GPA and trigger EPT
violation again. Only with GFN's attribute matches to private, KVM will
fault in private page, map it in mirrored TDP root, and propagate changes
to private EPT to resolve the EPT violation.
Relying on GFN's attribute tracking xarray to determine if a GFN is
private, as for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT
violations.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073539.22056-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Tue, 12 Nov 2024 07:35:28 +0000 (15:35 +0800)]
KVM: VMX: Split out guts of EPT violation to common/exposed function
The difference of TDX EPT violation is how to retrieve information, GPA,
and exit qualification. To share the code to handle EPT violation, split
out the guts of EPT violation handler so that VMX/TDX exit handler can call
it after retrieving GPA and exit qualification.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <
20241112073528.22042-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yan Zhao [Tue, 12 Nov 2024 07:35:15 +0000 (15:35 +0800)]
KVM: x86/mmu: Do not enable page track for TD guest
Fail kvm_page_track_write_tracking_enabled() if VM type is TDX to make the
external page track user fail in kvm_page_track_register_notifier() since
TDX does not support write protection and hence page track.
No need to fail KVM internal users of page track (i.e. for shadow page),
because TDX is always with EPT enabled and currently TDX module does not
emulate and send VMLAUNCH/VMRESUME VMExits to VMM.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
Message-ID: <
20241112073515.22028-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:34:57 +0000 (15:34 +0800)]
KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
Export a function to walk down the TDP without modifying it and simply
check if a GPA is mapped.
Future changes will support pre-populating TDX private memory. In order to
implement this KVM will need to check if a given GFN is already
pre-populated in the mirrored EPT. [1]
There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
the KVM MMU that almost does what is required. However, to make sense of
the results, MMU internal PTE helpers are needed. Refactor the code to
provide a helper that can be used outside of the KVM MMU code.
Refactoring the KVM page fault handler to support this lookup usage was
also considered, but it was an awkward fit.
kvm_tdp_mmu_gpa_is_mapped() is based on a diff by Paolo Bonzini.
Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <
20241112073457.22011-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rick Edgecombe [Tue, 12 Nov 2024 07:34:26 +0000 (15:34 +0800)]
KVM: x86/mmu: Implement memslot deletion for TDX
Update attr_filter field to zap both private and shared mappings for TDX
when memslot is deleted.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073426.21997-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:37:08 +0000 (15:37 +0800)]
x86/virt/tdx: Add SEAMCALL wrappers for TD measurement of initial contents
The TDX module measures the TD during the build process and saves the
measurement in TDCS.MRTD to facilitate TD attestation of the initial
contents of the TD. Wrap the SEAMCALL TDH.MR.EXTEND with tdh_mr_extend()
and TDH.MR.FINALIZE with tdh_mr_finalize() to enable the host kernel to
assist the TDX module in performing the measurement.
The measurement in TDCS.MRTD is a SHA-384 digest of the build process.
SEAMCALLs TDH.MNG.INIT and TDH.MEM.PAGE.ADD initialize and contribute to
the MRTD digest calculation.
The caller of tdh_mr_extend() should break the TD private page into chunks
of size TDX_EXTENDMR_CHUNKSIZE and invoke tdh_mr_extend() to add the page
content into the digest calculation. Failures are possible with
TDH.MR.EXTEND (e.g., due to SEPT walking). The caller of tdh_mr_extend()
can check the function return value and retrieve extended error information
from the function output parameters.
Calling tdh_mr_finalize() completes the measurement. The TDX module then
turns the TD into the runnable state. Further TDH.MEM.PAGE.ADD and
TDH.MR.EXTEND calls will fail.
TDH.MR.FINALIZE may fail due to errors such as the TD having no vCPUs or
contentions. Check function return value when calling tdh_mr_finalize() to
determine the exact reason for failure. Take proper locks on the caller's
side to avoid contention failures, or handle the BUSY error in specific
ways (e.g., retry). Return the SEAMCALL error code directly to the caller.
Do not attempt to handle it in the core kernel.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073709.22171-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zhiming Hu [Wed, 19 Feb 2025 14:02:51 +0000 (09:02 -0500)]
KVM: TDX: Register TDX host key IDs to cgroup misc controller
TDX host key IDs (HKID) are limit resources in a machine, and the misc
cgroup lets the machine owner track their usage and limits the possibility
of abusing them outside the owner's control.
The cgroup v2 miscellaneous subsystem was introduced to control the
resource of AMD SEV & SEV-ES ASIDs. Likewise introduce HKIDs as a misc
resource.
Signed-off-by: Zhiming Hu <zhiming.hu@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:36:58 +0000 (15:36 +0800)]
x86/virt/tdx: Add SEAMCALL wrappers to remove a TD private page
TDX architecture introduces the concept of private GPA vs shared GPA,
depending on the GPA.SHARED bit. The TDX module maintains a single Secure
EPT (S-EPT or SEPT) tree per TD to translate TD's private memory accessed
using a private GPA. Wrap the SEAMCALL TDH.MEM.PAGE.REMOVE with
tdh_mem_page_remove() and TDH_PHYMEM_PAGE_WBINVD with
tdh_phymem_page_wbinvd_hkid() to unmap a TD private page from the SEPT,
remove the TD private page from the TDX module and flush cache lines to
memory after removal of the private page.
Callers should specify "GPA" and "level" when calling tdh_mem_page_remove()
to indicate to the TDX module which TD private page to unmap and remove.
TDH.MEM.PAGE.REMOVE may fail, and the caller of tdh_mem_page_remove() can
check the function return value and retrieve extended error information
from the function output parameters. Follow the TLB tracking protocol
before calling tdh_mem_page_remove() to remove a TD private page to avoid
SEAMCALL failure.
After removing a TD's private page, the TDX module does not write back and
invalidate cache lines associated with the page and the page's keyID (i.e.,
the TD's guest keyID). Therefore, provide tdh_phymem_page_wbinvd_hkid() to
allow the caller to pass in the TD's guest keyID and invoke
TDH_PHYMEM_PAGE_WBINVD to perform this action.
Before reusing the page, the host kernel needs to map the page with keyID 0
and invoke movdir64b() to convert the TD private page to a normal shared
page.
TDH.MEM.PAGE.REMOVE and TDH_PHYMEM_PAGE_WBINVD may meet contentions inside
the TDX module for TDX's internal resources. To avoid staying in SEAM mode
for too long, TDX module will return a BUSY error code to the kernel
instead of spinning on the locks. The caller may need to handle this error
in specific ways (e.g., retry). The wrappers return the SEAMCALL error code
directly to the caller. Don't attempt to handle it in the core kernel.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073658.22157-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiaoyao Li [Wed, 30 Oct 2024 19:00:38 +0000 (12:00 -0700)]
KVM: x86/mmu: Taking guest pa into consideration when calculate tdp level
For TDX, the maxpa (CPUID.0x80000008.EAX[7:0]) is fixed as native and
the max_gpa (CPUID.0x80000008.EAX[23:16]) is configurable and used
to configure the EPT level and GPAW.
Use max_gpa to determine the TDP level.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 14 Jan 2025 21:44:26 +0000 (16:44 -0500)]
x86/virt/tdx: Add SEAMCALL wrappers to manage TDX TLB tracking
TDX module defines a TLB tracking protocol to make sure that no logical
processor holds any stale Secure EPT (S-EPT or SEPT) TLB translations for a
given TD private GPA range. After a successful TDH.MEM.RANGE.BLOCK,
TDH.MEM.TRACK, and kicking off all vCPUs, TDX module ensures that the
subsequent TDH.VP.ENTER on each vCPU will flush all stale TLB entries for
the specified GPA ranges in TDH.MEM.RANGE.BLOCK. Wrap the
TDH.MEM.RANGE.BLOCK with tdh_mem_range_block() and TDH.MEM.TRACK with
tdh_mem_track() to enable the kernel to assist the TDX module in TLB
tracking management.
The caller of tdh_mem_range_block() needs to specify "GPA" and "level" to
request the TDX module to block the subsequent creation of TLB translation
for a GPA range. This GPA range can correspond to a SEPT page or a TD
private page at any level.
Contentions and errors are possible with the SEAMCALL TDH.MEM.RANGE.BLOCK.
Therefore, the caller of tdh_mem_range_block() needs to check the function
return value and retrieve extended error info from the function output
params.
Upon TDH.MEM.RANGE.BLOCK success, no new TLB entries will be created for
the specified private GPA range, though the existing TLB translations may
still persist. TDH.MEM.TRACK will then advance the TD's epoch counter to
ensure TDX module will flush TLBs in all vCPUs once the vCPUs re-enter
the TD. TDH.MEM.TRACK will fail to advance TD's epoch counter if there
are vCPUs still running in non-root mode at the previous TD epoch counter.
So to ensure private GPA translations are flushed, callers must first call
tdh_mem_range_block(), then tdh_mem_track(), and lastly send IPIs to kick
all the vCPUs and force them to re-enter, thus triggering the TLB flush.
Don't export a single operation and instead export functions that just
expose the block and track operations; this is for a couple reasons:
1. The vCPU kick should use KVM's functionality for doing this, which can better
target sending IPIs to only the minimum required pCPUs.
2. tdh_mem_track() doesn't need to be executed if a vCPU has not entered a TD,
which is information only KVM knows.
3. Leaving the operations separate will allow for batching many
tdh_mem_range_block() calls before a tdh_mem_track(). While this batching will
not be done initially by KVM, it demonstrates that keeping mem block and track
as separate operations is a generally good design.
Contentions are also possible in TDH.MEM.TRACK. For example, TDH.MEM.TRACK
may contend with TDH.VP.ENTER when advancing the TD epoch counter.
tdh_mem_track() does not provide the retries for the caller. Callers can
choose to avoid contentions or retry on their own.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073648.22143-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiaoyao Li [Wed, 30 Oct 2024 19:00:37 +0000 (12:00 -0700)]
KVM: x86: Introduce KVM_TDX_GET_CPUID
Implement an IOCTL to allow userspace to read the CPUID bit values for a
configured TD.
The TDX module doesn't provide the ability to set all CPUID bits. Instead
some are configured indirectly, or have fixed values. But it does allow
for the final resulting CPUID bits to be read. This information will be
useful for userspace to understand the configuration of the TD, and set
KVM's copy via KVM_SET_CPUID2.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix subleaf mask check (Binbin)
- Search all possible sub-leafs (Francesco Lavra)
- Reduce off-by-one error sensitve code (Francesco, Xiaoyao)
- Handle buffers too small from userspace (Xiaoyao)
- Read max CPUID from TD instead of using fixed values (Xiaoyao)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:36:36 +0000 (15:36 +0800)]
x86/virt/tdx: Add SEAMCALL wrappers to add TD private pages
TDX architecture introduces the concept of private GPA vs shared GPA,
depending on the GPA.SHARED bit. The TDX module maintains a Secure EPT
(S-EPT or SEPT) tree per TD to translate TD's private memory accessed
using a private GPA. Wrap the SEAMCALL TDH.MEM.PAGE.ADD with
tdh_mem_page_add() and TDH.MEM.PAGE.AUG with tdh_mem_page_aug() to add TD
private pages and map them to the TD's private GPAs in the SEPT.
Callers of tdh_mem_page_add() and tdh_mem_page_aug() allocate and provide
normal pages to the wrappers, who further pass those pages to the TDX
module. Before passing the pages to the TDX module, tdh_mem_page_add() and
tdh_mem_page_aug() perform a CLFLUSH on the page mapped with keyID 0 to
ensure that any dirty cache lines don't write back later and clobber TD
memory or control structures. Don't worry about the other MK-TME keyIDs
because the kernel doesn't use them. The TDX docs specify that this flush
is not needed unless the TDX module exposes the CLFLUSH_BEFORE_ALLOC
feature bit. Do the CLFLUSH unconditionally for two reasons: make the
solution simpler by having a single path that can handle both
!CLFLUSH_BEFORE_ALLOC and CLFLUSH_BEFORE_ALLOC cases. Avoid wading into any
correctness uncertainty by going with a conservative solution to start.
Call tdh_mem_page_add() to add a private page to a TD during the TD's build
time (i.e., before TDH.MR.FINALIZE). Specify which GPA the 4K private page
will map to. No need to specify level info since TDH.MEM.PAGE.ADD only adds
pages at 4K level. To provide initial contents to TD, provide an additional
source page residing in memory managed by the host kernel itself (encrypted
with a shared keyID). The TDX module will copy the initial contents from
the source page in shared memory into the private page after mapping the
page in the SEPT to the specified private GPA. The TDX module allows the
source page to be the same page as the private page to be added. In that
case, the TDX module converts and encrypts the source page as a TD private
page.
Call tdh_mem_page_aug() to add a private page to a TD during the TD's
runtime (i.e., after TDH.MR.FINALIZE). TDH.MEM.PAGE.AUG supports adding
huge pages. Specify which GPA the private page will map to, along with
level info embedded in the lower bits of the GPA. The TDX module will
recognize the added page as the TD's private page after the TD's acceptance
with TDCALL TDG.MEM.PAGE.ACCEPT.
tdh_mem_page_add() and tdh_mem_page_aug() may fail. Callers can check
function return value and retrieve extended error info from the function
output parameters.
The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs returns a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller
may need to handle this error in specific ways (e.g., retry). Therefore,
return the SEAMCALL error code directly to the caller. Don't attempt to
handle it in the core kernel.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073636.22129-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Wed, 30 Oct 2024 19:00:36 +0000 (12:00 -0700)]
KVM: TDX: Do TDX specific vcpu initialization
TD guest vcpu needs TDX specific initialization before running. Repurpose
KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
KVM_TDX_INIT_VCPU, and implement the callback for it.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/
(Sean)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Isaku Yamahata [Tue, 12 Nov 2024 07:36:24 +0000 (15:36 +0800)]
x86/virt/tdx: Add SEAMCALL wrapper tdh_mem_sept_add() to add SEPT pages
TDX architecture introduces the concept of private GPA vs shared GPA,
depending on the GPA.SHARED bit. The TDX module maintains a Secure EPT
(S-EPT or SEPT) tree per TD for private GPA to HPA translation. Wrap the
TDH.MEM.SEPT.ADD SEAMCALL with tdh_mem_sept_add() to provide pages to the
TDX module for building a TD's SEPT tree. (Refer to these pages as SEPT
pages).
Callers need to allocate and provide a normal page to tdh_mem_sept_add(),
which then passes the page to the TDX module via the SEAMCALL
TDH.MEM.SEPT.ADD. The TDX module then installs the page into SEPT tree and
encrypts this SEPT page with the TD's guest keyID. The kernel cannot use
the SEPT page until after reclaiming it via TDH.MEM.SEPT.REMOVE or
TDH.PHYMEM.PAGE.RECLAIM.
Before passing the page to the TDX module, tdh_mem_sept_add() performs a
CLFLUSH on the page mapped with keyID 0 to ensure that any dirty cache
lines don't write back later and clobber TD memory or control structures.
Don't worry about the other MK-TME keyIDs because the kernel doesn't use
them. The TDX docs specify that this flush is not needed unless the TDX
module exposes the CLFLUSH_BEFORE_ALLOC feature bit. Do the CLFLUSH
unconditionally for two reasons: make the solution simpler by having a
single path that can handle both !CLFLUSH_BEFORE_ALLOC and
CLFLUSH_BEFORE_ALLOC cases. Avoid wading into any correctness uncertainty
by going with a conservative solution to start.
Callers should specify "GPA" and "level" for the TDX module to install the
SEPT page at the specified position in the SEPT. Do not include the root
page level in "level" since TDH.MEM.SEPT.ADD can only add non-root pages to
the SEPT. Ensure "level" is between 1 and 3 for a 4-level SEPT or between 1
and 4 for a 5-level SEPT.
Call tdh_mem_sept_add() during the TD's build time or during the TD's
runtime. Check for errors from the function return value and retrieve
extended error info from the function output parameters.
The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs returns a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller
may need to handle this error in specific ways (e.g., retry). Therefore,
return the SEAMCALL error code directly to the caller. Don't attempt to
handle it in the core kernel.
TDH.MEM.SEPT.ADD effectively manages two internal resources of the TDX
module: it installs page table pages in the SEPT tree and also updates the
TDX module's page metadata (PAMT). Don't add a wrapper for the matching
SEAMCALL for removing a SEPT page (TDH.MEM.SEPT.REMOVE) because KVM, as the
only in-kernel user, will only tear down the SEPT tree when the TD is being
torn down. When this happens it can just do other operations that reclaim
the SEPT pages for the host kernels to use, update the PAMT and let the
SEPT get trashed.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <
20241112073624.22114-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>