linux-2.6-microblaze.git
3 months ago KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
Tom Lendacky [Thu, 10 Dec 2020 17:09:56 +0000 (11:09 -0600)]
KVM: SVM: Add support for CR0 write traps for an SEV-ES guest

For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.

SEV-ES support introduces new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.

Add support to track the value of the guest CR0 register using the control
register write trap so that the hypervisor understands the guest operating
mode.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <182c9baf99df7e40ad9617ff90b84542705ef0d7.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
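
Note on the CR0 write trap commit above: a minimal sketch of the tracking idea, not the KVM code. It assumes a hypothetical per-vCPU struct (tracked_vcpu) and a hypothetical handler name (cr0_write_trap); only the use of EXITINFO1 as the post-write CR0 value comes from the commit message. The CR0 bit positions are architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_PE (1ULL << 0)   /* protected mode enable */
#define X86_CR0_PG (1ULL << 31)  /* paging enable */

/* Hypothetical stand-in for the state the hypervisor keeps per vCPU. */
struct tracked_vcpu {
	uint64_t cr0;
};

/*
 * Hypothetical trap handler: the write has already happened in the
 * encrypted state, so the hypervisor only records the new value that
 * the hardware reports in EXITINFO1.
 */
static void cr0_write_trap(struct tracked_vcpu *vcpu, uint64_t exit_info_1)
{
	vcpu->cr0 = exit_info_1;
}

/* Derive one piece of the guest operating mode from the tracked value. */
static bool guest_paging_enabled(const struct tracked_vcpu *vcpu)
{
	return (vcpu->cr0 & X86_CR0_PG) != 0;
}
```

Unlike a classic intercept, nothing is written back to the guest here; the handler is purely an observer.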
3 months ago KVM: SVM: Add support for EFER write traps for an SEV-ES guest
Tom Lendacky [Thu, 10 Dec 2020 17:09:55 +0000 (11:09 -0600)]
KVM: SVM: Add support for EFER write traps for an SEV-ES guest

For SEV-ES guests, the interception of EFER write access is not
recommended. EFER interception occurs prior to EFER being modified and
the hypervisor is unable to modify EFER itself because the register is
located in the encrypted register state.

SEV-ES support introduces a new EFER write trap. This trap provides
intercept support of an EFER write after it has been modified. The new
EFER value is provided in the VMCB EXITINFO1 field, allowing the
hypervisor to track the setting of the guest EFER.

Add support to track the guest EFER value using the EFER write trap so
that the hypervisor understands the guest operating mode.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <8993149352a3a87cd0625b3b61bfd31ab28977e1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
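
Note on the EFER write trap commit above: a small sketch of why the tracked EFER matters for the operating mode. The function name is hypothetical; the rule it encodes (long mode is active when EFER.LME and CR0.PG are both set) is architectural and is not spelled out in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_PG (1ULL << 31)  /* paging enable */
#define EFER_LME   (1ULL << 8)   /* long mode enable */

/*
 * Hypothetical helper: both arguments are values the hypervisor has
 * tracked from write traps, since it cannot read the encrypted state.
 */
static bool guest_long_mode_active(uint64_t tracked_efer, uint64_t tracked_cr0)
{
	return (tracked_efer & EFER_LME) && (tracked_cr0 & X86_CR0_PG);
}
```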
3 months ago KVM: SVM: Support string IO operations for an SEV-ES guest
Tom Lendacky [Thu, 10 Dec 2020 17:09:54 +0000 (11:09 -0600)]
KVM: SVM: Support string IO operations for an SEV-ES guest

For an SEV-ES guest, string-based port IO is performed to a shared
(un-encrypted) page so that both the hypervisor and guest can read or
write to it and each see the contents.

For string-based port IO operations, invoke SEV-ES specific routines that
can complete the operation using common KVM port IO support.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <9d61daf0ffda496703717218f415cdc8fd487100.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Support MMIO for an SEV-ES guest
Tom Lendacky [Thu, 10 Dec 2020 17:09:53 +0000 (11:09 -0600)]
KVM: SVM: Support MMIO for an SEV-ES guest

For an SEV-ES guest, MMIO is performed to a shared (un-encrypted) page
so that both the hypervisor and guest can read or write to it and each
see the contents.

The GHCB specification provides software-defined VMGEXIT exit codes to
indicate a request for an MMIO read or an MMIO write. Add support to
recognize the MMIO requests and invoke SEV-ES specific routines that
can complete the MMIO operation. These routines use common KVM support
to complete the MMIO operation.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <af8de55127d5bcc3253d9b6084a0144c12307d4d.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
Tom Lendacky [Thu, 10 Dec 2020 17:09:52 +0000 (11:09 -0600)]
KVM: SVM: Create trace events for VMGEXIT MSR protocol processing

Add trace events for entry to and exit from VMGEXIT MSR protocol
processing. The vCPU id will be common for the trace events. The MSR
protocol processing is guided by the GHCB GPA in the VMCB, so the GHCB
GPA will represent the input and output values for the entry and exit
events, respectively. Additionally, the exit event will contain the
return code for the event.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c5b3b440c3e0db43ff2fc02813faa94fa54896b0.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Create trace events for VMGEXIT processing
Tom Lendacky [Thu, 10 Dec 2020 17:09:48 +0000 (11:09 -0600)]
KVM: SVM: Create trace events for VMGEXIT processing

Add trace events for entry to and exit from VMGEXIT processing. The vCPU
id and the exit reason will be common for the trace events. The exit info
fields will represent the input and output values for the entry and exit
events, respectively.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <25357dca49a38372e8f483753fb0c1c2a70a6898.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Add support for SEV-ES GHCB MSR protocol function 0x100
Tom Lendacky [Thu, 10 Dec 2020 17:09:51 +0000 (11:09 -0600)]
KVM: SVM: Add support for SEV-ES GHCB MSR protocol function 0x100

The GHCB specification defines a GHCB MSR protocol using the lower
12 bits of the GHCB MSR (in the hypervisor this corresponds to the
GHCB GPA field in the VMCB).

Function 0x100 is a request for termination of the guest. The guest has
encountered some situation for which it has requested to be terminated.
The GHCB MSR value contains the reason for the request.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f3a1f7850c75b6ea4101e15bbb4a3af1a203f1dc.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
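
Note on function 0x100 above: a sketch of decoding a termination request from the GHCB MSR value. The commit message only says the value "contains the reason"; the field positions used here (reason code set in bits 15:12, reason code in bits 23:16) are my reading of the GHCB specification and should be checked against it. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GHCB_MSR_INFO_MASK 0xfffULL  /* lower 12 bits select the function */
#define GHCB_MSR_TERM_REQ  0x100ULL

/* Hypothetical container for the decoded request. */
struct term_request {
	unsigned reason_set;   /* assumed bits 15:12 */
	unsigned reason_code;  /* assumed bits 23:16 */
};

/* Returns false if the MSR value is not a termination request. */
static bool decode_term_request(uint64_t ghcb_msr, struct term_request *out)
{
	if ((ghcb_msr & GHCB_MSR_INFO_MASK) != GHCB_MSR_TERM_REQ)
		return false;

	out->reason_set = (ghcb_msr >> 12) & 0xf;
	out->reason_code = (ghcb_msr >> 16) & 0xff;
	return true;
}
```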
3 months ago KVM: SVM: Add support for SEV-ES GHCB MSR protocol function 0x004
Tom Lendacky [Thu, 10 Dec 2020 17:09:50 +0000 (11:09 -0600)]
KVM: SVM: Add support for SEV-ES GHCB MSR protocol function 0x004

The GHCB specification defines a GHCB MSR protocol using the lower
12 bits of the GHCB MSR (in the hypervisor this corresponds to the
GHCB GPA field in the VMCB).

Function 0x004 is a request for CPUID information. Only a single CPUID
result register can be sent per invocation, so the protocol defines the
register that is requested. The GHCB MSR value is set to the CPUID
register value as per the specification via the VMCB GHCB GPA field.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <fd7ee347d3936e484c06e9001e340bf6387092cd.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
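
Note on function 0x004 above: a sketch of the single-register CPUID exchange. The layout assumed here (CPUID function in bits 63:32, requested register in bits 31:30 with 0=EAX through 3=EDX, response code 0x005 with the register value in bits 63:32) is my reading of the GHCB specification, not something the commit message states. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GHCB_MSR_INFO_MASK  0xfffULL
#define GHCB_MSR_CPUID_REQ  0x004ULL
#define GHCB_MSR_CPUID_RESP 0x005ULL  /* assumed response code */

/* CPUID leaf the guest asks for (assumed bits 63:32 of the request). */
static uint32_t cpuid_req_function(uint64_t req)
{
	return (uint32_t)(req >> 32);
}

/* Which result register is wanted (assumed bits 31:30). */
static unsigned cpuid_req_register(uint64_t req)
{
	return (unsigned)((req >> 30) & 0x3);
}

/* Build the response: one register value per invocation. */
static uint64_t cpuid_response(uint64_t req, uint32_t value)
{
	uint64_t reg = cpuid_req_register(req);

	return ((uint64_t)value << 32) | (reg << 30) | GHCB_MSR_CPUID_RESP;
}
```

A guest that needs all four result registers would therefore issue four requests, one per register.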
3 months ago KVM: SVM: Add support for SEV-ES GHCB MSR protocol function 0x002
Tom Lendacky [Thu, 10 Dec 2020 17:09:49 +0000 (11:09 -0600)]
KVM: SVM: Add support for SEV-ES GHCB MSR protocol function 0x002

The GHCB specification defines a GHCB MSR protocol using the lower
12 bits of the GHCB MSR (in the hypervisor this corresponds to the
GHCB GPA field in the VMCB).

Function 0x002 is a request to set the GHCB MSR value to the SEV INFO as
per the specification via the VMCB GHCB GPA field.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c23c163a505290a0d1b9efc4659b838c8c902cbc.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
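
Note on function 0x002 above: a sketch of packing an SEV INFO response. The commit message defers to the specification for the format; the layout assumed here (response code 0x001, maximum protocol version in bits 63:48, minimum in bits 47:32, encryption bit position in bits 31:24) is my reading of the GHCB specification. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GHCB_MSR_SEV_INFO_RESP 0x001ULL  /* assumed response code */

/* Pack the assumed SEV INFO fields into a GHCB MSR value. */
static uint64_t sev_info_response(uint16_t max_version, uint16_t min_version,
				  uint8_t cbit_position)
{
	return ((uint64_t)max_version << 48) |
	       ((uint64_t)min_version << 32) |
	       ((uint64_t)cbit_position << 24) |
	       GHCB_MSR_SEV_INFO_RESP;
}
```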
3 months ago KVM: SVM: Add initial support for a VMGEXIT VMEXIT
Tom Lendacky [Thu, 10 Dec 2020 17:09:47 +0000 (11:09 -0600)]
KVM: SVM: Add initial support for a VMGEXIT VMEXIT

SEV-ES adds a new VMEXIT reason code, VMGEXIT. Initial support for a
VMGEXIT includes mapping the GHCB based on the guest GPA, which is
obtained from a new VMCB field, and then validating the required inputs
for the VMGEXIT exit reason.

Since many of the VMGEXIT exit reasons correspond to existing VMEXIT
reasons, the information from the GHCB is copied into the VMCB control
exit code areas and KVM register areas. The standard exit handlers are
invoked, similar to standard VMEXIT processing. Before restarting the
vCPU, the GHCB is updated with any registers that have been updated by
the hypervisor.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c6a4ed4294a369bd75c44d03bd7ce0f0c3840e50.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Prepare for SEV-ES exit handling in the sev.c file
Tom Lendacky [Thu, 10 Dec 2020 17:09:46 +0000 (11:09 -0600)]
KVM: SVM: Prepare for SEV-ES exit handling in the sev.c file

This is a pre-patch to consolidate some exit handling code into callable
functions. Follow-on patches for SEV-ES exit handling will then be able
to use them from the sev.c file.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <5b8b0ffca8137f3e1e257f83df9f5c881c8a96a3.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Cannot re-initialize the VMCB after shutdown with SEV-ES
Tom Lendacky [Thu, 10 Dec 2020 17:09:45 +0000 (11:09 -0600)]
KVM: SVM: Cannot re-initialize the VMCB after shutdown with SEV-ES

When a SHUTDOWN VMEXIT is encountered, normally the VMCB is re-initialized
so that the guest can be re-launched. But when a guest is running as an
SEV-ES guest, the VMSA cannot be re-initialized because it has been
encrypted. For now, just return -EINVAL to prevent a possible attempt at
a guest reset.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <aa6506000f6f3a574de8dbcdab0707df844cb00c.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Do not allow instruction emulation under SEV-ES
Tom Lendacky [Thu, 10 Dec 2020 17:09:44 +0000 (11:09 -0600)]
KVM: SVM: Do not allow instruction emulation under SEV-ES

When a guest is running as an SEV-ES guest, it is not possible to emulate
instructions. Add support to prevent instruction emulation.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f6355ea3024fda0a3eb5eb99c6b62dca10d792bd.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Prevent debugging under SEV-ES
Tom Lendacky [Thu, 10 Dec 2020 17:09:43 +0000 (11:09 -0600)]
KVM: SVM: Prevent debugging under SEV-ES

Since the guest register state of an SEV-ES guest is encrypted, debugging
is not supported. Update the code to prevent guest debugging when the
guest has protected state.

Additionally, an SEV-ES guest must only and always intercept DR7 reads and
writes. Update set_dr_intercepts() and clr_dr_intercepts() to account for
this.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <8db966fa2f9803d6454ce773863025d0e2e7f3cc.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Add required changes to support intercepts under SEV-ES
Tom Lendacky [Mon, 14 Dec 2020 15:29:50 +0000 (10:29 -0500)]
KVM: SVM: Add required changes to support intercepts under SEV-ES

When a guest is running under SEV-ES, the hypervisor cannot access the
guest register state. There are numerous places in the KVM code where
certain registers are accessed that are not allowed to be accessed (e.g.
RIP, CR0, etc). Add checks to prevent register accesses and add intercept
update support at various points within the KVM code.

Also, when handling a VMGEXIT, exceptions are passed back through the
GHCB. Since the RDMSR/WRMSR intercepts (may) inject a #GP on error,
update the SVM intercepts to handle this for SEV-ES guests.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[Redo MSR part using the .complete_emulated_msr callback. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: x86: introduce complete_emulated_msr callback
Paolo Bonzini [Mon, 14 Dec 2020 15:26:51 +0000 (10:26 -0500)]
KVM: x86: introduce complete_emulated_msr callback

This will be used by SEV-ES to inject MSR failure via the GHCB.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: x86: use kvm_complete_insn_gp in emulating RDMSR/WRMSR
Paolo Bonzini [Mon, 14 Dec 2020 12:44:46 +0000 (07:44 -0500)]
KVM: x86: use kvm_complete_insn_gp in emulating RDMSR/WRMSR

Simplify the four functions that handle {kernel,user} {rd,wr}msr. There
is still some repetition between the two instances of rdmsr, but the
whole business of calling kvm_inject_gp and kvm_skip_emulated_instruction
can be unified nicely.

Because complete_emulated_wrmsr now becomes essentially a call to
kvm_complete_insn_gp, remove complete_emulated_msr.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
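
Note on the kvm_complete_insn_gp commit above: a userspace model of the "inject #GP on error, otherwise skip the instruction" pattern that the commit unifies. The mock_vcpu struct and the explicit instruction-length parameter are hypothetical simplifications; the real helper operates on struct kvm_vcpu and is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified vCPU: just enough state to see the effect. */
struct mock_vcpu {
	uint64_t rip;
	int gp_injected;  /* number of #GP faults queued for the guest */
};

/*
 * Model of the completion step: on error the guest gets a #GP and RIP
 * stays on the faulting instruction; on success RIP moves past it.
 */
static int complete_insn_gp(struct mock_vcpu *vcpu, int err, unsigned insn_len)
{
	if (err)
		vcpu->gp_injected++;
	else
		vcpu->rip += insn_len;
	return 1;  /* keep running the guest */
}
```

RDMSR and WRMSR are both two-byte instructions, so a caller modelling them would pass an insn_len of 2.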
3 months ago KVM: x86: remove bogus #GP injection
Paolo Bonzini [Mon, 14 Dec 2020 12:59:15 +0000 (07:59 -0500)]
KVM: x86: remove bogus #GP injection

There is no need to inject a #GP from kvm_mtrr_set_msr, kvm_emulate_wrmsr will
handle it.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: x86: Mark GPRs dirty when written
Tom Lendacky [Thu, 10 Dec 2020 17:09:41 +0000 (11:09 -0600)]
KVM: x86: Mark GPRs dirty when written

When performing VMGEXIT processing for an SEV-ES guest, register values
will be synced between KVM and the GHCB. Prepare for detecting when a GPR
has been updated (marked dirty) in order to determine whether to sync the
register to the GHCB.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <7ca2a1cdb61456f2fe9c64193e34d601e395c133.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
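
Note on the dirty GPR commit above: a sketch of the idea of marking a register dirty on write and syncing only dirty registers to the GHCB. The register enum, struct layout, and function names are hypothetical; the kernel uses its own register cache and bitmaps.

```c
#include <assert.h>
#include <stdint.h>

enum { REG_RAX, REG_RBX, REG_RCX, REG_RDX, NR_REGS };

/* Hypothetical register cache with one dirty bit per register. */
struct vcpu_regs {
	uint64_t regs[NR_REGS];
	uint32_t dirty;  /* bit n set: regs[n] written since the last sync */
};

/* Every write goes through here so the dirty bit cannot be forgotten. */
static void reg_write(struct vcpu_regs *v, int reg, uint64_t val)
{
	v->regs[reg] = val;
	v->dirty |= 1u << reg;
}

/* Copy only the dirty registers out, then clear the dirty mask. */
static int sync_dirty_to_ghcb(struct vcpu_regs *v, uint64_t *ghcb_regs)
{
	int synced = 0;

	for (int reg = 0; reg < NR_REGS; reg++) {
		if (v->dirty & (1u << reg)) {
			ghcb_regs[reg] = v->regs[reg];
			synced++;
		}
	}
	v->dirty = 0;
	return synced;
}
```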
3 months ago KVM: SVM: Add support for the SEV-ES VMSA
Tom Lendacky [Thu, 10 Dec 2020 17:09:40 +0000 (11:09 -0600)]
KVM: SVM: Add support for the SEV-ES VMSA

Allocate a page during vCPU creation to be used as the encrypted VM save
area (VMSA) for the SEV-ES guest. Provide a flag in the kvm_vcpu_arch
structure that indicates whether the guest state is protected.

When freeing a VMSA page that has been encrypted, the cache contents must
be flushed using the MSR_AMD64_VM_PAGE_FLUSH before freeing the page.

[ i386 build warnings ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <fde272b17eec804f3b9db18c131262fe074015c5.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Add GHCB accessor functions for retrieving fields
Tom Lendacky [Thu, 10 Dec 2020 17:09:39 +0000 (11:09 -0600)]
KVM: SVM: Add GHCB accessor functions for retrieving fields

Update the GHCB accessor functions to add functions for retrieving GHCB
fields by name. Update existing code to use the new accessor functions.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <664172c53a5fb4959914e1a45d88e805649af0ad.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Add support for SEV-ES capability in KVM
Tom Lendacky [Thu, 10 Dec 2020 17:09:38 +0000 (11:09 -0600)]
KVM: SVM: Add support for SEV-ES capability in KVM

Add support to KVM for determining if a system is capable of supporting
SEV-ES as well as determining if a guest is an SEV-ES guest.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e66792323982c822350e40c7a1cf67ea2978a70b.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: SVM: Remove the call to sev_platform_status() during setup
Tom Lendacky [Thu, 10 Dec 2020 17:09:37 +0000 (11:09 -0600)]
KVM: SVM: Remove the call to sev_platform_status() during setup

When both KVM support and the CCP driver are built into the kernel instead
of as modules, KVM initialization can happen before CCP initialization. As
a result, sev_platform_status() will return a failure when it is called
from sev_hardware_setup(), even though this isn't really an error condition.

Since sev_platform_status() doesn't need to be called at this time anyway,
remove the invocation from sev_hardware_setup().

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <618380488358b56af558f2682203786f09a49483.1607620209.git.thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago x86/cpu: Add VM page flush MSR availablility as a CPUID feature
Tom Lendacky [Thu, 10 Dec 2020 17:09:36 +0000 (11:09 -0600)]
x86/cpu: Add VM page flush MSR availablility as a CPUID feature

On systems that do not have hardware enforced cache coherency between
encrypted and unencrypted mappings of the same physical page, the
hypervisor can use the VM page flush MSR (0xc001011e) to flush the cache
contents of an SEV guest page. When a small number of pages are being
flushed, this can be used in place of issuing a WBINVD across all CPUs.

CPUID 0x8000001f_eax[2] is used to determine if the VM page flush MSR is
available. Add a CPUID feature to indicate it is supported and define the
MSR.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f1966379e31f9b208db5257509c4a089a87d33d0.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
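
Note on the VM page flush commit above: a sketch of the feature test described in the message. The function takes the EAX value of CPUID leaf 0x8000001f as a parameter rather than executing CPUID, so it stays portable; the function name is hypothetical, while the bit number and MSR address come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_AMD64_VM_PAGE_FLUSH 0xc001011eU

/* CPUID 0x8000001f EAX bit 2 reports that the VM page flush MSR exists. */
static bool vm_page_flush_msr_available(uint32_t cpuid_8000001f_eax)
{
	return (cpuid_8000001f_eax >> 2) & 1;
}
```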
3 months ago KVM/VMX/SVM: Move kvm_machine_check function to x86.h
Uros Bizjak [Thu, 29 Oct 2020 13:56:00 +0000 (14:56 +0100)]
KVM/VMX/SVM: Move kvm_machine_check function to x86.h

Move kvm_machine_check to x86.h to avoid two exact copies
of the same function in vmx.c and svm.c.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201029135600.122392-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago Merge tag 'kvm-s390-next-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Paolo Bonzini [Sat, 12 Dec 2020 08:58:31 +0000 (03:58 -0500)]
Merge tag 'kvm-s390-next-5.11-1' of git://git./linux/kernel/git/kvms390/linux into HEAD

KVM: s390: Features and Test for 5.11

- memcg accounting for s390 specific parts of kvm and gmap
- selftest for diag318
- new kvm_stat for when async_pf falls back to sync

The selftest even triggers a non-critical bug that is unrelated
to diag318, fix will follow later.

3 months ago KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits
Paolo Bonzini [Thu, 3 Dec 2020 14:40:15 +0000 (09:40 -0500)]
KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits

Until commit e7c587da1252 ("x86/speculation: Use synthetic bits for
IBRS/IBPB/STIBP"), KVM was testing both Intel and AMD CPUID bits before
allowing the guest to write MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD.
Testing only Intel bits on VMX processors, or only AMD bits on SVM
processors, fails if the guests are created with the "opposite" vendor
as the host.

While at it, also tweak the host CPU check to use the vendor-agnostic
feature bit X86_FEATURE_IBPB, since we only care about the availability
of the MSR on the host here and not about specific CPUID bits.

Fixes: e7c587da1252 ("x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP")
Cc: stable@vger.kernel.org
Reported-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM: x86: Expose AVX512_FP16 for supported CPUID
Cathy Zhang [Tue, 8 Dec 2020 03:34:41 +0000 (19:34 -0800)]
KVM: x86: Expose AVX512_FP16 for supported CPUID

AVX512_FP16 is supported by Intel processors such as Sapphire Rapids.
It can give better performance because it is faster than FP32 when the
precision or magnitude requirements are met. Its availability is
indicated by CPUID.(EAX=7,ECX=0):EDX[bit 23].

Expose it in KVM supported CPUID so that guests can make use of it; no
new registers are used, only new instructions.

Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20201208033441.28207-3-kyung.min.park@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago x86: Enumerate AVX512 FP16 CPUID feature flag
Kyung Min Park [Tue, 8 Dec 2020 03:34:40 +0000 (19:34 -0800)]
x86: Enumerate AVX512 FP16 CPUID feature flag

Enumerate the AVX512 half-precision floating point (FP16) CPUID feature
flag. Compared with FP32, FP16 cuts the number of bits required for
storage in half, reducing the exponent from 8 bits to 5 and the mantissa
from 23 bits to 10. FP16 also lets developers train and run inference on
deep learning models faster when the full precision or magnitude of FP32
is not needed.

A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23]
is set. AVX512 FP16 requires the AVX512BW feature to be implemented,
since the instructions for manipulating 32-bit masks are associated with
AVX512BW.

The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx512_fp16" in /proc/cpuinfo.

Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20201208033441.28207-2-kyung.min.park@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
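
Note on the AVX512 FP16 commit above: a sketch of the two facts in the message, the 1/5/10 bit split of an FP16 value and the CPUID bit that advertises the feature. The struct and function names are hypothetical, and the EDX value is passed in as a parameter rather than read with CPUID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The three fields of a 16-bit half-precision value. */
struct fp16_fields {
	unsigned sign;      /* 1 bit */
	unsigned exponent;  /* 5 bits (FP32 has 8) */
	unsigned mantissa;  /* 10 bits (FP32 has 23) */
};

static struct fp16_fields fp16_split(uint16_t bits)
{
	struct fp16_fields f;

	f.sign = bits >> 15;
	f.exponent = (bits >> 10) & 0x1f;
	f.mantissa = bits & 0x3ff;
	return f;
}

/* CPUID.(EAX=7,ECX=0):EDX bit 23 indicates AVX512 FP16 support. */
static bool has_avx512_fp16(uint32_t cpuid_7_0_edx)
{
	return (cpuid_7_0_edx >> 23) & 1;
}
```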
3 months ago selftests: kvm: Merge user_msr_test into userspace_msr_exit_test
Aaron Lewis [Fri, 4 Dec 2020 17:25:31 +0000 (09:25 -0800)]
selftests: kvm: Merge user_msr_test into userspace_msr_exit_test

Both user_msr_test and userspace_msr_exit_test test the functionality
of kvm_msr_filter.  Instead of testing this feature in two tests, merge
them together, so there is only one test for this feature.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Message-Id: <20201204172530.2958493-1-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago selftests: kvm: Test MSR exiting to userspace
Aaron Lewis [Mon, 12 Oct 2020 19:47:16 +0000 (12:47 -0700)]
selftests: kvm: Test MSR exiting to userspace

Add a selftest to test that when the ioctl KVM_X86_SET_MSR_FILTER is
called with an MSR list, those MSRs exit to userspace.

This test uses 3 MSRs to test this:
  1. MSR_IA32_XSS, an MSR the kernel knows about.
  2. MSR_IA32_FLUSH_CMD, an MSR the kernel does not know about.
  3. MSR_NON_EXISTENT, an MSR invented in this test for the purposes of
     passing a fake MSR from the guest to userspace.  KVM just acts as a
     pass through.

Userspace is also able to inject a #GP.  This is demonstrated when
MSR_IA32_XSS and MSR_IA32_FLUSH_CMD are misused in the test.  When this
happens a #GP is initiated in userspace to be thrown in the guest which is
handled gracefully by the exception handling framework introduced earlier
in this series.

Tests for the generic instruction emulator were also added.  For this to
work the module parameter kvm.force_emulation_prefix=1 has to be enabled.
If it isn't enabled the tests will be skipped.

A test was also added to ensure the MSR permission bitmap is being set
correctly by executing reads and writes of MSR_FS_BASE and MSR_GS_BASE
in the guest while alternating which MSR userspace should intercept.  If
the permission bitmap is being set correctly only one of the MSRs should
be coming through at a time, and the guest should be able to read and
write the other one directly.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Alexander Graf <graf@amazon.com>
Message-Id: <20201012194716.3950330-5-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3 months ago KVM/VMX: Use TEST %REG,%REG instead of CMP $0,%REG in vmenter.S
Uros Bizjak [Thu, 29 Oct 2020 14:04:57 +0000 (15:04 +0100)]
KVM/VMX: Use TEST %REG,%REG instead of CMP $0,%REG in vmenter.S

Saves one byte in __vmx_vcpu_run for the same functionality.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201029140457.126965-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
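
Note on the TEST vs CMP commit above: an illustration of where the saved byte comes from, using an 8-bit register as the example operand. The choice of %bl is an assumption for illustration; the commit message does not name the register. The byte sequences are the standard x86 encodings: CMP with an immediate has to carry the zero as an extra byte, while TEST of a register against itself needs no immediate. Both set ZF exactly when the register is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* cmpb $0, %bl : opcode 0x80 /7, ModRM 0xfb, then the 8-bit immediate */
static const uint8_t cmpb_zero_bl[] = { 0x80, 0xfb, 0x00 };

/* testb %bl, %bl : opcode 0x84 /r, ModRM 0xdb, no immediate */
static const uint8_t testb_bl_bl[] = { 0x84, 0xdb };

/* Size difference between the two encodings, in bytes. */
static size_t bytes_saved(void)
{
	return sizeof(cmpb_zero_bl) - sizeof(testb_bl_bl);
}
```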
3 months ago KVM: s390: track synchronous pfault events in kvm_stat
Christian Borntraeger [Wed, 25 Nov 2020 09:06:58 +0000 (10:06 +0100)]
KVM: s390: track synchronous pfault events in kvm_stat

Right now we count pfault (pseudo page fault, aka async page fault)
start and completion events. What we do not count are the cases where an
async page fault would have been possible on the host but was disabled by
the guest (e.g. interrupts off, pfault disabled, secure execution, ...).
Let us count those as well in the pfault_sync counter.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/20201125090658.38463-1-borntraeger@de.ibm.com
3 months ago KVM: selftests: sync_regs test for diag318
Collin Walling [Mon, 7 Dec 2020 15:41:25 +0000 (10:41 -0500)]
KVM: selftests: sync_regs test for diag318

The DIAGNOSE 0x0318 instruction, unique to s390x, is a privileged call
that must be intercepted via SIE and handled in userspace; the
information set by the instruction is then communicated back to KVM.

To test the instruction interception, an ad-hoc handler is defined which
simply has a VM execute the instruction and then userspace will extract
the necessary info. The handler is defined such that the instruction
invocation occurs only once. It is up to the caller to determine how the
info returned by this handler should be used.

The diag318 info is communicated from userspace to KVM via a sync_regs
call. This is tested during a sync_regs test, where the diag318 info is
requested via the handler, then the info is stored in the appropriate
register in KVM via a sync registers call.

If KVM does not support diag318, then the tests will print a message
stating that diag318 was skipped, and the asserts will simply test
against a value of 0.

Signed-off-by: Collin Walling <walling@linux.ibm.com>
Link: https://lore.kernel.org/r/20201207154125.10322-1-walling@linux.ibm.com
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
3 months ago s390/gmap: make gmap memcg aware
Christian Borntraeger [Mon, 9 Nov 2020 12:14:35 +0000 (13:14 +0100)]
s390/gmap: make gmap memcg aware

gmap allocations can be attributed to a process.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
3 months ago KVM: s390: Add memcg accounting to KVM allocations
Christian Borntraeger [Fri, 6 Nov 2020 07:34:23 +0000 (08:34 +0100)]
KVM: s390: Add memcg accounting to KVM allocations

Almost all kvm allocations in the s390x KVM code can be attributed to
the process that triggers the allocation (in other words, no global
allocation for other guests). This will help the memcg controller to
make the right decisions.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
3 months ago KVM: x86: ignore SIPIs that are received while not in wait-for-sipi state
Maxim Levitsky [Thu, 3 Dec 2020 14:33:19 +0000 (16:33 +0200)]
KVM: x86: ignore SIPIs that are received while not in wait-for-sipi state

In commit 1c96dcceaeb3 ("KVM: x86: fix apic_accept_events vs
check_nested_events"), we accidentally started latching SIPIs that are
received while the CPU is not waiting for them.

This causes vCPUs to never enter a halted state.

Fixes: 1c96dcceaeb3 ("KVM: x86: fix apic_accept_events vs check_nested_events")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201203143319.159394-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
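
Note on the SIPI commit above: a sketch of the intended behaviour, in which a pending SIPI is always consumed but only acted on when the vCPU is in the wait-for-SIPI state. The enum, struct, and function names are hypothetical and much simpler than the real local APIC emulation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced set of vCPU multiprocessing states. */
enum mp_state { MP_RUNNABLE, MP_HALTED, MP_INIT_RECEIVED };

struct mock_apic {
	enum mp_state state;
	bool sipi_pending;
	uint8_t sipi_vector;
	uint8_t start_vector;  /* vector the vCPU was actually started with */
};

/*
 * Consume a pending SIPI. It is dropped unless the vCPU is waiting for
 * one, so a stale SIPI cannot stay latched and keep the vCPU from
 * halting.
 */
static void accept_pending_sipi(struct mock_apic *apic)
{
	if (!apic->sipi_pending)
		return;

	apic->sipi_pending = false;  /* never leave it latched */
	if (apic->state == MP_INIT_RECEIVED) {
		apic->start_vector = apic->sipi_vector;
		apic->state = MP_RUNNABLE;
	}
}
```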
4 months ago KVM: x86: adjust SEV for commit 7e8e6eed75e
Paolo Bonzini [Mon, 30 Nov 2020 14:39:59 +0000 (09:39 -0500)]
KVM: x86: adjust SEV for commit 7e8e6eed75e

Since the ASID is now stored in svm->asid, pre_sev_run should also place
it there and not directly in the VMCB control area.

Reported-by: Ashish Kalra <Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months ago KVM: nSVM: set fixed bits by hand
Paolo Bonzini [Fri, 27 Nov 2020 17:46:36 +0000 (12:46 -0500)]
KVM: nSVM: set fixed bits by hand

SVM generally ignores fixed-1 bits.  Set them manually so that we
do not accidentally end up without those bits set in struct kvm_vcpu;
it is part of the userspace API that KVM always returns values with the
bits set.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months ago kvm: x86/mmu: Add TDP MMU SPTE changed trace point
Ben Gardon [Tue, 27 Oct 2020 17:59:44 +0000 (10:59 -0700)]
kvm: x86/mmu: Add TDP MMU SPTE changed trace point

Add an extremely verbose trace point to the TDP MMU to log all SPTE
changes, regardless of callstack / motivation. This is useful when a
complete picture of the paging structure is needed or a change cannot be
explained with the other, existing trace points.

Tested: ran the demand paging selftest on an Intel Skylake machine with
all the trace points used by the TDP MMU enabled and observed
them firing with expected values.

This patch can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/3813

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20201027175944.1183301-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months ago kvm: x86/mmu: Add existing trace points to TDP MMU
Ben Gardon [Tue, 27 Oct 2020 17:59:43 +0000 (10:59 -0700)]
kvm: x86/mmu: Add existing trace points to TDP MMU

The TDP MMU was initially implemented without some of the usual
tracepoints found in mmu.c. Correct this discrepancy by adding the
missing trace points to the TDP MMU.

Tested: ran the demand paging selftest on an Intel Skylake machine with
all the trace points used by the TDP MMU enabled and observed
them firing with expected values.

This patch can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/3812

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20201027175944.1183301-1-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months ago KVM: SVM: check CR4 changes against vcpu->arch
Paolo Bonzini [Sun, 15 Nov 2020 14:44:18 +0000 (09:44 -0500)]
KVM: SVM: check CR4 changes against vcpu->arch

Similarly to what vmx/vmx.c does, use vcpu->arch.cr4 to check if CR4
bits PGE, PKE and OSXSAVE have changed.  When switching between VMCB01
and VMCB02, CPUID has to be adjusted every time if CR4.PKE or CR4.OSXSAVE
change; without this patch, instead, CR4 would be checked against the
previous value for L2 on vmentry, and against the previous value for
L1 on vmexit, and CPUID would not be updated.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
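
Note on the CR4 commit above: a sketch of the comparison the message describes, XORing the previous CR4 value with the new one and testing the bits of interest. The function name is hypothetical; the important point from the commit is that the "old" value must be the one tracked for the vCPU, not whatever the currently loaded VMCB holds. The CR4 bit positions are architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_PGE     (1ULL << 7)
#define X86_CR4_OSXSAVE (1ULL << 18)
#define X86_CR4_PKE     (1ULL << 22)

/*
 * CPUID output depends on CR4.OSXSAVE and CR4.PKE, so it has to be
 * refreshed whenever either bit differs between old and new CR4.
 */
static bool cr4_change_needs_cpuid_update(uint64_t old_cr4, uint64_t new_cr4)
{
	return ((old_cr4 ^ new_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE)) != 0;
}
```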
4 months ago KVM: SVM: Move asid to vcpu_svm
Cathy Avery [Sun, 11 Oct 2020 18:48:17 +0000 (14:48 -0400)]
KVM: SVM: Move asid to vcpu_svm

KVM does not have separate ASIDs for L1 and L2; either the nested
hypervisor and nested guests share a single ASID, or on older processors
the ASID is used only to implement TLB flushing.

Either way, ASIDs are handled at the VM level.  In preparation
for having different VMCBs passed to VMLOAD/VMRUN/VMSAVE for L1 and
L2, store the current ASID to struct vcpu_svm and only move it to
the VMCB in svm_vcpu_run.  This way, TLB flushes can be applied
no matter which VMCB will be active during the next svm_vcpu_run.

Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20201011184818.3609-2-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
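
Note on the ASID commit above: a sketch of keeping the ASID in the per-vCPU structure and copying it into whichever VMCB is active just before the guest runs. The mock structs and the function name are hypothetical and carry only the fields needed to show the idea.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced control block: only the ASID field. */
struct mock_vmcb {
	uint32_t asid;
};

/* Hypothetical per-vCPU state with one VMCB for L1 and one for L2. */
struct mock_vcpu_svm {
	uint32_t asid;             /* single source of truth */
	struct mock_vmcb vmcb01;   /* used while running L1 */
	struct mock_vmcb vmcb02;   /* used while running L2 */
	struct mock_vmcb *active;  /* whichever VMCB runs next */
};

/* Right before entering the guest, push the ASID into the active VMCB. */
static void prepare_run(struct mock_vcpu_svm *svm)
{
	svm->active->asid = svm->asid;
}
```

Because the copy happens at run time, an ASID change made while one VMCB is active is not lost when the other VMCB is the one that runs next.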
4 months agox86/kvm: remove unused macro HV_CLOCK_SIZE
Alex Shi [Fri, 6 Nov 2020 08:39:23 +0000 (16:39 +0800)]
x86/kvm: remove unused macro HV_CLOCK_SIZE

This macro is unused and can trigger a gcc warning:
arch/x86/kernel/kvmclock.c:47:0: warning: macro "HV_CLOCK_SIZE" is not
used [-Wunused-macros]
Let's remove it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <1604651963-10067-1-git-send-email-alex.shi@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: x86: Set supported CPUIDs on default VM
Andrew Jones [Wed, 11 Nov 2020 12:26:35 +0000 (13:26 +0100)]
KVM: selftests: x86: Set supported CPUIDs on default VM

Almost all tests do this anyway and the ones that don't don't
appear to care. Only vmx_set_nested_state_test assumes that
a feature (VMX) is disabled until later setting the supported
CPUIDs. It's better to disable that explicitly anyway.

Signed-off-by: Andrew Jones <drjones@redhat.com>
Message-Id: <20201111122636.73346-11-drjones@redhat.com>
[Restore CPUID_VMX, or vmx_set_nested_state breaks. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: Make test skipping consistent
Andrew Jones [Wed, 11 Nov 2020 12:26:36 +0000 (13:26 +0100)]
KVM: selftests: Make test skipping consistent

Signed-off-by: Andrew Jones <drjones@redhat.com>
Message-Id: <20201111122636.73346-12-drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: Also build dirty_log_perf_test on AArch64
Andrew Jones [Wed, 11 Nov 2020 12:26:34 +0000 (13:26 +0100)]
KVM: selftests: Also build dirty_log_perf_test on AArch64

Signed-off-by: Andrew Jones <drjones@redhat.com>
Message-Id: <20201111122636.73346-10-drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: Introduce vm_create_[default_]_with_vcpus
Andrew Jones [Wed, 11 Nov 2020 12:26:30 +0000 (13:26 +0100)]
KVM: selftests: Introduce vm_create_[default_]_with_vcpus

Introduce new vm_create variants that also take a number of vcpus,
an amount of per-vcpu pages, and optionally a list of vcpuids. These
variants will create default VMs with enough additional pages to
cover the vcpu stacks, per-vcpu pages, and pagetable pages for all.
The new 'default' variant uses VM_MODE_DEFAULT, whereas the other
new variant accepts the mode as a parameter.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Message-Id: <20201111122636.73346-6-drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: Make vm_create_default common
Andrew Jones [Wed, 11 Nov 2020 12:26:29 +0000 (13:26 +0100)]
KVM: selftests: Make vm_create_default common

The code is almost 100% the same anyway. Just move it to common
and add a few arch-specific macros.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Message-Id: <20201111122636.73346-5-drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: always use manual clear in dirty_log_perf_test
Paolo Bonzini [Fri, 13 Nov 2020 16:36:49 +0000 (11:36 -0500)]
KVM: selftests: always use manual clear in dirty_log_perf_test

Nothing sets USE_CLEAR_DIRTY_LOG anymore, so anything it surrounds
is dead code.

However, manual clear is the recommended way to use the dirty page
bitmap on new enough kernels, so use it whenever KVM has the
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 capability.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agokvm: x86: Sink cpuid update into vendor-specific set_cr4 functions
Jim Mattson [Thu, 29 Oct 2020 17:06:48 +0000 (10:06 -0700)]
kvm: x86: Sink cpuid update into vendor-specific set_cr4 functions

On emulated VM-entry and VM-exit, update the CPUID bits that reflect
CR4.OSXSAVE and CR4.PKE.

This fixes a bug where the CPUID bits could continue to reflect L2 CR4
values after emulated VM-exit to L1. It also fixes a related bug where
the CPUID bits could continue to reflect L1 CR4 values after emulated
VM-entry to L2. The latter bug is mainly relevant to SVM, wherein
CPUID is not a required intercept. However, it could also be relevant
to VMX, because the code to conditionally update these CPUID bits
assumes that the guest CPUID and the guest CR4 are always in sync.
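
As an illustration of the invariant described above, the sketch below
models a vCPU whose dynamic CPUID bits are refreshed from inside the CR4
setter, so they follow CR4 across emulated VM-entry and VM-exit. It is a
toy model in Python, not KVM code: ToyVcpu, mirror(), set_cr4() and
update_cpuid() are hypothetical names. Only the bit positions
(CR4.OSXSAVE = bit 18, CR4.PKE = bit 22, CPUID.01H:ECX.OSXSAVE = bit 27,
CPUID.(EAX=07H,ECX=0):ECX.OSPKE = bit 4) are taken from the x86
architecture.

```python
# Toy model, not KVM code: class and method names are hypothetical.
CR4_OSXSAVE = 1 << 18            # CR4.OSXSAVE
CR4_PKE = 1 << 22                # CR4.PKE
CPUID_1_ECX_OSXSAVE = 1 << 27    # CPUID.01H:ECX.OSXSAVE
CPUID_7_ECX_OSPKE = 1 << 4       # CPUID.(EAX=07H,ECX=0):ECX.OSPKE


def mirror(word, bit, enabled):
    """Return word with bit set if enabled is truthy, cleared otherwise."""
    return (word | bit) if enabled else (word & ~bit)


class ToyVcpu:
    def __init__(self):
        self.cr4 = 0
        self.cpuid_1_ecx = 0
        self.cpuid_7_ecx = 0

    def update_cpuid(self):
        """Make the dynamic CPUID bits reflect the current CR4."""
        self.cpuid_1_ecx = mirror(self.cpuid_1_ecx, CPUID_1_ECX_OSXSAVE,
                                  self.cr4 & CR4_OSXSAVE)
        self.cpuid_7_ecx = mirror(self.cpuid_7_ecx, CPUID_7_ECX_OSPKE,
                                  self.cr4 & CR4_PKE)

    def set_cr4(self, new_cr4):
        """Single entry point for CR4 writes, including emulated
        VM-entry/VM-exit, so the CPUID update cannot be skipped."""
        changed = self.cr4 ^ new_cr4
        self.cr4 = new_cr4
        if changed & (CR4_OSXSAVE | CR4_PKE):
            self.update_cpuid()
```

Because every CR4 change goes through set_cr4() in this model, switching
from an L1 value with OSXSAVE set to an L2 value with only PKE set
clears the OSXSAVE CPUID bit and sets the OSPKE bit in the same step.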

Fixes: 8eb3f87d903168 ("KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit")
Fixes: 2acf923e38fb6a ("KVM: VMX: Enable XSAVE/XRSTOR for guest")
Fixes: b9baba86148904 ("KVM, pkeys: expose CPUID/CR4 to guest")
Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: Dexuan Cui <dexuan.cui@intel.com>
Cc: Huaitong Han <huaitong.han@intel.com>
Message-Id: <20201029170648.483210-1-jmattson@google.com>

4 months agoselftests: kvm: keep .gitignore add to date
Paolo Bonzini [Fri, 6 Nov 2020 12:39:26 +0000 (07:39 -0500)]
selftests: kvm: keep .gitignore add to date

Add tsc_msrs_test, remove clear_dirty_log_test and alphabetize
everything.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: Add "-c" parameter to dirty log test
Peter Xu [Thu, 1 Oct 2020 01:22:41 +0000 (21:22 -0400)]
KVM: selftests: Add "-c" parameter to dirty log test

It's only used to override the existing dirty ring size/count.  With a
bigger ring count, we test asynchronous collection from the dirty ring.
With a smaller ring count, we test the ring-full code path.  Async is
the default.

It has no use for non-dirty-ring tests.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012241.6208-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: Run dirty ring test asynchronously
Peter Xu [Thu, 1 Oct 2020 01:22:39 +0000 (21:22 -0400)]
KVM: selftests: Run dirty ring test asynchronously

Previously the dirty ring test worked synchronously, because only a
vmexit (here, the ring-full event) guarantees that the hardware dirty
bits have been flushed to the dirty ring.

This patch first introduces a vcpu kick mechanism using SIGUSR1, which
guarantees a vmexit and therefore the flushing of hardware dirty bits.
Once this is in place, the vcpu can keep dirtying pages asynchronously
to the collection procedure.  Still, we need to be very careful: on
reaching the ring buffer soft limit (KVM_EXIT_DIRTY_RING_FULL) we must
collect the dirty bits before continuing the vcpu.

Further increase the dirty ring size to the current maximum to make sure
we exercise the no-ring-full case more, which should be the major
scenario when hypervisors like QEMU use this feature.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012239.6159-1-peterx@redhat.com>
[Use KVM_SET_SIGNAL_MASK+sigwait instead of a signal handler. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: Add dirty ring buffer test
Peter Xu [Thu, 1 Oct 2020 01:22:37 +0000 (21:22 -0400)]
KVM: selftests: Add dirty ring buffer test

Add the initial dirty ring buffer test.

The current test implements the userspace dirty ring collection, by
only reaping the dirty ring when the ring is full.

So it's still running synchronously like this:

            vcpu                             main thread

  1. vcpu dirties pages
  2. vcpu gets dirty ring full
     (userspace exit)

                                       3. main thread waits until full
                                          (so hardware buffers flushed)
                                       4. main thread collects
                                       5. main thread continues vcpu

  6. vcpu continues, goes back to 1

We can't collect dirty bits directly during vcpu execution, because
then we can't guarantee that the hardware dirty bits were flushed when
we collect, and the test is very strict about dirty bits, so the later
verify procedure could fail.  A follow-up patch will make this test
support async collection, just like the existing dirty log test, by
adding a vcpu kick mechanism.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012237.6111-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: Introduce after_vcpu_run hook for dirty log test
Peter Xu [Thu, 1 Oct 2020 01:22:35 +0000 (21:22 -0400)]
KVM: selftests: Introduce after_vcpu_run hook for dirty log test

Provide a hook for the checks after vcpu_run() completes.  Preparation
for the dirty ring test because we'll need to take care of another
exit reason.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012235.6063-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: Don't allocate dirty bitmap if dirty ring is enabled
Peter Xu [Thu, 1 Oct 2020 01:22:26 +0000 (21:22 -0400)]
KVM: Don't allocate dirty bitmap if dirty ring is enabled

Because kvm dirty rings and the kvm dirty log are mutually exclusive,
avoid creating the dirty_bitmap when the kvm dirty ring is enabled.
In the meantime, since the dirty_bitmap will now be created
conditionally, we can't use it as a sign of "whether this memory slot
has dirty tracking enabled".  Change such users to check the kvm
memory slot flags instead.

Note that a kvm memory slot can still get its dirty_bitmap allocated
_if_ the memory slot is created with dirty tracking enabled before the
dirty ring is enabled; such a slot keeps its dirty_bitmap.
However, it should not hurt much (e.g., the bitmaps will always be
freed if they are there), and real users normally won't trigger
this because the dirty tracking flag should in most cases be applied
to kvm slots only right before migration starts, which should be far
later than kvm initialization (VM start).

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012226.5868-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: Make dirty ring exclusive to dirty bitmap log
Peter Xu [Thu, 1 Oct 2020 01:22:24 +0000 (21:22 -0400)]
KVM: Make dirty ring exclusive to dirty bitmap log

There's no good reason to use both the dirty bitmap logging and the
new dirty ring buffer to track dirty bits.  We should even be able to
support both of them at the same time, but that would complicate things
for little benefit.  Let's simply make it the rule, before we enable
dirty ring on any arch, that we don't allow these two interfaces to be
used together.

The big switch is the enablement of the KVM_CAP_DIRTY_LOG_RING
capability.  That's where we switch from the default dirty logging
to the dirty ring.  As long as kvm->dirty_ring_size is set up
correctly, we switch once and for all to the dirty ring buffer mode
for the current virtual machine.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012224.5818-1-peterx@redhat.com>
[Change errno from EINVAL to ENXIO. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: X86: Implement ring-based dirty memory tracking
Peter Xu [Thu, 1 Oct 2020 01:22:22 +0000 (21:22 -0400)]
KVM: X86: Implement ring-based dirty memory tracking

This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory.  These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information.  The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are dirtied from one log-dirty
pass to another.  However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue arises for live migration when the guest memory
is huge while the number of dirtied pages is small.  In that case, for
each dirty sync we need to pull the whole dirty bitmap to userspace
and analyse every bit even if it's mostly zeros.

The preferred data structure for the above scenarios is a dense list of
guest frame numbers (GFN).  This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.
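
To make the trade-off concrete, the sketch below contrasts the two
harvesting strategies. It is a userspace toy in Python under stated
assumptions, not the KVM interface: NPAGES, harvest_bitmap() and
ToyDirtyRing are hypothetical names, and the real ring lives in kernel
memory that is mapped into userspace rather than in a Python list.

```python
NPAGES = 1 << 16  # hypothetical guest size in pages


def harvest_bitmap(bitmap):
    """Visit every bit of the dirty bitmap; cost grows with guest size."""
    dirty = []
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                dirty.append(byte_index * 8 + bit)
    return dirty


class ToyDirtyRing:
    """Bounded list of dirty GFNs; cost grows with the dirty page count."""

    def __init__(self, size):
        self.size = size
        self.gfns = []

    def is_full(self):
        return len(self.gfns) >= self.size

    def push(self, gfn):
        if self.is_full():
            raise BufferError("ring full: harvest before dirtying more")
        self.gfns.append(gfn)

    def harvest(self):
        """Return the recorded GFNs and reset the ring."""
        gfns, self.gfns = self.gfns, []
        return gfns
```

With three dirty pages in a 65536-page guest, harvest_bitmap() still
inspects all 65536 bits, while ToyDirtyRing.harvest() touches only the
three recorded entries. The bound on the ring also matches the
checkpointing case above, where the VM is paused once a pre-defined
number of pages has been dirtied.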

This patch enables dirty ring for X86 only.  However it should be
easily extended to other archs as well.

[1] https://patchwork.kernel.org/patch/10471409/

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012222.5767-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: Pass in kvm pointer into mark_page_dirty_in_slot()
Peter Xu [Thu, 1 Oct 2020 01:20:34 +0000 (21:20 -0400)]
KVM: Pass in kvm pointer into mark_page_dirty_in_slot()

The context will be needed to implement the kvm dirty ring.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012044.5151-5-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: remove kvm_clear_guest_page
Paolo Bonzini [Fri, 6 Nov 2020 10:25:09 +0000 (05:25 -0500)]
KVM: remove kvm_clear_guest_page

kvm_clear_guest_page is not used anymore after "KVM: X86: Don't track dirty
for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]", except from kvm_clear_guest.
We can just inline it in its sole user.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]
Peter Xu [Thu, 1 Oct 2020 01:20:33 +0000 (21:20 -0400)]
KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]

Originally, we had three code paths that can dirty a page without
vcpu context for X86:

  - init_rmode_identity_map
  - init_rmode_tss
  - kvmgt_rw_gpa

init_rmode_identity_map and init_rmode_tss will be set up on the
destination VM no matter what (and the guest cannot even see them), so
it does not make sense to track them at all.

To do this, allow __x86_set_memory_region() to return the userspace
address that was just allocated to the caller.  Then in both of the
functions we directly write to the userspace address instead of
calling kvm_write_*() APIs.

Another trivial change is that we don't need to explicitly clear the
identity page table root in init_rmode_identity_map() because no
matter what we'll write to the whole page with 4M huge page entries.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012044.5151-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: test KVM_GET_SUPPORTED_HV_CPUID as a system ioctl
Vitaly Kuznetsov [Tue, 29 Sep 2020 15:09:44 +0000 (17:09 +0200)]
KVM: selftests: test KVM_GET_SUPPORTED_HV_CPUID as a system ioctl

KVM_GET_SUPPORTED_HV_CPUID is now supported as both vCPU and VM ioctl,
test that.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200929150944.1235688-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: x86: hyper-v: allow KVM_GET_SUPPORTED_HV_CPUID as a system ioctl
Vitaly Kuznetsov [Tue, 29 Sep 2020 15:09:43 +0000 (17:09 +0200)]
KVM: x86: hyper-v: allow KVM_GET_SUPPORTED_HV_CPUID as a system ioctl

KVM_GET_SUPPORTED_HV_CPUID is a vCPU ioctl but its output is now
independent from vCPU and in some cases VMMs may want to use it as a system
ioctl instead. In particular, QEMU does CPU feature expansion before any
vCPU gets created so KVM_GET_SUPPORTED_HV_CPUID can't be used.

Convert KVM_GET_SUPPORTED_HV_CPUID to 'dual' system/vCPU ioctl with the
same meaning.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200929150944.1235688-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agokvm/eventfd: Drain events from eventfd in irqfd_wakeup()
David Woodhouse [Tue, 27 Oct 2020 13:55:23 +0000 (13:55 +0000)]
kvm/eventfd: Drain events from eventfd in irqfd_wakeup()

Don't allow the events to accumulate in the eventfd counter, drain them
as they are handled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027135523.646811-4-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agovfio/virqfd: Drain events from eventfd in virqfd_wakeup()
David Woodhouse [Tue, 27 Oct 2020 13:55:22 +0000 (13:55 +0000)]
vfio/virqfd: Drain events from eventfd in virqfd_wakeup()

Don't allow the events to accumulate in the eventfd counter, drain them
as they are handled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027135523.646811-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
4 months agoeventfd: Export eventfd_ctx_do_read()
David Woodhouse [Tue, 27 Oct 2020 13:55:21 +0000 (13:55 +0000)]
eventfd: Export eventfd_ctx_do_read()

Where events are consumed in the kernel, for example by KVM's
irqfd_wakeup() and VFIO's virqfd_wakeup(), they currently lack a
mechanism to drain the eventfd's counter.

Since the wait queue is already locked while the wakeup functions are
invoked, all they really need to do is call eventfd_ctx_do_read().

Add a check for the lock, and export it for them.
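
The draining pattern can be sketched as follows. This is a simplified
Python model under stated assumptions, not the kernel implementation:
ToyEventfd, signal(), do_read() and the wakeup attribute are
hypothetical names, a threading.Lock stands in for the wait queue lock,
and semaphore-mode eventfds are ignored.

```python
import threading


class ToyEventfd:
    """Counter plus a single in-kernel consumer callback (toy model)."""

    def __init__(self):
        self.lock = threading.Lock()   # stands in for the wait queue lock
        self.count = 0
        self.wakeup = None             # consumer invoked with the lock held

    def signal(self, n=1):
        """Add n events and run the consumer while holding the lock."""
        with self.lock:
            self.count += n
            if self.wakeup is not None:
                self.wakeup(self)

    def do_read(self):
        """Drain the counter; the caller must already hold the lock."""
        if not self.lock.locked():
            raise RuntimeError("do_read() called without the lock held")
        value, self.count = self.count, 0
        return value
```

A consumer that calls do_read() from its wakeup callback sees each
batch of events exactly once and leaves the counter at zero, instead of
letting events accumulate. The lock check mirrors the intent of the
check added by this patch.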

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027135523.646811-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agokvm/eventfd: Use priority waitqueue to catch events before userspace
David Woodhouse [Mon, 26 Oct 2020 17:53:25 +0000 (17:53 +0000)]
kvm/eventfd: Use priority waitqueue to catch events before userspace

As far as I can tell, when we use posted interrupts we silently cut off
the events from userspace, if it's listening on the same eventfd that
feeds the irqfd.

I like that behaviour. Let's do it all the time, even without posted
interrupts. It makes it much easier to handle IRQ remapping invalidation
without having to constantly add/remove the fd from the userspace poll
set. We can just leave userspace polling on it, and the bypass will...
well... bypass it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201026175325.585623-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agosched/wait: Add add_wait_queue_priority()
David Woodhouse [Tue, 27 Oct 2020 14:39:43 +0000 (14:39 +0000)]
sched/wait: Add add_wait_queue_priority()

This allows an exclusive wait_queue_entry to be added at the head of the
queue, instead of the tail as normal. Thus, it gets to consume events
first without allowing non-exclusive waiters to be woken at all.

The (first) intended use is for KVM IRQFD, which currently has
inconsistent behaviour depending on whether posted interrupts are
available or not. If they are, KVM will bypass the eventfd completely
and deliver interrupts directly to the appropriate vCPU. If not, events
are delivered through the eventfd and userspace will receive them when
polling on the eventfd.

By using add_wait_queue_priority(), KVM will be able to consistently
consume events within the kernel without accidentally exposing them
to userspace when they're supposed to be bypassed. This, in turn, means
that userspace doesn't have to jump through hoops to avoid listening
on the erroneously noisy eventfd and injecting duplicate interrupts.
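
A minimal sketch of the wake-up ordering, assuming a much simplified
wait queue: ToyWaitQueue, add(), add_priority() and wake_up() are
hypothetical Python names, callbacks return True when they consume the
event, and locking is left out entirely.

```python
class ToyWaitQueue:
    """List of (callback, exclusive) entries walked from the head."""

    def __init__(self):
        self.entries = []

    def add(self, callback):
        """Ordinary, non-exclusive waiter (e.g. userspace polling)."""
        self.entries.append((callback, False))

    def add_priority(self, callback):
        """Exclusive waiter placed at the head of the queue."""
        self.entries.insert(0, (callback, True))

    def wake_up(self, nr_exclusive=1):
        """Call waiters in order; stop once enough exclusive waiters
        have consumed the event."""
        for callback, exclusive in list(self.entries):
            consumed = callback()
            if consumed and exclusive:
                nr_exclusive -= 1
                if nr_exclusive == 0:
                    break
```

Without a priority entry, a wake-up reaches the userspace poller. Once
an in-kernel consumer is registered with add_priority(), it runs first
and, having consumed the event, stops the walk before the poller is
woken.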

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027143944.648769-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: x86: emulate wait-for-SIPI and SIPI-VMExit
Yadong Qi [Fri, 6 Nov 2020 06:51:22 +0000 (14:51 +0800)]
KVM: x86: emulate wait-for-SIPI and SIPI-VMExit

Background: We have a lightweight HV that needs INIT-VMExit and
SIPI-VMExit to wake up APs for guests, since it does not monitor
the Local APIC. But currently the virtual wait-for-SIPI (WFS) state
is not supported in nVMX, so when running on top of KVM, the L1
HV cannot receive the INIT-VMExit and SIPI-VMExit, and as a result
the L2 guest cannot wake up the APs.

According to Intel SDM Chapter 25.2 Other Causes of VM Exits,
SIPIs cause VM exits when a logical processor is in
wait-for-SIPI state.

In this patch:
    1. introduce SIPI exit reason,
    2. introduce wait-for-SIPI state for nVMX,
    3. advertise wait-for-SIPI support to guest.

When the L1 hypervisor is not monitoring the Local APIC, L0 needs to
emulate INIT-VMExit and SIPI-VMExit to L1 to emulate INIT-SIPI-SIPI for
L2. An L2 LAPIC write is trapped by the L0 hypervisor (KVM), and L0
should emulate the INIT/SIPI vmexit to the L1 hypervisor so that the
proper state is set for L2's vcpu.

Handling procedure:
Source vCPU:
    L2 write LAPIC.ICR(INIT).
    L0 trap LAPIC.ICR write(INIT): inject a latched INIT event to target
       vCPU.
Target vCPU:
    L0 emulate an INIT VMExit to L1 if is guest mode.
    L1 set guest VMCS, guest_activity_state=WAIT_SIPI, vmresume.
    L0 set vcpu.mp_state to INIT_RECEIVED if (vmcs12.guest_activity_state
       == WAIT_SIPI).

Source vCPU:
    L2 write LAPIC.ICR(SIPI).
    L0 trap LAPIC.ICR write(SIPI): inject a latched SIPI event to target
       vCPU.
Target vCPU:
    L0 emulate an SIPI VMExit to L1 if (vcpu.mp_state == INIT_RECEIVED).
    L1 set CS:IP, guest_activity_state=ACTIVE, vmresume.
    L0 resume to L2.
    L2 start-up.
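
The handling procedure above can be sketched as a small state machine
for the target vCPU. This is a toy Python model, not the nVMX code:
ToyTargetVcpu and its methods are hypothetical names, and dropping a
SIPI that arrives outside the wait-for-SIPI state is an assumption of
the sketch.

```python
from enum import Enum


class Activity(Enum):
    ACTIVE = "active"
    WAIT_SIPI = "wait-for-SIPI"


class MpState(Enum):
    RUNNABLE = "runnable"
    INIT_RECEIVED = "init-received"


class ToyTargetVcpu:
    """Target vCPU as seen by L0; starts out running L2 (guest mode)."""

    def __init__(self):
        self.guest_mode = True
        self.activity = Activity.ACTIVE
        self.mp_state = MpState.RUNNABLE
        self.exits_to_l1 = []

    def deliver_init(self):
        """L0 trapped an ICR(INIT) write on the source vCPU."""
        if self.guest_mode:
            self.exits_to_l1.append("INIT")
            self.guest_mode = False

    def deliver_sipi(self, vector):
        """L0 trapped an ICR(SIPI) write; True if L1 got a VM exit."""
        if self.mp_state is not MpState.INIT_RECEIVED:
            return False  # assumption: dropped outside wait-for-SIPI
        self.exits_to_l1.append(("SIPI", vector))
        self.mp_state = MpState.RUNNABLE
        self.guest_mode = False
        return True

    def l1_vmresume(self, activity):
        """L1 wrote guest_activity_state and resumed L2."""
        self.activity = activity
        self.guest_mode = True
        if activity is Activity.WAIT_SIPI:
            self.mp_state = MpState.INIT_RECEIVED
        else:
            self.mp_state = MpState.RUNNABLE
```

Driving the model with deliver_init(), l1_vmresume(WAIT_SIPI),
deliver_sipi(vector) and l1_vmresume(ACTIVE) reproduces the sequence
listed above: L1 sees one INIT exit followed by one SIPI exit, and the
target vCPU ends up running L2 again.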

Signed-off-by: Yadong Qi <yadong.qi@intel.com>
Message-Id: <20200922052343.84388-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20201106065122.403183-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: x86: fix apic_accept_events vs check_nested_events
Paolo Bonzini [Thu, 5 Nov 2020 16:20:49 +0000 (11:20 -0500)]
KVM: x86: fix apic_accept_events vs check_nested_events

vmx_apic_init_signal_blocked is buggy in that it returns true
even in VMX non-root mode.  In non-root mode, however, INITs
are not latched, they just cause a vmexit.  Previously,
KVM was waiting for them to be processed when kvm_apic_accept_events
ran, and in the meantime it ate the SIPIs that the processor received.

However, in order to implement the wait-for-SIPI activity state,
KVM will have to process KVM_APIC_SIPI in vmx_check_nested_events,
and it will not be possible anymore to disregard SIPIs in non-root
mode as the code is currently doing.

By calling kvm_x86_ops.nested_ops->check_events, we can force a vmexit
(with the side-effect of latching INITs) before incorrectly injecting
an INIT or SIPI in a guest, and therefore vmx_apic_init_signal_blocked
can do the right thing.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: selftests: Verify supported CR4 bits can be set before KVM_SET_CPUID2
Sean Christopherson [Wed, 7 Oct 2020 01:44:17 +0000 (18:44 -0700)]
KVM: selftests: Verify supported CR4 bits can be set before KVM_SET_CPUID2

Extend the KVM_SET_SREGS test to verify that all supported CR4 bits, as
enumerated by KVM, can be set before KVM_SET_CPUID2, i.e. without first
defining the vCPU model.  KVM is supposed to skip guest CPUID checks
when host userspace is stuffing guest state.

Check the inverse as well, i.e. that KVM rejects KVM_SET_SREGS if CR4
has one or more unsupported bits set.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: x86: Return bool instead of int for CR4 and SREGS validity checks
Sean Christopherson [Wed, 7 Oct 2020 01:44:16 +0000 (18:44 -0700)]
KVM: x86: Return bool instead of int for CR4 and SREGS validity checks

Rework the common CR4 and SREGS checks to return a bool instead of an
int, i.e. true/false instead of 0/-EINVAL, and add "is" to the name to
clarify the polarity of the return value (which is effectively inverted
by this change).
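
The polarity change can be illustrated with a short sketch. The Python
below is a toy model, not the KVM helpers: valid_cr4_old(),
is_valid_cr4() and set_sregs() are hypothetical names, and CR4.VMXE
(bit 13) is used only as an example of a reserved bit.

```python
import errno

CR4_VMXE = 1 << 13  # CR4.VMXE, used here as an example reserved bit


def valid_cr4_old(cr4, reserved_bits):
    """Old style: 0 on success, -EINVAL on failure."""
    return -errno.EINVAL if cr4 & reserved_bits else 0


def is_valid_cr4(cr4, reserved_bits):
    """New style: True when CR4 is acceptable, False otherwise."""
    return not (cr4 & reserved_bits)


def set_sregs(cr4, reserved_bits):
    """Hypothetical caller written against the bool-returning helper."""
    if not is_valid_cr4(cr4, reserved_bits):
        return -errno.EINVAL
    return 0
```

With the old helper a truthy return value meant failure, whereas with
the "is" variant a truthy value means the register is valid, so callers
test "if not is_valid_cr4(...)" rather than "if valid_cr4_old(...)".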

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook
Sean Christopherson [Wed, 7 Oct 2020 01:44:15 +0000 (18:44 -0700)]
KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook

Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
and invoke the new hook from kvm_valid_cr4().  This fixes an issue where
KVM_SET_SREGS would return success while failing to actually set CR4.

Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
in __set_sregs() is not a viable option as KVM has already stuffed a
variety of vCPU state.

Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
inverted semantics.  This will be remedied in a future patch.

Fixes: 5e1746d6205d ("KVM: nVMX: Allow setting the VMXE bit in CR4")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: SVM: Drop VMXE check from svm_set_cr4()
Sean Christopherson [Wed, 7 Oct 2020 01:44:14 +0000 (18:44 -0700)]
KVM: SVM: Drop VMXE check from svm_set_cr4()

Drop svm_set_cr4()'s explicit check on CR4.VMXE now that common x86 handles
the check by incorporating VMXE into the CR4 reserved bits, via
kvm_cpu_caps.  SVM obviously does not set X86_FEATURE_VMX.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: VMX: Drop explicit 'nested' check from vmx_set_cr4()
Sean Christopherson [Wed, 7 Oct 2020 01:44:13 +0000 (18:44 -0700)]
KVM: VMX: Drop explicit 'nested' check from vmx_set_cr4()

Drop vmx_set_cr4()'s explicit check on the 'nested' module param now
that common x86 handles the check by incorporating VMXE into the CR4
reserved bits, via kvm_cpu_caps.  X86_FEATURE_VMX is set in kvm_cpu_caps
(by vmx_set_cpu_caps()), if and only if 'nested' is true.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: VMX: Drop guest CPUID check for VMXE in vmx_set_cr4()
Sean Christopherson [Wed, 7 Oct 2020 01:44:12 +0000 (18:44 -0700)]
KVM: VMX: Drop guest CPUID check for VMXE in vmx_set_cr4()

Drop vmx_set_cr4()'s somewhat hidden guest_cpuid_has() check on VMXE now
that common x86 handles the check by incorporating VMXE into the CR4
reserved bits, i.e. in cr4_guest_rsvd_bits.  This fixes a bug where KVM
incorrectly rejects KVM_SET_SREGS with CR4.VMXE=1 if it's executed
before KVM_SET_CPUID{,2}.

Fixes: 5e1746d6205d ("KVM: nVMX: Allow setting the VMXE bit in CR4")
Reported-by: Stas Sergeev <stsp@users.sourceforge.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agokvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use
Paolo Bonzini [Sun, 15 Nov 2020 13:55:43 +0000 (08:55 -0500)]
kvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use

In some cases where shadow paging is in use, the root page will
be either mmu->pae_root or vcpu->arch.mmu->lm_root.  Then it will
not have an associated struct kvm_mmu_page, because it is allocated
with alloc_page instead of kvm_mmu_alloc_page.

Just return false quickly from is_tdp_mmu_root if the TDP MMU is
not in use, which also includes the case where shadow paging is
enabled.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: SVM: Update cr3_lm_rsvd_bits for AMD SEV guests
Babu Moger [Thu, 12 Nov 2020 22:18:03 +0000 (16:18 -0600)]
KVM: SVM: Update cr3_lm_rsvd_bits for AMD SEV guests

For AMD SEV guests, update the cr3_lm_rsvd_bits to mask
the memory encryption bit in reserved bits.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <160521948301.32054.5783800787423231162.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch
Babu Moger [Thu, 12 Nov 2020 22:17:56 +0000 (16:17 -0600)]
KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch

SEV guests fail to boot on a system that supports the PCID feature.

While emulating the RSM instruction, KVM reads the guest CR3
and calls kvm_set_cr3(). If the vCPU is in long mode,
kvm_set_cr3() does a sanity check for the CR3 value. In this case,
it validates whether the value has any reserved bits set. The
reserved bit range is 63:cpuid_maxphyaddr(). When AMD memory
encryption is enabled, the memory encryption bit is set in the CR3
value. The memory encryption bit may fall within the KVM reserved
bit range, causing the KVM emulation failure.

Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will
cache the reserved bits in the CR3 value. This will be initialized
to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).

If the architecture has any special bits (like the AMD SEV encryption bit)
that need to be masked from the reserved bits, they should be cleared
in the vendor-specific kvm_x86_ops.vcpu_after_set_cpuid handler.
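
As a worked example of the mask arithmetic, the sketch below builds the
reserved-bit mask and then clears an encryption bit from it. It is a
Python illustration under stated assumptions: the rsvd_bits() here is a
re-implementation for demonstration, cr3_is_valid() is a hypothetical
helper, and MAXPHYADDR = 43 with the encryption bit at position 47 are
example values, not values read from real hardware.

```python
def rsvd_bits(start, end):
    """Mask with bits start..end (inclusive) set."""
    if end < start:
        return 0
    return ((1 << (end - start + 1)) - 1) << start


def cr3_is_valid(cr3, cr3_lm_rsvd_bits):
    """A CR3 value is acceptable if it sets no reserved bit."""
    return (cr3 & cr3_lm_rsvd_bits) == 0


MAXPHYADDR = 43     # example guest physical address width
C_BIT = 1 << 47     # example memory encryption bit position

# Default cached mask: every bit from MAXPHYADDR up to 63 is reserved.
cr3_lm_rsvd_bits = rsvd_bits(MAXPHYADDR, 63)

# Vendor hook (sketch): the encryption bit must not count as reserved.
sev_cr3_lm_rsvd_bits = cr3_lm_rsvd_bits & ~C_BIT
```

With these example values, a guest CR3 that carries the encryption bit
is rejected by the default mask but accepted once the vendor hook has
cleared that bit from the cached mask.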

Fixes: a780a3ea628268b2 ("KVM: X86: Fix reserved bits check for MOV to CR3")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: x86: clflushopt should be treated as a no-op by emulation
David Edmondson [Tue, 3 Nov 2020 12:04:00 +0000 (12:04 +0000)]
KVM: x86: clflushopt should be treated as a no-op by emulation

The instruction emulator ignores clflush instructions, yet fails to
support clflushopt. Treat both similarly.

Fixes: 13e457e0eebf ("KVM: x86: Emulator does not decode clflush well")
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20201103120400.240882-1-david.edmondson@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoMerge tag 'kvmarm-fixes-5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Paolo Bonzini [Fri, 13 Nov 2020 11:28:23 +0000 (06:28 -0500)]
Merge tag 'kvmarm-fixes-5.10-3' of git://git./linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for v5.10, take #3

- Allow userspace to downgrade ID_AA64PFR0_EL1.CSV2
- Inject UNDEF on SCXTNUM_ELx access

4 months agoMerge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Linus Torvalds [Fri, 13 Nov 2020 00:39:58 +0000 (16:39 -0800)]
Merge tag 'fscrypt-for-linus' of git://git./fs/fscrypt/fscrypt

Pull fscrypt fix from Eric Biggers:
 "Fix a regression where new files weren't using inline encryption when
  they should be"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: fix inline encryption not used on new files

4 months agoMerge tag 'gfs2-v5.10-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 13 Nov 2020 00:37:14 +0000 (16:37 -0800)]
Merge tag 'gfs2-v5.10-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:
 "Fix jdata data corruption and glock reference leak"

* tag 'gfs2-v5.10-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix case in which ail writes are done to jdata holes
  Revert "gfs2: Ignore journal log writes for jdata holes"
  gfs2: fix possible reference leak in gfs2_check_blk_type

4 months agoMerge tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 12 Nov 2020 22:02:04 +0000 (14:02 -0800)]
Merge tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Current release - regressions:

   - arm64: dts: fsl-ls1028a-kontron-sl28: specify in-band mode for
     ENETC

  Current release - bugs in new features:

   - mptcp: provide rmem[0] limit offset to fix oops

  Previous release - regressions:

   - IPv6: Set SIT tunnel hard_header_len to zero to fix path MTU
     calculations

   - lan743x: correctly handle chips with internal PHY

   - bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE

   - mlx5e: Fix VXLAN port table synchronization after function reload

  Previous release - always broken:

   - bpf: Zero-fill re-used per-cpu map element

   - fix out-of-order UDP packets when forwarding with UDP GSO fraglists
     turned on:
       - fix UDP header access on Fast/frag0 UDP GRO
       - fix IP header access and skb lookup on Fast/frag0 UDP GRO

   - ethtool: netlink: add missing netdev_features_change() call

   - net: Update window_clamp if SOCK_RCVBUF is set

   - igc: Fix returning wrong statistics

   - ch_ktls: fix multiple leaks and corner cases in Chelsio TLS offload

   - tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 replies

   - r8169: disable hw csum for short packets on all chip versions

   - vrf: Fix fast path output packet handling with async Netfilter
     rules"

* tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
  lan743x: fix use of uninitialized variable
  net: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO
  net: udp: fix UDP header access on Fast/frag0 UDP GRO
  devlink: Avoid overwriting port attributes of registered port
  vrf: Fix fast path output packet handling with async Netfilter rules
  cosa: Add missing kfree in error path of cosa_write
  net: switch to the kernel.org patchwork instance
  ch_ktls: stop the txq if reaches threshold
  ch_ktls: tcb update fails sometimes
  ch_ktls/cxgb4: handle partial tag alone SKBs
  ch_ktls: don't free skb before sending FIN
  ch_ktls: packet handling prior to start marker
  ch_ktls: Correction in middle record handling
  ch_ktls: missing handling of header alone
  ch_ktls: Correction in trimmed_len calculation
  cxgb4/ch_ktls: creating skbs causes panic
  ch_ktls: Update cheksum information
  ch_ktls: Correction in finding correct length
  cxgb4/ch_ktls: decrypted bit is not enough
  net/x25: Fix null-ptr-deref in x25_connect
  ...

4 months agoMerge tag 'nfs-for-5.10-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Linus Torvalds [Thu, 12 Nov 2020 21:49:12 +0000 (13:49 -0800)]
Merge tag 'nfs-for-5.10-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Stable fixes:
  - Fix failure to unregister shrinker

  Other fixes:
  - Fix unnecessary locking to clear up some contention
  - Fix listxattr receive buffer size
  - Fix default mount options for nfsroot"

* tag 'nfs-for-5.10-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Remove unnecessary inode lock in nfs_fsync_dir()
  NFS: Remove unnecessary inode locking in nfs_llseek_dir()
  NFS: Fix listxattr receive buffer size
  NFSv4.2: fix failure to unregister shrinker
  nfsroot: Default mount option should ask for built-in NFS version

4 months agoKVM: arm64: Handle SCXTNUM_ELx traps
Marc Zyngier [Tue, 10 Nov 2020 14:13:08 +0000 (14:13 +0000)]
KVM: arm64: Handle SCXTNUM_ELx traps

As the kernel never sets HCR_EL2.EnSCXT, accesses to SCXTNUM_ELx
will trap to EL2. Let's handle that as gracefully as possible
by injecting an UNDEF exception into the guest. This is consistent
with the guest's view of ID_AA64PFR0_EL1.CSV2 being at most 1.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20201110141308.451654-4-maz@kernel.org
4 months agoKVM: arm64: Unify trap handlers injecting an UNDEF
Marc Zyngier [Tue, 10 Nov 2020 14:13:07 +0000 (14:13 +0000)]
KVM: arm64: Unify trap handlers injecting an UNDEF

A large number of system register trap handlers only inject an
UNDEF exception, and yet each class of sysreg seems to provide its
own, identical function.

Let's unify them all, saving us introducing yet another one later.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20201110141308.451654-3-maz@kernel.org
4 months agoKVM: arm64: Allow setting of ID_AA64PFR0_EL1.CSV2 from userspace
Marc Zyngier [Tue, 10 Nov 2020 14:13:06 +0000 (14:13 +0000)]
KVM: arm64: Allow setting of ID_AA64PFR0_EL1.CSV2 from userspace

We now expose ID_AA64PFR0_EL1.CSV2=1 to guests running on hosts
that are immune to Spectre-v2, but that don't have this field set,
most likely because they predate the specification.

However, this prevents the migration of guests that have started on
a host that doesn't fake this CSV2 setting to one that does, as KVM
rejects the write to ID_AA64PFR0_EL1 on the grounds that it isn't
what is already there.

In order to fix this, allow userspace to set this field as long as
this doesn't result in promising more than what is already there
(setting CSV2 to 0 is acceptable, but setting it to 1 when it is
already set to 0 isn't).
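
The rule above can be sketched roughly as follows. This is a simplified
userspace model, not the KVM implementation: CSV2 occupies bits [59:56]
of ID_AA64PFR0_EL1, while set_pfr0_csv2() and its host_csv2 parameter
(standing in for "what is already there") are hypothetical names chosen
for illustration.

```c
#include <errno.h>
#include <stdint.h>

#define CSV2_SHIFT 56
#define CSV2_MASK  (0xfULL << CSV2_SHIFT)

/* Extract the CSV2 field from an ID_AA64PFR0_EL1 value. */
static unsigned int csv2_of(uint64_t pfr0)
{
	return (unsigned int)((pfr0 & CSV2_MASK) >> CSV2_SHIFT);
}

/*
 * Accept a userspace write only if it does not promise more than the
 * host-derived value (host_csv2 is an assumption of this sketch).
 * Downgrading is allowed, upgrading past host_csv2 is rejected.
 */
static int set_pfr0_csv2(uint64_t *guest_pfr0, unsigned int host_csv2,
			 uint64_t requested)
{
	if (csv2_of(requested) > host_csv2)
		return -EINVAL;

	*guest_pfr0 = (*guest_pfr0 & ~CSV2_MASK) | (requested & CSV2_MASK);
	return 0;
}
```

In this model, a guest value with CSV2=1 can be written back as CSV2=0,
but a request for CSV2=1 against a host-derived value of 0 fails with
-EINVAL and leaves the guest value untouched.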

Fixes: e1026237f9067 ("KVM: arm64: Set CSV2 for guests on hardware unaffected by Spectre-v2")
Reported-by: Peng Liang <liangpeng10@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20201110141308.451654-2-maz@kernel.org
4 months agoMerge tag 'v5.10-rc1' into kvmarm-master/next
Marc Zyngier [Thu, 12 Nov 2020 21:20:43 +0000 (21:20 +0000)]
Merge tag 'v5.10-rc1' into kvmarm-master/next

Linux 5.10-rc1

Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoMerge tag 'acpi-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Thu, 12 Nov 2020 19:06:53 +0000 (11:06 -0800)]
Merge tag 'acpi-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These are mostly documentation fixes and janitorial changes plus some
  new device IDs and a new quirk.

  Specifics:

   - Fix documentation regarding GPIO properties (Andy Shevchenko)

   - Fix spelling mistakes in ACPI documentation (Flavio Suligoi)

   - Fix white space inconsistencies in ACPI code (Maximilian Luz)

   - Fix string formatting in the ACPI Generic Event Device (GED) driver
     (Nick Desaulniers)

   - Add Intel Alder Lake device IDs to the ACPI drivers used by the
     Dynamic Platform and Thermal Framework (Srinivas Pandruvada)

   - Add lid-related DMI quirk for Medion Akoya E2228T to the ACPI
     button driver (Hans de Goede)"

* tag 'acpi-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: DPTF: Support Alder Lake
  Documentation: ACPI: fix spelling mistakes
  ACPI: button: Add DMI quirk for Medion Akoya E2228T
  ACPI: GED: fix -Wformat
  ACPI: Fix whitespace inconsistencies
  ACPI: scan: Fix acpi_dma_configure_id() kerneldoc name
  Documentation: firmware-guide: gpio-properties: Clarify initial output state
  Documentation: firmware-guide: gpio-properties: active_low only for GpioIo()
  Documentation: firmware-guide: gpio-properties: Fix factual mistakes

4 months agoMerge tag 'pm-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Thu, 12 Nov 2020 19:03:38 +0000 (11:03 -0800)]
Merge tag 'pm-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "Make the intel_pstate driver behave as expected when it operates in
  the passive mode with HWP enabled and the 'powersave' governor on top
  of it"

* tag 'pm-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Take CPUFREQ_GOV_STRICT_TARGET into account
  cpufreq: Add strict_target to struct cpufreq_policy
  cpufreq: Introduce CPUFREQ_GOV_STRICT_TARGET
  cpufreq: Introduce governor flags

4 months agolan743x: fix use of uninitialized variable
Sven Van Asbroeck [Thu, 12 Nov 2020 15:25:13 +0000 (10:25 -0500)]
lan743x: fix use of uninitialized variable

When no devicetree is present, the driver will use an
uninitialized variable.

Fix by initializing this variable.

Fixes: 902a66e08cea ("lan743x: correctly handle chips with internal PHY")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
Link: https://lore.kernel.org/r/20201112152513.1941-1-TheSven73@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge branch 'net-udp-fix-fast-frag0-udp-gro'
Jakub Kicinski [Thu, 12 Nov 2020 17:55:58 +0000 (09:55 -0800)]
Merge branch 'net-udp-fix-fast-frag0-udp-gro'

Alexander Lobakin says:

====================
net: udp: fix Fast/frag0 UDP GRO

While testing UDP GSO fraglists forwarding through a driver that uses
Fast GRO (via napi_gro_frags()), I was observing lots of out-of-order
iperf packets:

[ ID] Interval           Transfer     Bitrate         Jitter
[SUM]  0.0-40.0 sec  12106 datagrams received out-of-order

Simple switch to napi_gro_receive() or any other method without frag0
shortcut completely resolved them.

I've found two incorrect header accesses in GRO receive callback(s):
 - udp_hdr() (instead of udp_gro_udphdr()) that always points to junk
   in "fast" mode and could probably do this in "regular".
   This was the actual bug that caused all out-of-order deliveries;
 - udp{4,6}_lib_lookup_skb() -> ip{,v6}_hdr() (instead of
   skb_gro_network_header()) that potentially might return odd
   pointers in both modes.

Each patch addresses one of these two issues.

This doesn't cover support for nested tunnels, as that is out of
scope and requires more invasive changes. It will be handled
separately in a net-next series.

Credits:
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>

Since v4 [0]:
 - split the fix into two logical ones (Willem);
 - replace ternaries with plain ifs to beautify the code (Jakub);
 - drop p->data part to reintroduce it later in abovementioned set.

Since v3 [1]:
 - restore the original {,__}udp{4,6}_lib_lookup_skb() and use
   private versions of them inside GRO code (Willem).

Since v2 [2]:
 - dropped redundant check introduced in v2 as it's performed right
   before (thanks to Eric);
 - udp_hdr() switched to data + off for skbs from list (also Eric);
 - fixed possible malfunction of {,__}udp{4,6}_lib_lookup_skb() with
   Fast/frag0 due to ip{,v6}_hdr() usage (Willem).

Since v1 [3]:
 - added a NULL pointer check for "uh" as suggested by Willem.

[0] https://lore.kernel.org/netdev/Ha2hou5eJPcblo4abjAqxZRzIl1RaLs2Hy0oOAgFs@cp4-web-036.plabs.ch
[1] https://lore.kernel.org/netdev/MgZce9htmEtCtHg7pmWxXXfdhmQ6AHrnltXC41zOoo@cp7-web-042.plabs.ch
[2] https://lore.kernel.org/netdev/0eaG8xtbtKY1dEKCTKUBubGiC9QawGgB3tVZtNqVdY@cp4-web-030.plabs.ch
[3] https://lore.kernel.org/netdev/YazU6GEzBdpyZMDMwJirxDX7B4sualpDG68ADZYvJI@cp4-web-034.plabs.ch
====================

Link: https://lore.kernel.org/r/hjGOh0iCOYyo1FPiZh6TMXcx3YCgNs1T1eGKLrDz8@cp4-web-037.plabs.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO
Alexander Lobakin [Wed, 11 Nov 2020 20:45:38 +0000 (20:45 +0000)]
net: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO

udp{4,6}_lib_lookup_skb() use ip{,v6}_hdr() to get IP header of the
packet. While it's probably OK for non-frag0 paths, these helpers
will also point to junk on Fast/frag0 GRO when all headers are
located in frags. As a result, sk/skb lookup may fail or give wrong
results. To support both GRO modes, skb_gro_network_header() might
be used. To not modify original functions, add private versions of
udp{4,6}_lib_lookup_skb() only to perform correct sk lookups on GRO.

Present since the introduction of "application-level" UDP GRO
in 4.7-rc1.

Misc: replace totally unneeded ternaries with plain ifs.

Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket")
Suggested-by: Willem de Bruijn <willemb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: udp: fix UDP header access on Fast/frag0 UDP GRO
Alexander Lobakin [Wed, 11 Nov 2020 20:45:25 +0000 (20:45 +0000)]
net: udp: fix UDP header access on Fast/frag0 UDP GRO

UDP GRO uses udp_hdr(skb) in its .gro_receive() callback. While it's
probably OK for non-frag0 paths (when all headers or even the entire
frame are already in skb head), this inline points to junk when
using Fast GRO (napi_gro_frags() or napi_gro_receive() with only
Ethernet header in skb head and all the rest in the frags) and breaks
GRO packet compilation and the packet flow itself.
To support both modes, skb_gro_header_fast() + skb_gro_header_slow()
are typically used. UDP even has an inline helper that makes use of
them, udp_gro_udphdr(). Use that instead of the troublemaking udp_hdr()
to get rid of the out-of-order deliveries.
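
The difference between the two access styles can be pictured with a
small userspace model. It is only a sketch: struct pkt_model, naive_hdr()
and gro_hdr() are hypothetical names, offsets are counted from the start
of the frame, and the real slow path (which pulls data into the skb head)
is reduced here to a plain fallback onto the linear area.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a packet: a linear head plus one fragment. */
struct pkt_model {
	uint8_t head[64];     /* linear area */
	size_t head_len;      /* valid bytes in the linear area */
	const uint8_t *frag0; /* first fragment, or NULL */
	size_t frag0_len;     /* valid bytes in frag0 */
};

/* Head-only access: ignores where the header bytes really live. */
static const uint8_t *naive_hdr(const struct pkt_model *p, size_t off)
{
	return p->head + off;
}

/*
 * GRO-style access: use frag0 when the header fits there (fast path),
 * otherwise fall back to the linear area, otherwise report failure.
 */
static const uint8_t *gro_hdr(const struct pkt_model *p, size_t off,
			      size_t hlen)
{
	if (p->frag0 && off + hlen <= p->frag0_len)
		return p->frag0 + off;
	if (off + hlen <= p->head_len)
		return p->head + off;
	return NULL;
}
```

When the whole frame sits in frag0 and the head is empty, naive_hdr()
returns a pointer into unused head space, while gro_hdr() returns the
actual header bytes; with a fully linear packet both agree.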

Present since the introduction of plain UDP GRO in 5.0-rc1.

Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agogfs2: Fix case in which ail writes are done to jdata holes
Bob Peterson [Thu, 12 Nov 2020 16:02:48 +0000 (10:02 -0600)]
gfs2: Fix case in which ail writes are done to jdata holes

Patch b2a846dbef4e ("gfs2: Ignore journal log writes for jdata holes")
tried (unsuccessfully) to fix a case in which writes were done to jdata
blocks, the blocks were sent to the ail list, and then a punch_hole or
truncate operation caused the blocks to be freed. In other words, the ail
items are for jdata holes. Before b2a846dbef4e, the jdata hole caused
function gfs2_block_map to return -EIO, which was eventually interpreted
as an IO error to the journal, followed by a withdraw.

This patch changes function gfs2_get_block_noalloc, which is only used
for jdata writes, so it returns -ENODATA rather than -EIO, and when
-ENODATA is returned to gfs2_ail1_start_one, the error is ignored.
We can safely ignore it because gfs2_ail1_start_one is only called
when the jdata pages have already been written and truncated, so the
ail1 content no longer applies.
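
The error handling described above boils down to filtering one specific
return code in the caller. A minimal sketch, assuming a hypothetical
ail1_start_one_model() that receives the result of the jdata writeback
(this is not the gfs2 code itself):

```c
#include <errno.h>

/*
 * Model of the caller-side policy: -ENODATA means the block became a
 * hole after it was queued, so the stale ail1 item is ignored. Every
 * other error is still passed up unchanged.
 */
static int ail1_start_one_model(int writeback_err)
{
	if (writeback_err == -ENODATA)
		return 0;
	return writeback_err;
}
```

Keeping the filter in the ail1 path only means that other callers still
see holes the way they did before, and a genuine -EIO is still reported.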

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
4 months agoRevert "gfs2: Ignore journal log writes for jdata holes"
Bob Peterson [Wed, 11 Nov 2020 17:09:55 +0000 (11:09 -0600)]
Revert "gfs2: Ignore journal log writes for jdata holes"

This reverts commit b2a846dbef4ef54ef032f0f5ee188c609a0278a7.

That commit changed the behavior of function gfs2_block_map to return
-ENODATA in cases where a hole (IOMAP_HOLE) is encountered and create is
false.  While that fixed the intended problem for jdata, it also broke
other callers of gfs2_block_map such as some jdata block reads.  Before
the patch, an encountered hole would be skipped and the buffer seen as
unmapped by the caller.  The patch changed the behavior to return
-ENODATA, which is interpreted as an error by the caller.

The -ENODATA return code should be restricted to the specific case where
jdata holes are encountered during ail1 writes.  That will be done in a
later patch.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
4 months agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Jakub Kicinski [Thu, 12 Nov 2020 16:47:22 +0000 (08:47 -0800)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2020-11-10

This series contains updates to i40e and igc drivers and the MAINTAINERS
file.

Slawomir fixes updating VF MAC addresses to fix various issues related
to reporting and setting of these addresses for i40e.

Dan Carpenter fixes a possible use-before-initialization issue for
i40e.

Vinicius fixes reporting of netdev stats for igc.

Tony updates repositories for Intel Ethernet Drivers.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  MAINTAINERS: Update repositories for Intel Ethernet Drivers
  igc: Fix returning wrong statistics
  i40e, xsk: uninitialized variable in i40e_clean_rx_irq_zc()
  i40e: Fix MAC address setting for a VF via Host/VM
====================

Link: https://lore.kernel.org/r/20201111001955.533210-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agodevlink: Avoid overwriting port attributes of registered port
Parav Pandit [Wed, 11 Nov 2020 03:47:44 +0000 (05:47 +0200)]
devlink: Avoid overwriting port attributes of registered port

The commit cited in the Fixes tag overwrites the port attributes of an
already registered port.

Avoid such an error by checking the registered flag before setting attributes.
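
The guard can be sketched as below. The structures and the
port_attrs_set_model() function are hypothetical simplifications made up
for this illustration; only the idea of testing the registered flag
before touching the attributes comes from the description above.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-ins for the devlink port structures. */
struct port_attrs_model {
	int flavour;
	unsigned int port_number;
};

struct port_model {
	bool registered;
	struct port_attrs_model attrs;
};

/*
 * Set the attributes only while the port is still unregistered.
 * Returns true if the attributes were applied, false if the call was
 * refused because the port is already registered.
 */
static bool port_attrs_set_model(struct port_model *port,
				 const struct port_attrs_model *attrs)
{
	if (port->registered)
		return false;

	port->attrs = *attrs;
	return true;
}
```

Checking the flag first, before any field is written, is what keeps the
attributes of a registered port intact on a late call.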

Fixes: 71ad8d55f8e5 ("devlink: Replace devlink_port_attrs_set parameters with a struct")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20201111034744.35554-1-parav@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>